July 5, 2024
AI Agents Used to Explain Other AI Systems

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have introduced a new approach that uses AI models to experiment on and explain the behavior of other AI systems. The method relies on automated interpretability agents (AIAs) built from pretrained language models, which produce intuitive explanations of the computations inside trained networks.

The AIAs are designed to mimic the experimental processes of scientists, allowing them to plan and execute tests on computational systems and generate explanations in various forms. These explanations include language descriptions of system behavior and failures, as well as code that replicates the system’s operations.
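
For intuition, the sketch below shows what such an agent’s experimental loop might look like in code. It is a minimal illustration, not the CSAIL implementation: the function names are hypothetical, and the random input-selection step stands in for the language model’s reasoning about which experiment to run next.

```python
import random

def interpret_black_box(f, candidate_inputs, n_rounds=5):
    """Probe a black-box function with chosen inputs and record observations,
    which an agent would then summarize into an explanation."""
    observations = []
    for _ in range(n_rounds):
        # In an actual agent, the next input would be chosen by a language model
        # reasoning over earlier observations; here we simply sample at random.
        x = random.choice(candidate_inputs)
        observations.append((x, f(x)))
    # An AIA would pass these observations to a language model and ask it to
    # propose a description of f, or code that replicates f's behavior.
    return observations

# Example: probing a hidden piecewise function.
hidden = lambda x: abs(x) if x > -5 else 0
print(interpret_black_box(hidden, candidate_inputs=list(range(-10, 11))))
```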

The researchers have also developed a function interpretation and description (FIND) benchmark to evaluate the quality of descriptions of real-world network components. FIND consists of a variety of functions that resemble computations inside trained networks, along with descriptions of their behavior. The benchmark enables the comparison of AIAs with other interpretability methods.
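
To make the idea concrete, a benchmark entry of this kind can be thought of as an executable function paired with a reference description of its behavior. The sketch below uses an assumed, illustrative schema; it is not FIND’s actual data format.

```python
import math

# Hypothetical shape of one benchmark entry: an executable function paired
# with a reference description of its behavior. The field names here are
# illustrative, not FIND's actual schema.
benchmark_entry = {
    "function": lambda x: math.sin(x) if x >= 0 else 0.0,
    "description": "Returns sin(x) for non-negative inputs and 0 elsewhere.",
    "domain": (-10.0, 10.0),
}

print(benchmark_entry["function"](1.5))   # behaves like sin on x >= 0
print(benchmark_entry["description"])     # the ground-truth description to evaluate against
```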

One of the challenges in evaluating descriptions is the lack of ground-truth labels or descriptions of learned computations. FIND addresses this by pairing each function with a reference description of its behavior, giving interpretability procedures a reliable standard: the explanations they produce can be assessed directly against the benchmark’s descriptions.

The AIAs have the potential to surface behaviors that might be difficult for scientists to detect. Language models equipped with tools for probing other systems can design experiments effectively, enabling a deeper understanding of complex AI systems.

The researchers at CSAIL recognize that interpretability is a multifaceted field, and existing approaches are specific to individual questions and modalities like vision or language. Interpretability agents built from language models, however, provide a general interface for explaining other systems, integrating results across experiments and different modalities.

As the models used to generate explanations become black boxes themselves, external evaluations of interpretability methods become increasingly important. The researchers’ benchmark addresses this need by providing a suite of functions modeled after real-world behavior. The benchmark adds complexity to simple functions by incorporating noise, composing functions, and simulating biases, allowing for a realistic evaluation of interpretability methods.
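
The snippet below illustrates what those three kinds of transformation could look like on a toy function; it is an assumed sketch of the idea, not the benchmark’s actual construction code.

```python
import random

def base_fn(x):
    # A simple, cleanly describable function.
    return 2 * x + 1

def add_noise(f, sigma=0.5):
    # Corrupt outputs with Gaussian noise so the behavior is only approximately clean.
    return lambda x: f(x) + random.gauss(0.0, sigma)

def compose(f, g):
    # Chain two functions so the overall computation is harder to read off.
    return lambda x: g(f(x))

def bias_region(f, threshold=0.0):
    # Simulate a bias: the function behaves differently on part of its domain.
    return lambda x: f(x) if x >= threshold else 0.0

harder_fn = bias_region(compose(add_noise(base_fn), lambda y: y ** 2))
print([round(harder_fn(x), 2) for x in (-2, -1, 0, 1, 2)])
```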

The effectiveness of the AIAs and existing interpretability methods is evaluated using an innovative protocol. For tasks that require replicating functions in code, the AI-generated estimates are compared against the ground-truth functions. For tasks involving natural language descriptions of functions, the researchers developed a specialized language model that scores the accuracy and coherence of the AI-generated descriptions.
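
A simple way to picture the code-replication comparison is to sample inputs from the function’s domain and measure how far the estimate’s outputs fall from the ground truth’s. The metric and sampling scheme below are assumptions for illustration, not the published protocol, and the language-model judge used for natural-language descriptions is not shown.

```python
import random

def replication_score(ground_truth, estimate, domain=(-10.0, 10.0), n_samples=1000):
    """Compare a code-based interpretation with the ground-truth function by
    measuring mean absolute error over inputs sampled from the domain.
    Lower is better; the metric itself is an illustrative assumption."""
    lo, hi = domain
    xs = [random.uniform(lo, hi) for _ in range(n_samples)]
    return sum(abs(ground_truth(x) - estimate(x)) for x in xs) / n_samples

# Example: an estimate that misses the behavior on negative inputs scores worse.
truth = lambda x: abs(x)
good_guess = lambda x: abs(x)
bad_guess = lambda x: x
print(replication_score(truth, good_guess), replication_score(truth, bad_guess))
```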

While the AIAs outperform existing interpretability approaches, they still struggle to accurately describe almost half of the functions in the benchmark. This is often due to insufficient sampling in areas with noise or irregular behavior. To address this issue, the researchers guided the AIAs’ exploration by initializing their search with specific, relevant inputs, which significantly improved interpretation accuracy.
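
The sketch below illustrates the seeding idea: probe hand-picked, relevant inputs first, then fall back to broad random sampling so that narrow regions of noisy or irregular behavior are less likely to be missed. The strategy and function names are a simplified assumption, not the researchers’ exact procedure.

```python
import random

def probe_with_seeds(f, seed_inputs, n_random=20, domain=(-10.0, 10.0)):
    """Evaluate hand-picked, relevant inputs first, then add broad random
    sampling for coverage of the rest of the domain."""
    observations = [(x, f(x)) for x in seed_inputs]  # targeted probes first
    lo, hi = domain
    observations += [(x, f(x)) for x in
                     (random.uniform(lo, hi) for _ in range(n_random))]  # broad coverage
    return observations

# Example: a function that is only nonzero on a narrow interval random
# sampling alone would likely miss.
f = lambda x: 1.0 if 3.0 <= x <= 3.5 else 0.0
print(probe_with_seeds(f, seed_inputs=[3.1, 3.4])[:4])
```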

The researchers are also developing a toolkit to enhance the AIAs’ ability to conduct precise experiments on neural networks. This toolkit aims to provide better tools for input selection and hypothesis testing, enabling more nuanced and accurate analysis of neural networks.

Ultimately, the goal is to develop automated interpretability procedures that can help audit systems in real-world scenarios. These procedures would diagnose potential failure modes, hidden biases, or unexpected behaviors in systems like autonomous driving or face recognition. The researchers envision AIAs that can autonomously audit other systems, with human scientists providing oversight and guidance.

By expanding AI interpretability to include complex behaviors and predicting undesired behaviors, this research is a significant step towards making AI systems more understandable and reliable. The new benchmark and AI agents offer sophisticated tools for addressing one of the biggest challenges in machine learning—interpretability. The automated interpretability agents, in particular, demonstrate the potential of turning AI back on itself to enhance human understanding.
