Status: Completed | Date: 14 Kislev 5785
Das et al. (2024), "Can LLMs faithfully generate their layperson-understandable self? A Case Study in High-Stakes Domains"
Full Disclosure: dyb maintains a friendly professional relationship with co-author V.G., which may introduce a positive bias into this analysis.
TL;DR:
Layman TL;DR:
This research helps us understand how AI models like ChatGPT make decisions in important areas like law, healthcare, and finance. The authors developed a method to get these models to explain their thinking step by step, making the AI easier to check and trust. The approach achieves over 90% consistency in the legal and medical scenarios tested.
AI/Machine Learning Enthusiast TL;DR:
A novel prompting method "ReQuesting" that compels LLMs to generate algorithms explaining their reasoning in high-stakes domains (law, health, finance). This method achieves high reproducibility, with PerRR scores often exceeding 90% in legal and health domains, indicating effective performance capture. However, prediction-level reproducibility (PreRR) varies, necessitating further investigation into LLM prediction consistency.
Domain Expert (Law, Health, Finance) TL;DR:
"ReQuesting" is a method that enhances the explainability of LLMs in law, health, and finance. By prompting LLMs to generate algorithms reflecting their decision-making, the method achieves high performance reproducibility (PerRR scores >90%). This offers a more transparent way to utilize LLMs, but variability in individual prediction consistency (PreRR) requires careful consideration in critical applications.
Non-Hype (Skeptic) TL;DR:
While "ReQuesting" is presented as a method for generating explainable algorithms from LLMs, the results are preliminary and domain-specific. High overall reproducibility (PerRR) does not guarantee consistent individual predictions (PreRR varies considerably). The method's reliance on specific LLMs and datasets, along with zero-shot prompting may not generalize and limit practical applicability in real-world scenarios. The "explainable" algorithms may also oversimplify the LLM's true internal workings.
Application:
Problem: Large Language Models (LLMs) are increasingly used in critical areas like law, healthcare, and finance, but their complex inner workings are often opaque. This "black box" nature makes it hard for people, especially non-experts, to understand and trust their decisions, especially when the stakes are high.
Solution: This research introduces a method called "ReQuesting." It's like asking the LLM to explain its thinking in a way that a regular person can understand. ReQuesting involves prompting the LLM to generate a step-by-step algorithm that mirrors how it arrived at a particular answer. This algorithm is then tested for consistency and accuracy. For example, in the legal domain, the ReQuesting method achieved PerRR scores of 98.11% (Gemini) and 91.39% (Llama) for statute prediction, indicating high reproducibility of the LLM's performance when using the generated algorithm.
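To make this concrete, the exchange might look roughly like the sketch below. The prompt wording is paraphrased from the study's description (the exact prompts are in the paper's appendix), and the returned algorithm is invented here purely for illustration.

```python
# Illustrative only: paraphrased ReQuest-style prompts and an invented example of the
# kind of layperson-readable algorithm an LLM might return. Not the paper's exact text.
TASK_PROMPT = "Given the following fact statement, predict the relevant legal statutes:\n{facts}"
REQUEST_PROMPT = (
    "What steps did you follow to arrive at these statutes? "
    "Write them as a numbered algorithm a layperson could follow."
)

# A hypothetical algorithm the LLM might generate in response to REQUEST_PROMPT:
EXAMPLE_ALGORITHM = """\
1. Read the fact statement and list the actions taken by each party.
2. Match each action to the kind of offence it most closely resembles.
3. For each offence, list the statutes that define or penalize it.
4. Keep only the statutes whose conditions are fully satisfied by the facts."""
```

The robustness check then asks an LLM to act as a bot that follows such an algorithm verbatim and re-solves the same cases, so its outputs can be compared against the original answers.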
Unexpected Findings:
- High Reproducibility in Legal Domain: The ReQuesting method showed surprisingly high reproducibility in the legal domain, especially when the same LLM was used to generate and execute the algorithm (intra-LLM setup). For instance, in statute prediction, PerRR values reached 98.11% for Gemini and 91.39% for Llama, while in human rights prediction, PerRR values were consistently high for both LLMs, ranging from roughly 92% to 96%.
- Rationale: This is significant because it implies that LLMs can be used more confidently in legal settings, where understanding the basis of a decision is crucial. The high PerRR scores suggest that the generated algorithms effectively capture the LLM's reasoning process in this domain.
- Variability in Prediction-Level Reproducibility: While overall performance reproducibility was often high (PerRR scores frequently above 90%), the prediction-level reproducibility (measured by PreRR) showed more variability across tasks and LLMs. For example, in statute prediction, PreRR values ranged from 0.3880 to 0.6083, while in the health domain, PreRR values for suicide risk assessment sat around 0.6-0.7 and those for depression severity detection ranged from about 0.66 to 0.99 depending on the setup.
- Rationale: This is concerning because it suggests that even when LLMs produce similar overall results, their specific predictions on individual cases might differ considerably. This highlights a potential lack of deterministic reasoning at a granular level, which could be problematic in high-stakes applications where individual predictions have significant consequences.
- Better Performance of Algorithm from Weaker LLM: In the health domain, the algorithm generated by a weaker LLM (Llama3) sometimes outperformed the algorithm from a stronger LLM (Gemini) when applied to new data. For instance, in the inter-LLM setup for suicide risk assessment, the algorithm generated by Llama3 achieved a PerRR_LA_G score of 94.99%, outperforming the algorithm generated by Gemini.
- Rationale: This is surprising because it suggests that the complexity of a model doesn't always correlate with the quality of its explainable output. It raises questions about how different LLMs internalize and represent knowledge and suggests that simpler models might sometimes produce more transparent and generalizable explanations.
Key Terms:
- Large Language Model (LLM): A type of artificial intelligence program that can understand and generate human-like text. Examples include Gemini and LLaMA.
- ReQuesting: A new method proposed in this paper where an LLM is prompted to generate an algorithm explaining its reasoning for a given task.
- Faithfulness: In this context, it refers to how accurately the generated algorithm reflects the LLM's actual internal process.
- Reproducibility: The ability of an LLM to produce consistent results when given the same task or when using the generated algorithm. Measured quantitatively using PerRR and PreRR.
- Intra-LLM Setup: Experiments where the same LLM is used to generate and execute the algorithm.
- Inter-LLM Setup: Experiments where the algorithm generated by one LLM is executed on a different LLM.
- Macro F1 Score: A measure of a model's overall performance, considering both precision and recall. Ranges from 0 to 1, with higher values indicating better performance.
- Jaccard Score: A measure of similarity between two sets of data, used here to assess prediction-level reproducibility. Ranges from 0 to 1, with higher values indicating greater similarity.
- PerRR (Performance Reproduction Ratio): A measure of how well the performance of an LLM is reproduced when using the ReQuest algorithm. Calculated as a percentage, with values closer to 100% indicating higher reproducibility.
- PreRR (Prediction Reproduction Ratio): A measure of how well the specific predictions of an LLM are reproduced at the individual data point level. Calculated as a Jaccard similarity score, with values closer to 1 indicating higher prediction-level reproducibility.
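To make the two reproducibility metrics concrete, here is a minimal computation sketch. It assumes PerRR is the smaller of the two Macro F1 scores (direct task run vs. ReQuest-algorithm run) divided by the larger, expressed as a percentage, and PreRR is the mean per-instance Jaccard similarity between the two sets of predictions; the paper's exact formulas may differ in detail, and all names and data below are illustrative.

```python
# Minimal sketch of PerRR / PreRR as described above; not the authors' code.
from sklearn.metrics import f1_score


def perrr(task_f1: float, algo_f1: float) -> float:
    """Performance Reproduction Ratio: agreement between the two Macro F1 scores, in percent."""
    lo, hi = sorted((task_f1, algo_f1))
    return 100.0 * lo / hi if hi > 0 else 0.0


def prerr(task_preds: list[set], algo_preds: list[set]) -> float:
    """Prediction Reproduction Ratio: mean per-instance Jaccard similarity of predicted label sets."""
    scores = []
    for a, b in zip(task_preds, algo_preds):
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Illustrative usage with made-up single-label predictions (wrapped as one-element sets).
gold = ["art3", "art5", "art6", "art3"]
direct = ["art3", "art5", "art8", "art3"]     # answers from the task prompt
via_algo = ["art3", "art6", "art8", "art3"]   # answers obtained by following the ReQuest algorithm

f1_direct = f1_score(gold, direct, average="macro")
f1_algo = f1_score(gold, via_algo, average="macro")
print(f"PerRR = {perrr(f1_direct, f1_algo):.2f}%")
print(f"PreRR = {prerr([{p} for p in direct], [{p} for p in via_algo]):.4f}")
```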
Approach:
1. Outline of the Research Methodology:
- Task Selection: The researchers chose three high-stakes domains: law, health, and finance.
- Dataset Selection: They selected relevant datasets for each domain, such as legal statute prediction (45 fact texts, 18 unique statutes), human rights violation detection (11.5k cases), stock price prediction (data from 88 stocks), suicide risk assessment (5 classes), and depression severity detection (4 classes).
- LLM Selection: They used two state-of-the-art LLMs: Gemini and Llama3 (70B variant).
- Prompting: They developed a three-stage prompting regime (sketched in code at the end of this Approach section):
- Task Prompt: Prompts the LLM to perform a specific task (e.g., predict relevant statutes for a given legal fact statement).
- ReQuest Prompt: Prompts the LLM to generate an algorithm explaining its reasoning (e.g., "What steps did you follow to arrive at these statutes?").
- Robustness Check Prompt: Prompts the LLM to perform the same task using the generated algorithm (e.g., prompts the LLM to act as a bot and follow the steps of the algorithm).
- Evaluation: They evaluated the reproducibility of the LLM's responses using two main metrics:
- PerRR (Performance Reproduction Ratio): Measures the overall performance reproducibility using Macro F1 scores.
- PreRR (Prediction Reproduction Ratio): Measures the prediction-level reproducibility using Jaccard similarity scores.
- Intrinsic Reasoning Alignment: They explored the alignment between the generated algorithms and the LLM's intrinsic reasoning using a method based on decoding paths, comparing the top-k tokens generated during the LLM's internal processing with the steps of the generated algorithm.
2. Describe Problem-Solving Techniques:
- ReQuesting: The core technique is prompting the LLM to generate a human-understandable algorithm that explains its reasoning process. This algorithm is essentially a set of steps that a person could follow to arrive at the same conclusion as the LLM.
- Reproducibility as a Proxy for Faithfulness: The researchers used reproducibility, quantified by PerRR and PreRR scores, as a measure of how faithful the generated algorithm is to the LLM's internal process. The idea is that if the algorithm accurately reflects the LLM's reasoning, then using the algorithm should produce similar results to the LLM's direct output.
- Intra-LLM and Inter-LLM Evaluation: They tested the reproducibility both within the same LLM and across different LLMs to assess the consistency and generalizability of the generated algorithms. For example, they compared PerRR_GP_G (Gemini generating and using the algorithm) with PerRR_GP_L (Gemini's algorithm used by Llama).
- Alignment with Intrinsic Reasoning: They adapted a method from previous research to explore the internal reasoning paths of the LLM during task execution and compared these paths to the generated algorithms. They used a smaller LLaMA 3.2-1B model to analyze the top-k tokens generated at each step and compared them with the steps of the ReQuest algorithm.
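Putting the pieces together, the three-stage regime and the intra-/inter-LLM robustness check could be wired up roughly as follows. This is a sketch under the assumptions that a single generated algorithm is reused across test cases and that `complete(model, prompt)` stands in for whatever LLM API is actually used; the helper, model names, and prompt wording are placeholders, not the paper's code.

```python
# Sketch of the ReQuesting pipeline and its intra-/inter-LLM robustness check (not the authors' code).
def complete(model: str, prompt: str) -> str:
    """Placeholder for a chat-completion call; plug in the client you actually use."""
    raise NotImplementedError


def generate_algorithm(generator: str, example_facts: str) -> str:
    """Stages 1-2: solve one example directly, then ask the model how it reached its answer."""
    answer = complete(generator, f"Predict the relevant statutes for these facts:\n{example_facts}")
    return complete(
        generator,
        f"Facts:\n{example_facts}\nYour answer:\n{answer}\n\n"
        "What steps did you follow to arrive at these statutes? "
        "Write them as a numbered algorithm a layperson could follow.",
    )


def robustness_check(generator: str, executor: str, cases: list[str]) -> tuple[list[str], list[str]]:
    """Stage 3: the executor re-solves every case by acting as a bot that follows the algorithm.

    executor == generator -> intra-LLM setup (e.g. PerRR_GP_G);
    executor != generator -> inter-LLM setup (e.g. PerRR_GP_L).
    """
    algorithm = generate_algorithm(generator, cases[0])  # simplification: one algorithm reused for all cases
    direct = [complete(generator, f"Predict the relevant statutes for these facts:\n{facts}") for facts in cases]
    guided = [
        complete(executor, f"Act as a bot and follow these steps exactly:\n{algorithm}\n\nFacts:\n{facts}")
        for facts in cases
    ]
    return direct, guided
```

The two prediction lists can then be scored with the PerRR/PreRR sketch given under Key Terms, and comparing, say, `robustness_check("gemini", "gemini", cases)` against `robustness_check("gemini", "llama3-70b", cases)` mirrors the intra- vs. inter-LLM comparison described above (the model identifiers here are illustrative).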
Results and Evaluation:
1. Key Findings:
- High Overall Reproducibility: The ReQuesting method generally achieved high overall reproducibility (PerRR) across different tasks and LLMs. In the legal domain, PerRR scores were often above 90%, and in the health domain they mostly exceeded 90% as well. This suggests that LLMs can generate algorithms that reflect their performance reasonably well.
- Variable Prediction-Level Reproducibility: Prediction-level reproducibility (PreRR) was more variable. For instance, in statute prediction, PreRR values ranged from 0.3880 to 0.6083, while in the health domain, PreRR values for suicide risk assessment sat around 0.6-0.7 and those for depression severity detection ranged from about 0.66 to 0.99 depending on the setup. This indicates that LLMs might not always make consistent predictions at the individual data point level.
- Domain-Specific Differences: Reproducibility was particularly high in the legal domain (PerRR roughly 92-96% for human rights prediction), suggesting that LLMs might be better suited for generating explainable algorithms in this area.
- Alignment with Intrinsic Reasoning: Preliminary analysis suggested a degree of alignment between the generated algorithms and the LLM's intrinsic reasoning paths, providing some evidence that the algorithms capture aspects of the LLM's internal decision-making process. For example, in Table 5, specific steps in the ReQuest algorithm for suicide risk assessment were shown to correspond to key phrases identified in the LLM's internal reasoning process.
2. Quantitative Results:
| Domain / Task | Model Comparison | PerRR (%) | PreRR |
|---|---|---|---|
| Legal: Statute Prediction | Gemini (Intra-LLM) | 98.11 | 0.5188 |
| | Llama (Intra-LLM) | 91.39 | 0.6083 |
| | Gemini → Llama (Inter-LLM) | 93.16 | 0.4487 |
| | Llama → Gemini (Inter-LLM) | 84.98 | 0.3880 |
| Legal: Human Rights | Gemini (Intra-LLM) | 92.01 | 0.9613 |
| | Llama (Intra-LLM) | 95.85 | 0.9609 |
| | Gemini → Llama (Inter-LLM) | 93.45 | 0.9377 |
| | Llama → Gemini (Inter-LLM) | 95.39 | 0.9550 |
| Finance: Apple (AAPL) | Gemini (Intra-LLM) | 91.00 | 0.6458 |
| | Llama (Intra-LLM) | 92.68 | 0.9375 |
| | Gemini → Llama (Inter-LLM) | 91.67 | 0.4791 |
| Finance: Google (GOOG) | Gemini (Intra-LLM) | 97.82 | 0.5464 |
| | Llama (Intra-LLM) | 94.59 | 0.9746 |
| | Gemini → Llama (Inter-LLM) | 93.75 | 0.6473 |
| Health: Suicide Risk (Suicide Watch) | Gemini (Intra-LLM) | 90.54 | 0.6170 |
| | Llama (Intra-LLM) | 94.34 | 0.7010 |
| | Gemini → Llama (Inter-LLM) | 95.47 | 0.7280 |
| Health: Depression Severity | Gemini (Intra-LLM) | 97.54 | 0.6720 |
| | Llama (Intra-LLM) | 98.70 | 0.9860 |
| | Gemini → Llama (Inter-LLM) | 86.70 | 0.6570 |
3. Notable Achievements:
- Novel Method for Explainability: The paper introduces a novel and effective method (ReQuesting) for generating explainable algorithms from LLMs, addressing a critical need for transparency in high-stakes domains. The quantitative results, particularly the high PerRR scores in many cases, demonstrate the effectiveness of this method.
- Demonstrated Reproducibility: The research demonstrates that LLMs can produce reproducible results, which is crucial for building trust in their applications. The high PerRR scores across various tasks and domains provide strong evidence for this reproducibility.
- Insights into LLM Reasoning: The study provides valuable insights into the internal reasoning processes of LLMs and how they can be made more transparent and understandable. The quantitative analysis of intrinsic reasoning alignment, although preliminary, offers a promising avenue for further research.
Practical Deployment and Usability:
- Real-World Applicability: The ReQuesting method has significant real-world applicability in domains where understanding the reasoning behind AI decisions is crucial, such as:
- Law: Assisting legal professionals in understanding and validating LLM-generated legal analyses. For example, a lawyer could use ReQuesting to understand why an LLM predicted a particular legal statute to be relevant to a case, with the quantitative results showing high confidence in the reproducibility of such predictions (e.g., PerRR of 98.11% for Gemini in statute prediction).
- Healthcare: Helping doctors understand and trust LLM-based diagnoses or treatment recommendations. A doctor could use ReQuesting to understand why an LLM diagnosed a patient with a specific condition, with the quantitative results providing a measure of confidence in the consistency of the diagnosis (e.g., PerRR of 97.54% for Gemini in depression severity detection).
- Finance: Providing investors with insights into the reasoning behind LLM-driven financial predictions. A financial analyst could use ReQuesting to understand why an LLM predicted a stock price to rise or fall, with the quantitative results offering a measure of the prediction's reliability (e.g., PerRR of 91% for Gemini in Apple stock prediction).
- Ease of Use: The method is relatively easy to use, as it relies on prompting LLMs, which is a common way of interacting with these models. The prompts used in the study are provided in the appendix, making it straightforward for others to replicate and adapt the method.
- Examples:
- A lawyer could use ReQuesting to understand why an LLM predicted a particular legal statute to be relevant to a case, supported by the high PerRR scores observed in the legal domain.
- A doctor could use it to understand why an LLM diagnosed a patient with a specific condition, with the confidence provided by the PerRR and PreRR scores for the health domain.
- A financial analyst could use it to understand why an LLM predicted a stock price to rise or fall, using the PerRR and PreRR scores to assess the reliability of the prediction.
Limitations, Assumptions, and Caveats:
- Limited Scope of LLMs: The study focused on two specific LLMs (Gemini and Llama3), and the findings might not generalize to other models. This limits the external validity of the results.
- Zero-Shot Prompting: The study primarily used zero-shot prompting, which might not be the most effective way to elicit accurate and consistent responses from LLMs. Fine-tuning or few-shot prompting might yield different results.
- Computational Resources: The analysis of intrinsic reasoning was limited by computational resources. The use of a smaller LLaMA 3.2-1B model for this analysis might not fully capture the complexities of the larger models used in the main experiments.
- Assumptions:
- Reproducibility, as measured by PerRR and PreRR, is a good proxy for faithfulness. This assumes that consistent performance and predictions indicate that the generated algorithm accurately reflects the LLM's internal process.
- The generated algorithms capture the essential aspects of the LLM's reasoning process. This is a simplification, as the algorithms might not capture all the nuances of the LLM's internal computations.
- Caveats:
- The study does not claim that the generated algorithms perfectly reflect the LLM's internal workings. The algorithms are presented as human-understandable approximations of the LLM's reasoning.
- The variability in prediction-level reproducibility (PreRR) suggests that LLMs might not always be reliable at the individual data point level. This is particularly important in high-stakes domains where individual predictions have significant consequences.
Promises and Horizons:
- Potential Benefits:
- Increased Trust: ReQuesting can enhance trust in LLMs by making their decision-making processes more transparent and understandable. The quantitative results, such as high PerRR scores, provide evidence for the reliability of the generated explanations.
- Improved Accuracy: By understanding how LLMs reason, developers can potentially improve their accuracy and reliability. The insights gained from analyzing PerRR and PreRR scores can guide model refinement.
- Ethical Considerations: The method can help address ethical concerns related to the use of AI in high-stakes domains by providing a means to scrutinize and validate LLM-based decisions.
- Future Research:
- Exploring the use of ReQuesting with other LLMs, including commercial ones, to assess the generalizability of the findings.
- Investigating the effectiveness of different prompting strategies (e.g., few-shot prompting, chain-of-thought prompting) for eliciting more accurate and consistent algorithms.
- Conducting a more comprehensive analysis of the alignment between generated algorithms and LLM's intrinsic reasoning, using larger models and more sophisticated analysis techniques.
- Developing methods for automatically evaluating the quality and faithfulness of generated algorithms, potentially using metrics beyond PerRR and PreRR.
- Evolution:
- ReQuesting could evolve into a standard technique for developing and deploying LLMs in high-stakes domains, with the quantitative evaluation metrics (PerRR and PreRR) becoming integral to the development process.
- It could lead to the development of new tools and interfaces for interacting with LLMs and understanding their reasoning, potentially incorporating visualizations of PerRR and PreRR scores to aid in interpretation.
- It could contribute to the development of more responsible and trustworthy AI systems by providing a mechanism for ensuring transparency and accountability in AI-driven decision-making.
Conflict of Interest:
The authors do not explicitly mention any conflicts of interest. Potential biases could arise from the choice of LLMs, datasets, or evaluation metrics. These biases could potentially influence the results by favoring certain types of LLMs or tasks. For instance, if the chosen datasets are not representative of real-world scenarios, the reproducibility results (PerRR and PreRR scores) might not be accurate.