dybilar

2407.04014v1 Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems - Extended with Claude 3 Tokenization Investigation

Workload-Based Energy Models for LLM Inference

Claude 3 tokenization: Max Buckley LI post & Beren blog

TL;DR: This note examines recent research on energy consumption during Large Language Model (LLM) inference on systems using both GPUs and CPUs. The researchers propose regression models that predict energy usage and runtime from the workload (input and output token counts), and the fitted models explain over 96% of the variance in the measurements (R² > 0.96). This opens up possibilities for scheduling LLM tasks in a way that conserves energy: in the paper's case study, the savings reach up to 8 kWh across a batch of 500 queries. We also discuss how this research could guide future investigations into novel tokenization strategies, particularly the intriguing case of Claude 3.

Application: As LLMs become more prevalent, their energy footprint during inference (using the model, not just training it) is a growing concern. This research helps data center operators find the sweet spot between LLM accuracy and energy efficiency.

Unexpected Findings:

  • Output Token Dominance: Input and output token counts both affect energy and runtime, but output tokens have a much stronger impact. This implies that generating text is more energy-demanding than processing input, which is consistent with autoregressive decoding: the input is processed in one pass, while every output token requires its own forward pass. Evidence is seen in the ANOVA results, where the F-statistic for output tokens (126.63 for energy and 104.98 for runtime) far exceeds that of input tokens (15.86 for energy and 12.97 for runtime) (Table 2).

  • SMoE Efficiency: Sparse Mixture-of-Experts (SMoE) models, like Mixtral, are significantly more energy-efficient and faster than denser LLMs, especially when handling longer inputs and outputs. This highlights the potential of sparse architectures for sustainable AI (clearly shown in Figures 1 and 2, where Mixtral's energy per token and throughput excel, particularly at higher token counts).

Key Terms:

  • LLM Inference: Using a trained LLM to generate text or perform language tasks based on new input.
  • Workload: The set of inference requests sent to an LLM, described by the length of input and output in tokens. Think of it like a list Q, where each item is a pair of numbers: (number of input tokens, number of output tokens).
  • Heterogeneous System: A computer system that combines different types of processors (CPUs and GPUs) to handle various tasks. This paper focuses on systems using Nvidia A100 GPUs and AMD Epyc CPUs.
  • Energy Model: A mathematical model predicting how much energy an LLM will consume based on the characteristics of the workload. The proposed model looks like this: e = α₀ + α₁ (input tokens) + α₂ (output tokens) + α₃ (input tokens * output tokens), where the α values are specific to each model and are determined through regression analysis.
  • Runtime Model: Similar to the energy model, but this one predicts how long it takes for an LLM to process a given workload. The form is the same as the energy model, just with different coefficients (β instead of α).
  • Sparse Mixture-of-Experts (SMoE): An LLM design where only a subset of the model's parameters is activated for each token, making it more efficient. Mixtral (8x7B), for example, has roughly 47B parameters in total but activates only about 13B of them per token, contributing to its performance gains.
  • Accuracy (A): How well an LLM performs on various language tasks, often represented as a percentage. This paper uses the average accuracy (A) from the Hugging Face Leaderboard, which combines performance across multiple benchmarks.
  • Zero-Width Character: An invisible character in Unicode that doesn't take up any space when displayed but can carry out specific functions or hold information.
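The energy and runtime models defined above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the class name `WorkloadModel` and every coefficient value below are hypothetical placeholders chosen only to make the example run. The fitted values belong in the paper's Table 3, which this note does not reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadModel:
    """Evaluates c0 + c1*t_in + c2*t_out + c3*t_in*t_out for one LLM."""
    c0: float
    c1: float
    c2: float
    c3: float

    def predict(self, t_in: int, t_out: int) -> float:
        return self.c0 + self.c1 * t_in + self.c2 * t_out + self.c3 * t_in * t_out


# Hypothetical coefficients (NOT from the paper), for illustration only.
energy_model = WorkloadModel(c0=50.0, c1=0.5, c2=20.0, c3=0.01)         # joules
runtime_model = WorkloadModel(c0=0.2, c1=0.001, c2=0.03, c3=0.00001)    # seconds

# A workload Q is a list of (input tokens, output tokens) pairs.
Q = [(8, 8), (128, 256), (2048, 4096)]

total_energy = sum(energy_model.predict(t_in, t_out) for t_in, t_out in Q)
total_runtime = sum(runtime_model.predict(t_in, t_out) for t_in, t_out in Q)
print(f"energy: {total_energy:.1f} J, runtime: {total_runtime:.2f} s")
```

The same class serves both models because the runtime model has the same form as the energy model, only with β coefficients in place of α.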

Approach:

  1. LLM Selection: Researchers selected several open-source LLMs with varying sizes and architectures, including Falcon, Llama-2, and Mistral (Table 1).
  2. Energy Profiling: They meticulously measured energy consumption of both CPUs and GPUs during LLM inference using specialized tools (PyJoules, AMD µProf) to see how it relates to different workloads.
  3. Controlled Experiments: They systematically tested various input and output token lengths (input: 8 to 2048 tokens; output: 8 to 4096 tokens), to isolate their individual and combined effects on energy consumption and runtime.
  4. Model Development: Using the data from their experiments, they built their energy and runtime models. These models achieved high accuracy (R² > 0.96) in predicting energy consumption and runtime based on the workload (Table 3).
  5. Offline Optimization: The researchers devised an optimization problem (Equation 2) to minimize energy consumption without sacrificing too much accuracy. Their developed models were then used to guide decisions about which LLM to use for a given task, potentially leading to significant energy savings.
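Step 4 above amounts to an ordinary least-squares fit with an interaction term. The sketch below shows one way to do that with NumPy. The measurements are synthetic (generated from hypothetical coefficients plus noise), and the helper names `design_matrix` and `fit_workload_model` are illustrative, so nothing here reproduces the paper's data or tooling.

```python
import numpy as np


def design_matrix(t_in, t_out):
    """Columns: intercept, input tokens, output tokens, input*output."""
    t_in = np.asarray(t_in, dtype=float)
    t_out = np.asarray(t_out, dtype=float)
    return np.column_stack([np.ones_like(t_in), t_in, t_out, t_in * t_out])


def fit_workload_model(t_in, t_out, y):
    """Least-squares fit; returns the coefficients and R^2."""
    y = np.asarray(y, dtype=float)
    X = design_matrix(t_in, t_out)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    r2 = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
    return coef, r2


# Token grid spanning the ranges quoted in step 3 (the grid points are assumed).
grid_in = [8, 32, 128, 512, 2048]
grid_out = [8, 32, 128, 512, 2048, 4096]
t_in, t_out = (a.ravel() for a in np.meshgrid(grid_in, grid_out))

# Synthetic "measurements": hypothetical coefficients plus 2% noise.
rng = np.random.default_rng(0)
true_coef = np.array([50.0, 0.5, 20.0, 0.01])
y = design_matrix(t_in, t_out) @ true_coef
y = y * (1.0 + rng.normal(0.0, 0.02, size=y.shape))

coef, r2 = fit_workload_model(t_in, t_out, y)
print("fitted coefficients:", coef)
print("R^2:", r2)
```

Fitting one such model per LLM for energy and one for runtime gives the α and β coefficient sets that the offline optimization consumes.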

Table 1: Evaluated LLMs with memory footprint, GPU count, and average leaderboard accuracy (A)

LLM (# Params)  | vRAM Size (GB) | # A100s | A (%)
Falcon (7B)     | 14.48          | 1       | 44.17
Falcon (40B)    | 83.66          | 3       | 58.07
Llama-2 (7B)    | 13.48          | 1       | 50.97
Llama-2 (13B)   | 26.03          | 1       | 55.69
Llama-2 (70B)   | 137.98         | 4       | 64.52
Mistral (7B)    | 15.00          | 1       | 60.97
Mixtral (8x7B)  | 93.37          | 3       | 68.47

Results and Evaluation:

  • Workload-Based Models: The models successfully demonstrated that energy consumption can be predicted based on the workload (Table 3), paving the way for more energy-aware systems.
  • Energy-Accuracy Trade-off: The optimization framework allows data center operators to control the trade-off between energy efficiency and accuracy using a parameter called ζ (zeta). This provides flexibility depending on operational needs and priorities.
  • Case Study: A simulation using the Llama-2 models showed that the proposed scheduler could reduce energy consumption by up to 8 kWh and save over 100 seconds of runtime, all while maintaining acceptable accuracy. This was achieved by intelligently routing 500 queries from the Alpaca dataset to the most appropriate Llama-2 model (7B, 13B, or 70B parameters) (Figure 3).
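To make the role of ζ concrete, here is a simplified per-query routing sketch over the three Llama-2 models. It is an assumption-laden stand-in for the paper's Equation 2, which this note does not reproduce: the normalized cost `zeta * energy - (1 - zeta) * accuracy` is an assumed objective, the energy coefficients are hypothetical, and only the accuracy values come from Table 1.

```python
# Average accuracy A (%) from Table 1.
ACCURACY = {
    "Llama-2 (7B)": 50.97,
    "Llama-2 (13B)": 55.69,
    "Llama-2 (70B)": 64.52,
}

# Hypothetical energy coefficients (a0, a1, a2, a3) in joules, NOT from the paper.
ENERGY_COEF = {
    "Llama-2 (7B)": (10.0, 0.2, 8.0, 0.002),
    "Llama-2 (13B)": (15.0, 0.4, 14.0, 0.004),
    "Llama-2 (70B)": (60.0, 2.0, 70.0, 0.02),
}


def energy(model, t_in, t_out):
    a0, a1, a2, a3 = ENERGY_COEF[model]
    return a0 + a1 * t_in + a2 * t_out + a3 * t_in * t_out


def route(query, zeta):
    """Pick the model with the lowest assumed cost for one (t_in, t_out) query.

    zeta = 1 weighs only energy; zeta = 0 weighs only accuracy.
    """
    t_in, t_out = query
    e = {m: energy(m, t_in, t_out) for m in ACCURACY}
    e_max = max(e.values())
    a_max = max(ACCURACY.values())

    def cost(m):
        return zeta * e[m] / e_max - (1.0 - zeta) * ACCURACY[m] / a_max

    return min(ACCURACY, key=cost)


queries = [(32, 64), (512, 256), (2048, 1024)]
for zeta in (0.0, 0.5, 1.0):
    print(zeta, [route(q, zeta) for q in queries])
```

With these placeholder numbers, ζ = 1 sends every query to the 7B model and ζ = 0 sends every query to the 70B model. Intermediate values shift the balance, which is the knob the paper hands to operators.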

Table 2: ANOVA Results for LLM Energy Consumption and Runtime

Metric      | Variable      | Sum of Squares | F-statistic | p-value
Energy (J)  | Input Tokens  | 5.17 × 10¹⁰    | 15.86       | 3.79 × 10⁻¹⁷
Energy (J)  | Output Tokens | 4.13 × 10¹¹    | 126.63      | 1.22 × 10⁻⁶⁵
Energy (J)  | Interaction   | 1.18 × 10¹¹    | 4.53        | 4.67 × 10⁻¹⁵
Runtime (s) | Input Tokens  | 3.43 × 10⁵     | 12.97       | 2.34 × 10⁻¹⁴
Runtime (s) | Output Tokens | 2.78 × 10⁶     | 104.98      | 4.56 × 10⁻⁶⁰
Runtime (s) | Interaction   | 8.21 × 10⁵     | 3.88        | 1.92 × 10⁻¹²
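A quick arithmetic check on Table 2 shows how lopsided the effect is. The snippet below only divides numbers copied from the table. The dictionary layout and the function name are illustrative.

```python
# Sums of squares and F-statistics copied from Table 2.
anova = {
    "energy": {
        "input": {"ss": 5.17e10, "f": 15.86},
        "output": {"ss": 4.13e11, "f": 126.63},
    },
    "runtime": {
        "input": {"ss": 3.43e5, "f": 12.97},
        "output": {"ss": 2.78e6, "f": 104.98},
    },
}


def output_to_input_ratio(metric, stat):
    """How many times larger the output-token statistic is than the input-token one."""
    row = anova[metric]
    return row["output"][stat] / row["input"][stat]


for metric in anova:
    f_ratio = output_to_input_ratio(metric, "f")
    ss_ratio = output_to_input_ratio(metric, "ss")
    print(f"{metric}: F ratio {f_ratio:.2f}, sum-of-squares ratio {ss_ratio:.2f}")
```

For both energy and runtime, the output-token statistics are roughly eight times the input-token ones, which is the "output token dominance" finding in numeric form.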

Practical Deployment and Usability:

  • Practicality: The paper's framework is aimed at operators of real-world data centers. The models are simple regressions over input and output token counts, so they are straightforward to fit and evaluate, and the optimization framework puts the control of the energy-accuracy trade-off (via ζ) in the hands of the operators.

NVIDIA A100 GPU Note:

The study heavily relies on the Nvidia A100 GPU for both performance and energy measurements.

  • Generalizability: The energy models might not apply directly to other GPUs. Different architectures or even different Nvidia generations could have varying energy consumption characteristics.
  • Power Management: The study disabled some power-saving features (like dynamic frequency scaling) to ensure consistent measurements. Real-world systems typically use these features, which would affect energy consumption.

Threats to Validity:

Internal Validity:

  • Measurement Accuracy: The accuracy of the energy and runtime data depends on the tools used (PyJoules and AMD µProf). Any errors in these tools could affect the models.
  • Workload Representativeness: The chosen workload might not reflect the full range of LLM tasks in the real world.
  • Disabled Key-Value Caching: Disabling this optimization could affect energy consumption. Systems with caching enabled might have different energy profiles.

External Validity:

  • Single Node HPC Environment: The study's focus on a single HPC node limits its applicability to large data centers with more complex setups and network overheads.

Promises and Horizons:

  • Real-Time Optimization: These models could be used to create online scheduling algorithms that dynamically adjust LLM usage for energy efficiency in real-time.

Addendum: Investigating Claude 3's "Mystery Token"

Recent analyses (the Max Buckley LinkedIn post and the Beren blog post referenced at the top of this note) report that Claude 3 handles numerical information in a unique way:

  • Right-to-Left (R2L) Tokenization: Claude 3 tokenizes numbers from right to left, potentially mimicking human cognition and improving numerical reasoning.
  • Mystery Number Token: An unknown token appears before number sequences in specific situations (after non-word characters), suggesting a possible role in numerical processing.
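The difference between left-to-right and right-to-left number tokenization is easy to see on a toy example. The sketch below splits a digit string into fixed-size chunks from either end. The chunk size of three and the function name `chunk_digits` are assumptions for illustration, and this is not Claude 3's actual tokenizer.

```python
def chunk_digits(digits: str, size: int = 3, right_to_left: bool = True) -> list[str]:
    """Split a digit string into chunks of `size`, aligned to the right or left end."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty string of ASCII digits")
    if right_to_left:
        # Any leftover digits end up in a short chunk at the FRONT.
        head = len(digits) % size
        chunks = [digits[:head]] if head else []
        chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
        return chunks
    # Left-to-right: any leftover digits end up in a short chunk at the END.
    return [digits[i:i + size] for i in range(0, len(digits), size)]


print(chunk_digits("1234567", right_to_left=True))   # ['1', '234', '567']
print(chunk_digits("1234567", right_to_left=False))  # ['123', '456', '7']
```

Right-to-left chunking lines tokens up with thousands separators (1,234,567), so the same chunk always stands for the same place value. That alignment is the usual argument for why it could help arithmetic.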

Hypothesis: The Mystery Token is a Zero-Width Character:

There is a hypothesis that this "mystery token" is a zero-width character—an invisible character used for specific functionalities. This is based on:

  • Invisibility: It's a "mystery" token because it's not visually apparent.
  • Technical Feasibility: LLMs work with token IDs, so they can process zero-width characters even though they are invisible.
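Whether a piece of text contains zero-width characters can be checked directly, independent of any tokenizer. The sketch below scans a string for a handful of well-known zero-width code points. The set is a common subset rather than an exhaustive list, and the function name is illustrative.

```python
import unicodedata

# Common zero-width code points (a subset, not an exhaustive list).
ZERO_WIDTH = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
}


def find_zero_width(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, Unicode name) for each zero-width character found."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in ZERO_WIDTH
    ]


sample = "total:\u200b1234"
print(find_zero_width(sample))  # [(6, 'U+200B', 'ZERO WIDTH SPACE')]
print(find_zero_width("total: 1234"))  # []
```

A check like this only inspects text, not token IDs. It can confirm or rule out zero-width characters in decoded model output, but it says nothing about how a tokenizer represents them internally.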

Leveraging Energy Consumption Analysis:

One researcher suggests investigating this by:

  1. Token-Level Energy Profiling: Measure Claude 3's energy consumption at a very granular level, specifically focusing on the "mystery token."
  2. Controlled Experiments:
    • Prompt Design: Create two sets of prompts:
      • Group A: Numbers preceded by non-word characters (where the "mystery token" is expected).
      • Group B: The same numbers but preceded by word characters (no "mystery token" expected).
    • Energy Measurement: Carefully measure Claude 3's energy consumption for each prompt.
    • Comparison: Look for any consistent differences in energy use between Group A and Group B that correlate with the presence of the "mystery token."
  3. Developing a Claude 3 Energy Model:
    • Model Structure: Expand the paper's energy model to include a variable (M) representing whether or not the "mystery token" is present. The model would look something like this: E = α₀ + α₁ (input tokens) + α₂ (output tokens) + α₃ (input tokens * output tokens) + α₄ (M) + α₅ (M * input tokens) + α₆ (M * output tokens), where E is energy consumption and M is 0 or 1 depending on the presence of the mystery token.
    • Regression Analysis: Analyze the energy data to determine the model's coefficients (the α values) and see if they are statistically significant.
    • Interpretation: If the coefficient for the "mystery token" (α₄) is significant, it supports the idea that it has a unique impact on energy consumption, strengthening the zero-width character hypothesis.
  4. Integrating Tokenization Data:
    • Token ID Analysis: Examine the token IDs generated by Claude 3 for the experimental prompts.
    • Correlation: See if any energy anomalies match up with invisible token IDs that are known to represent zero-width characters.
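The extended model with the indicator M described above can be fitted the same way as the paper's model, with a t-statistic per coefficient as a rough significance check. The sketch below runs on synthetic data only: no Claude 3 energy measurements are available in this note, every coefficient and noise level is hypothetical, and the helper names are illustrative. Token counts are rescaled to kilotokens purely to keep the normal equations well conditioned.

```python
import numpy as np


def extended_design(t_in, t_out, m):
    """Columns: 1, t_in, t_out, t_in*t_out, M, M*t_in, M*t_out (tokens in thousands)."""
    t_in = np.asarray(t_in, dtype=float) / 1000.0
    t_out = np.asarray(t_out, dtype=float) / 1000.0
    m = np.asarray(m, dtype=float)
    return np.column_stack(
        [np.ones_like(t_in), t_in, t_out, t_in * t_out, m, m * t_in, m * t_out]
    )


def ols_with_t_stats(X, y):
    """Ordinary least squares; returns the coefficients and their t-statistics."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = X.shape[0] - X.shape[1]
    sigma2 = (resid @ resid) / dof
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef, coef / std_err


# Every (input, output) pair appears once with M = 0 (Group B) and once with M = 1 (Group A).
grid_in = [8, 32, 128, 512, 2048]
grid_out = [8, 32, 128, 512, 2048, 4096]
t_in, t_out, m = (a.ravel() for a in np.meshgrid(grid_in, grid_out, [0, 1]))
X = extended_design(t_in, t_out, m)

# Synthetic energy data: hypothetical alphas with alpha_4 = 300 J, plus Gaussian noise.
rng = np.random.default_rng(0)
true_alpha = np.array([50.0, 500.0, 20000.0, 10000.0, 300.0, 0.0, 0.0])
y = X @ true_alpha + rng.normal(0.0, 20.0, size=len(X))

alpha_hat, t_stats = ols_with_t_stats(X, y)
print("alpha_4 estimate:", alpha_hat[4])
print("alpha_4 t-statistic:", t_stats[4])
```

A large |t| for α₄ would indicate that prompts in Group A cost measurably more (or less) energy than matched prompts in Group B. On its own, that would show a difference between the groups, not what kind of token causes it.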

Summa Summarum: This research provides a practical way to understand and optimize the energy consumption of large language models. By accurately modeling the relationship between workload and energy use, the researchers move us towards more sustainable AI. Their innovative methodology could also be used to investigate new and intriguing aspects of LLMs, such as unraveling the secrets of Claude 3's "mystery token."

Conflict of Interest: None explicitly stated.