Benchmarking AnLLMs: Insights from OpenBookQA to BoolQ

Authors:

(1) Jianhui Pang, University of Macau; work was done when Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab;

(2) Fanghua Ye, University College London; work was done when Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab;

(3) Derek F. Wong, University of Macau;

(4) Longyue Wang, Tencent AI Lab, and corresponding author.

Abstract and 1 Introduction

2 Related Work

3 Anchor-based Large Language Models

3.1 Background

3.2 Anchor-based Self-Attention Networks

3.3 Anchor-based Inference

4 Experiments and 4.1 Our Implementation

4.2 Data and Training Procedure

4.3 Evaluation

5 Results

6 Analysis

7 Conclusion, Limitations, Ethics Statement, and References

A More Experimental Results

B Data Settings

4.3 Evaluation

In our investigation, we employ a diverse collection of benchmarks with varying text lengths to evaluate our models, including OpenBookQA (OBQA) (Mihaylov et al., 2018), WinoGrande (WG) (Sakaguchi et al., 2021), ARC-easy (ARC-e) and ARC-challenge (ARC-c) (Clark et al., 2018), PIQA (Bisk et al., 2020), HellaSwag (HS) (Zellers et al., 2019), SCIQ (Welbl et al., 2017), and BoolQ (Clark et al., 2019). These benchmarks cover reasoning, comprehension, understanding of the physical world, and prediction of future events. Importantly, their inputs vary in length, from the shorter contexts in OBQA to the longer texts in BoolQ, which allows a thorough assessment of our model's performance across diverse tasks and text complexities. To measure the accuracy and efficiency of our models, we evaluate them with three distinct metrics in both zero-shot and five-shot settings. For AnLLM-AC in the five-shot setting, we append the anchor token to the end of each demonstration.
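As an illustration of the five-shot setup, in which an anchor token closes each demonstration, the following is a minimal sketch rather than the authors' implementation. The anchor string `<AC>`, the separator, and the function name `build_few_shot_prompt` are assumptions made for this example.

```python
from typing import List

# Hypothetical placeholder for the anchor token; the real token string
# depends on the model's tokenizer and is an assumption here.
ANCHOR_TOKEN = "<AC>"


def build_few_shot_prompt(
    demonstrations: List[str],
    query: str,
    anchor_token: str = ANCHOR_TOKEN,
    separator: str = "\n\n",
) -> str:
    """Append the anchor token to every demonstration, then add the query.

    The query itself is left without an anchor token, since only the
    demonstrations are meant to be compressed and cached.
    """
    anchored = [f"{demo.rstrip()} {anchor_token}" for demo in demonstrations]
    return separator.join(anchored + [query])


if __name__ == "__main__":
    demos = [
        "Question: Is the sky blue? Answer: yes",
        "Question: Is fire cold? Answer: no",
    ]
    print(build_few_shot_prompt(demos, "Question: Is ice solid? Answer:"))
```

With five demonstrations, the resulting prompt would contain five anchor tokens, one at the end of each demonstration.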

• Accuracy (Acc). This conventional metric is utilized to gauge the prediction accuracy of models. In accordance with previous studies (Gao et al., 2023), we choose the options with the highest probabilities as predictions and calculate accuracy using the gold-standard labels.

• Keys/Values Caches Reduction (C⇓). In the five-shot evaluation, the demonstrations can be cached in GPU memory for subsequent reuse. However, longer demonstrations consume more memory. This metric assesses the memory efficiency of the AnSAN technique.

• Inference Acceleration Ratio (T⇑). Similar to Wang et al. (2023), and capitalizing on the cached keys/values, we report the inference acceleration ratio, which indicates the inference efficiency of the AnSAN technique.
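The three metrics above can be sketched in a few lines of Python. This is an illustrative sketch only: the text does not give exact formulas for C⇓ and T⇑, so the definitions below (fractional reduction in cached keys/values, and full-attention time divided by AnSAN time) are assumptions, as are all function names.

```python
from typing import Sequence


def predict(option_scores: Sequence[float]) -> int:
    """Return the index of the answer option with the highest score."""
    return max(range(len(option_scores)), key=lambda i: option_scores[i])


def accuracy(all_scores: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of examples whose highest-scoring option matches the gold label."""
    if len(all_scores) != len(gold) or not gold:
        raise ValueError("scores and gold labels must be non-empty and aligned")
    correct = sum(predict(s) == g for s, g in zip(all_scores, gold))
    return correct / len(gold)


def cache_reduction(full_cached: int, anchored_cached: int) -> float:
    """Assumed definition: fraction of cached keys/values saved versus full attention."""
    if full_cached <= 0:
        raise ValueError("full_cached must be positive")
    return 1.0 - anchored_cached / full_cached


def acceleration_ratio(full_seconds: float, anchored_seconds: float) -> float:
    """Assumed definition: full-attention inference time over AnSAN inference time."""
    if anchored_seconds <= 0:
        raise ValueError("anchored_seconds must be positive")
    return full_seconds / anchored_seconds


if __name__ == "__main__":
    scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
    labels = [1, 0, 0]
    print(f"Acc: {accuracy(scores, labels):.3f}")
    print(f"Cache reduction: {cache_reduction(100, 5):.2%}")
    print(f"Acceleration: {acceleration_ratio(2.0, 0.5):.1f}x")
```

Under these assumed definitions, a higher value is better for all three metrics: more correct predictions, a larger share of the cache saved, and a larger speed-up over full attention.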

Note that we first report full attention inference results for all models, then present results with the AnSAN method (+AnSAN) applied, compressing sequence information into anchor tokens.