Semantic Self-Consistency: Enhancing Language Model Reasoning via Semantic Weighting

Abstract

While large language models (LLMs) have rapidly improved their performance on a broad range of tasks, they still often fall short on reasoning tasks. As LLMs become more integrated into diverse real-world applications, advancing their reasoning capabilities is crucial to their effectiveness on nuanced, complex problems.

Wang et al.'s self-consistency framework shows that sampling multiple rationales and taking a majority vote over their answers reliably improves model performance across various closed-answer reasoning tasks. Standard methods built on this framework aggregate only the final decisions of these rationales and ignore the detailed step-by-step reasoning paths that produced them.

Our work enhances this approach by incorporating and analyzing both the reasoning paths of these rationales and their final decisions before taking a majority vote. These methods not only improve the reliability of the selected reasoning paths but also yield more robust performance on complex reasoning tasks.

Simplified Approach

Methodology

Centroid Proximity Weighting

Here we map each response to an embedding vector, compute the centroid of all response embeddings, and measure each vector's distance from that centroid. Vectors closer to the centroid receive higher weights, and the total weight accumulated by each candidate answer determines the final prediction.

Dataset      Method/Metric   Llama 2 7B      Mistral 7B      GPT-3.5         Llama 3 8B      GPT-4o mini
AQuA-RAT     SC baseline     24.80           25.60           59.40           45.28           83.07
             CPW             24.60 (-0.20)   29.00 (+3.40)   68.00 (+8.60)   46.06 (+0.78)   82.68 (-0.39)
SVAMP        SC baseline     46.50           68.50           79.80           73.33           89.80
             CPW             47.40 (+0.90)   69.80 (+1.30)   81.00 (+1.20)   74.67 (+1.34)   89.60 (-0.20)
StrategyQA   SC baseline     48.91           67.98           66.81           63.32           79.18
             CPW             55.02 (+6.11)   60.70 (-7.28)   65.21 (-1.60)   63.32 (+0.00)   73.80 (-5.38)
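The centroid proximity weighting step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the responses have already been embedded by some sentence encoder, and the function name and the inverse-distance weighting form are our own choices.

```python
import numpy as np

def centroid_proximity_weighting(embeddings, answers):
    """Pick the answer whose responses lie closest to the embedding centroid.

    embeddings: (n, d) array-like of response embeddings (from any encoder).
    answers:    list of n final answers extracted from the responses.
    """
    X = np.asarray(embeddings, dtype=float)
    centroid = X.mean(axis=0)
    dists = np.linalg.norm(X - centroid, axis=1)
    # Closer to the centroid -> higher weight; epsilon guards division by zero.
    weights = 1.0 / (dists + 1e-8)
    totals = {}
    for ans, w in zip(answers, weights):
        totals[ans] = totals.get(ans, 0.0) + w
    return max(totals, key=totals.get)
```

An outlier rationale far from the centroid thus contributes little weight, even if several such outliers happen to agree on a final answer.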

Semantic Consensus Weighting

We weigh responses by comparing their embeddings with cosine similarity. For each embedding, we compute its cosine similarity with every other embedding and sum the scores; the candidate answer with the highest aggregate score, i.e., the strongest semantic consensus, is selected.

Dataset      Method/Metric   Llama 2 7B      Mistral 7B      GPT-3.5         Llama 3 8B      GPT-4o mini
AQuA-RAT     SC baseline     24.80           25.60           59.40           45.28           83.07
             SCW             25.00 (+0.20)   29.80 (+4.20)   65.40 (+6.00)   47.48 (+2.20)   86.18 (+3.11)
SVAMP        SC baseline     46.50           68.50           79.80           73.33           89.80
             SCW             46.90 (+0.40)   70.20 (+1.70)   80.30 (+0.50)   73.00 (-0.33)   92.38 (+2.98)
StrategyQA   SC baseline     48.91           67.98           66.81           63.32           79.18
             SCW             62.44 (+13.53)  65.35 (-2.63)   74.70 (+7.89)   71.47 (+8.15)   79.68 (+0.50)
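The pairwise cosine-similarity scoring described above can be sketched as below. Again a minimal illustration under the same assumptions: embeddings come from an external encoder, and the function name and the choice to exclude self-similarity are ours.

```python
import numpy as np

def semantic_consensus_weighting(embeddings, answers):
    """Pick the answer whose responses share the most semantic consensus.

    embeddings: (n, d) array-like of response embeddings (from any encoder).
    answers:    list of n final answers extracted from the responses.
    """
    X = np.asarray(embeddings, dtype=float)
    # L2-normalize rows so dot products equal cosine similarities.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, 0.0)  # exclude each response's self-similarity
    scores = sims.sum(axis=1)    # summed similarity to all other responses
    totals = {}
    for ans, s in zip(answers, scores):
        totals[ans] = totals.get(ans, 0.0) + s
    return max(totals, key=totals.get)
```

Unlike centroid proximity, this score rewards a response for agreeing with every other response individually, so a tight cluster of mutually similar rationales dominates the vote.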

Related Work

Several prior works investigate principles similar to ours.

Large Language Models Can Self-Improve proposes a self-consistency fine-tuning method on self-generated rationale-augmented outputs, enhancing reasoning capabilities in unsupervised contexts.

Improving Retrieval-Augmented Large Language Models via Data Importance Learning introduces a reweighting algorithm using multilinear extensions to evaluate retrieved data relevance, thus improving performance without additional model training.

Self-Influence Guided Data Reweighting for Language Model Pre-training presents a method that assigns weights based on self-influence scores during pretraining, focusing on higher-quality samples to optimize model training.

Importance Weighting Can Help Large Language Models Self-Improve proposes a novel DS weight metric to filter self-generated samples with high distribution shift, enhancing self-improvement in reasoning tasks without extensive external supervision.

BibTeX

@article{knappe2024enhancinglanguagemodelreasoning,
      title={Enhancing Language Model Reasoning via Weighted Reasoning in Self-Consistency}, 
      author={Tim Knappe and Ryan Li and Ayush Chauhan and Kaylee Chhua and Kevin Zhu and Sean O'Brien},
      year={2024},
      eprint={2410.07839},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.07839}, 
}