TY - GEN
T1 - Self-Consistency of Large Language Models under Ambiguity
AU - Bartsch, Henning
AU - Jorgensen, Ole
AU - Rosati, Domenic
AU - Hoelscher-Obermaier, Jason
AU - Pfau, Jacob
N1 - Publisher Copyright:
©2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - Large language models (LLMs) that do not give consistent answers across contexts are problematic when used for tasks with expectations of consistency, e.g. question answering, explanations, etc. Our work presents an evaluation benchmark for self-consistency in cases of under-specification where two or more answers can be correct. We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task. We find that average consistency ranges from 67% to 82%, far higher than would be predicted if a model's consistency were random, and increases as model capability improves. Furthermore, we show that models tend to maintain self-consistency across a series of robustness checks, including prompting speaker changes and sequence length changes. These results suggest that self-consistency arises as an emergent capability without specifically training for it. Despite this, we find that models are uncalibrated when judging their own consistency, displaying both over- and under-confidence. We also propose a nonparametric test for determining from token output distribution whether a model assigns non-trivial probability to alternative answers. Using this test, we find that despite increases in self-consistency, models usually place significant weight on alternative, inconsistent answers. This distribution of probability mass provides evidence that even highly self-consistent models internally compute multiple possible responses.
UR - http://www.scopus.com/inward/record.url?scp=85184814692&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85184814692
T3 - BlackboxNLP 2023 - Analyzing and Interpreting Neural Networks for NLP, Proceedings of the 6th Workshop
SP - 89
EP - 105
BT - BlackboxNLP 2023 - Analyzing and Interpreting Neural Networks for NLP, Proceedings of the 6th Workshop
A2 - Belinkov, Yonatan
A2 - Hao, Sophie
A2 - Jumelet, Jaap
A2 - Kim, Najoung
A2 - McCarthy, Arya
A2 - Mohebbi, Hosein
PB - Association for Computational Linguistics (ACL)
T2 - 6th Workshop on Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP 2023
Y2 - 7 December 2023
ER -