A significant topic of discussion in contemporary artificial intelligence research is whether large language models (LLMs), such as ChatGPT, possess genuine reasoning abilities. This article examines recent research that contributes to this debate without taking a definitive stance on the correctness of either position.
For readers new to the topic, we encourage you to consult Part 1: “How Confident Should You Be in AI Reasoning Abilities?” This preceding article explores the initial discussions surrounding the reasoning abilities of AI models, the challenges they face in tasks that are simple for humans, and practical solutions to minimize AI reasoning errors.
Origins of the Debate
To begin with, how did this debate arise? It intensified after OpenAI released GPT-4o (omni) in May 2024, described as “capable of reasoning in real-time over audio, visual input, and text.” Subsequently, the o1 model was introduced, purportedly capable of “complex reasoning” and achieving record accuracy on benchmarks that rely heavily on reasoning. Users of o1 may have noticed a brief pause before the model generates its output; during this interval, the AI appears to “reason” by producing multiple intermediate results that are eventually combined into a final response.
According to some users, a “moment of engagement” arises when the model takes longer to come up with an answer: the pause suggests that a good question has been asked, which can improve the user experience from a marketing perspective. Users may unconsciously feel appreciated, or even complimented, when the AI model seems to think longer, as if it were reflecting more deeply on their question.
However, other researchers argue that LLMs, including advanced models like GPT-4o and GPT-o1, do not engage in abstract reasoning. They assert that the success of these models is partially due to their ability to recognize and reproduce reasoning patterns learned during training. This reliance on previously learned patterns may limit their capacity to solve problems that significantly deviate from their training data.
Actual Reasoning versus Apparent Reasoning Behavior
Does it matter whether LLMs are performing “actual reasoning” versus behavior that looks like reasoning?
On the one hand, if LLMs have developed robust, generally applicable reasoning skills, this strengthens the claim that such systems are an important step toward reliable general intelligence. It would mean that they can not only respond to known patterns but also handle new, unseen problems in a way comparable to human reasoning. For the academic world, this could offer support for how we conduct research, gather knowledge, and disseminate it.
On the other hand, critics suggest that LLMs rely mainly on memorization and pattern recognition rather than genuine reasoning. They argue that these models, however impressive their output, are limited in their ability to generalize to “out-of-distribution” tasks: tasks that differ significantly from what they have seen during training. This would mean that their applicability in academic contexts is limited, especially when it comes to tackling entirely new or complex research problems that require logical reasoning and critical thinking.
The study discussed in this article assesses the robustness of LLMs’ reasoning abilities by introducing superficial variations to tasks on which these models perform well. These variations do not alter the underlying reasoning required but are less likely to have been encountered in the training data. The objective is to determine whether the models are capable of abstract reasoning or are simply reproducing patterns learned during training.
The Paper “Embers of Autoregression”
This study examines whether the training method of LLMs, predicting the next token in a sequence (known as autoregression), has lasting effects (“embers”) on their problem-solving abilities. The authors ask to what extent this method shapes the biases and limitations the model displays when solving tasks.
Explanation of Jargon
Embers: Refers to the lasting influences or remnants of the autoregressive training method on the model’s behavior in problem-solving. In other words: does the LLM simply remember its training data very well and retrieve it effectively each time?
Sequence: A series of elements, such as words or letters, in a specific order. In language models, this often refers to a sentence or a series of words.
Autoregression: A training method where the model is trained to predict the next element in a sequence based on the preceding elements. For language models, this means predicting the next word in a sentence, given the previous words. An everyday example is the word suggestion that appears while you type a message in WhatsApp: the app predicts the word most likely to come next, which is autoregression in action.
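To make this concrete, here is a minimal Python sketch of autoregressive prediction. It uses a toy bigram count over three invented sentences rather than a real language model; the miniature corpus and the function names are purely illustrative.

```python
from collections import defaultdict

# Toy "training corpus" of three sentences; a real LLM is trained on billions.
corpus = [
    "this decision was influenced by the political climate of the time",
    "the political climate of the country changed over time",
    "the decision was influenced by public opinion",
]

# Count how often each word follows each other word (a bigram table).
next_word_counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        next_word_counts[current][following] += 1

def predict_next(word):
    """Return the statistically most likely next word, WhatsApp-style."""
    candidates = next_word_counts.get(word)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Autoregression: each predicted word is fed back in to predict the one after it.
word = "the"
generated = [word]
for _ in range(5):
    word = predict_next(word)
    if word is None:
        break
    generated.append(word)
print(" ".join(generated))  # -> "the political climate of the political"
```

The last line shows the essential point: the model always continues with whatever is statistically most likely given what it has seen, not with what is logically required.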
Experiment: Reversing Word Sequences
The researchers presented two sentences and asked the model to reverse the order of the words. Before discussing the results, I invite you to read the sentences yourself and perform the task: Reverse the order of the words in each sentence so that you get the original sentence.
- “time. the of climate political the by influenced was decision This”
- “letter. sons, may another also be there with Yet”
If you reverse the first sentence, it results in a coherent and understandable sentence:
“This decision was influenced by the political climate of the time.”
But when you reverse the second sentence:
“Yet with there be also another may sons, letter.”
the result is an incoherent string of words. The procedure, however, is exactly the same in both cases, so for a human the task is no harder or easier when the resulting output happens to be nonsensical.
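To underline how mechanical the procedure is, here is a minimal Python sketch of the reversal task; the exact same code handles both inputs and has no notion of whether the result is meaningful.

```python
def reverse_word_order(sentence: str) -> str:
    """Reverse the order of the words; the algorithm ignores meaning entirely."""
    return " ".join(reversed(sentence.split()))

# The identical procedure is applied to both inputs from the experiment.
print(reverse_word_order("time. the of climate political the by influenced was decision This"))
# -> "This decision was influenced by the political climate of the time."
print(reverse_word_order("letter. sons, may another also be there with Yet"))
# -> "Yet with there be also another may sons, letter."
```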
Although both sentences require the same underlying task, reversing the word order, GPT-4 performed significantly better on the first sentence than on the second. This is because the reversed version of the first sentence yields a coherent and common sentence (and therefore more likely to occur in the training data), whereas this is not the case with the second. This phenomenon is referred to as “sensitivity to output probability.” The model tends to perform better when the output has more frequently appeared in the training data.
The authors evaluated GPT-4 on this task using various word sequences and discovered that GPT-4 achieved 97% accuracy (percentage of correctly reversed sequences) for high-probability sequences, compared to 53% accuracy for low-probability sequences.
Experiment: Shift Ciphers
One of the tasks employed by the authors to study the sensitivities of Large Language Models (LLMs) is the decoding of shift ciphers. A shift cipher is a simple encryption method where each letter in a text is shifted by a specific number of positions in the alphabet. For example, with a shift of two, the word “jazz” becomes “lcbb” (where shifting “z” wraps around to the beginning of the alphabet). Shift ciphers are often denoted as “ROT-n,” where n represents the number of positions each letter is rotated.
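For readers who want to see how mechanical the operation is, here is a small Python sketch of a generic ROT-n cipher (the parameter name and the examples are ours); decoding is simply shifting again by 26 minus n.

```python
def rot_n(text: str, n: int) -> str:
    """Shift every letter n positions forward in the alphabet, wrapping around."""
    shifted = []
    for ch in text:
        if ch.islower():
            shifted.append(chr((ord(ch) - ord("a") + n) % 26 + ord("a")))
        elif ch.isupper():
            shifted.append(chr((ord(ch) - ord("A") + n) % 26 + ord("A")))
        else:
            shifted.append(ch)  # leave spaces, digits, and punctuation untouched
    return "".join(shifted)

print(rot_n("jazz", 2))       # -> "lcbb"
print(rot_n("lcbb", 26 - 2))  # decoding: shift back again -> "jazz"
```

Note that the code is identical for every value of n; only a single parameter changes. This is precisely why the authors argue that differences in model performance across shifts cannot be explained by differences in task complexity.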
The experiments underscore that LLMs are sensitive to input and output probabilities, as well as to task frequency.
Notably, LLMs performed markedly better on ROT-13 than on other shift ciphers in these experiments. This can be attributed to task frequency: ROT-13 is widely used, particularly on online forums to conceal spoilers, so it is well represented in the textual data on which LLMs are trained. In contrast, less common shifts, such as ROT-9, appear far less frequently in training data, so LLMs have less experience with these ciphers and achieve lower accuracy.
Furthermore, the authors assert that the complexity of the different shifts is not the cause of this disparity in performance: ROT-1, ROT-3, and ROT-13 all require the same fundamental skills, counting and shifting within the alphabet. Yet GPT-4 performed far better on ROT-13 than on these other shifts, even though a shift of thirteen positions is, if anything, more work than a shift of one or three. The explanation lies in the frequency with which the different ciphers appear in the training data: LLMs are sensitive to this task frequency and learn to perform better on tasks they have encountered more often during training.
Understanding the Sensitivity
LLMs are trained on vast amounts of text data, learning statistical patterns and frequencies of word usage, phrases, and structures. As indicated, LLMs are not only sensitive to task frequency but also to the probability of the input and output. This means they perform better when both the input text and the required output text frequently occur in the data on which they were trained.
The underlying reason is that LLMs are essentially statistical systems. They learn patterns and relationships from the enormous volumes of text processed during training. These patterns include not only the structure of language but also the frequency with which certain words, phrases, and even entire tasks occur. When an LLM performs a task, such as translating text or answering a question, it relies on the statistical patterns it has learned. If the input or the expected output has a low probability, meaning it rarely occurs in the training data, the model has fewer cues to perform the task correctly. LLMs tend to generate outputs that are statistically likely based on their training data, even if the task requires a deterministic output. With rare or unusual inputs, performance declines. For instance, GPT-4 achieved 21% accuracy when encoding high-probability sentences using ROT-13 but only 11% with low-probability sentences.
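As a rough intuition, the sketch below scores the coherent and the scrambled sentence from the reversal experiment with a toy bigram count, a crude stand-in for the far richer statistics a real LLM learns; the smoothing constant and the single-sentence “corpus” are our own simplifications. The fluent sentence receives a much higher score, which is exactly the pull that can drag a probability-driven system away from a deterministic answer.

```python
import math
from collections import Counter

# Toy stand-in for training statistics: bigram counts from one familiar sentence.
training_text = "this decision was influenced by the political climate of the time"
words = training_text.split()
bigram_counts = Counter(zip(words, words[1:]))
vocab_size = len(set(words))

def log_score(sentence: str) -> float:
    """Sum of add-one-smoothed log bigram frequencies; higher = more 'expected'."""
    tokens = sentence.lower().replace(".", "").split()
    return sum(
        math.log(bigram_counts[(a, b)] + 1) - math.log(vocab_size + 1)
        for a, b in zip(tokens, tokens[1:])
    )

fluent = "This decision was influenced by the political climate of the time."
scrambled = "time. the of climate political the by influenced was decision This"
print(f"fluent:    {log_score(fluent):.1f}")     # roughly -17
print(f"scrambled: {log_score(scrambled):.1f}")  # roughly -24
```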
Comparison with Human Performance: It is important to note that this sensitivity to probability marks a significant difference between LLMs and humans. A person who can decode ROT-13 can likely decode other ROT variants with the same accuracy, as the underlying logic remains consistent. LLMs, however, rely heavily on the statistical patterns present in their training data and perform less effectively when faced with less frequent or unlikely inputs or outputs.
Conclusion
The sensitivity of LLMs to input and output probability is a direct consequence of their statistical nature. This sensitivity has important implications for the applications of LLMs and highlights the necessity of considering task frequency and the probability of inputs and outputs when evaluating and deploying these systems. In summary, “Embers of Autoregression” functions as a kind of “evolutionary psychology” for LLMs: it demonstrates that the way these models are trained leaves lasting imprints, embers, in the biases they exhibit when solving problems.