
Shared content may not reflect the policies of Tilburg University on the use of AI. 

How Confident Should You Be in AI Reasoning Abilities?

The world of AI has made enormous progress in recent years, and the implications for education are becoming increasingly clear. At Tilburg.ai, we work daily to make sure this integration proceeds as responsibly as possible. Our chatbot is built separately for each course, using that course's own materials, so that its answers stay at an academically sound level.

As a ChatGPT user, however, it is easy to feel like you are talking to an all-knowing model: an ideal tool that helps you think through your ideas, understand logic, and make connections. With the rise of new models such as OpenAI's o1, promoted by its makers for its advanced reasoning capabilities, the question arises: can a model really reason like we do? Let's explore the current state of affairs.

Reasoning Ability of AI Models

AI technology has made enormous progress in recent years, resulting in impressive achievements such as designing new proteins, detecting tumors at an early stage, and predicting complex weather patterns. These applications show how powerful and versatile AI can be at solving problems that are difficult for humans to comprehend.

Yet it is interesting to see that there are tasks that AI systems find easy but are difficult for humans, and vice versa. AI models, for example, can analyze vast amounts of data and recognize patterns that remain invisible to the human brain. On the other hand, they struggle with simple tasks that are self-evident for humans.

For example, it took language models a long time to answer simple counting questions correctly. A question like "How many 'r's are in the word 'strawberry'?" can trip up an AI model, while for humans it is a matter of simply counting. Likewise, in games like tic-tac-toe, AIs sometimes struggle to determine the best move in a given position, yet when you ask them to program the game, they produce a very strong player. Basic arithmetic, too, often goes wrong, with models sometimes giving surprisingly incorrect answers to elementary mathematical questions.
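
To make that contrast concrete: below is a minimal sketch of the kind of tic-tac-toe player an LLM can write when asked to program the game, even though it may stumble when asked to pick the best move directly. This is our own illustrative minimax implementation, not output from any particular model.

Python
# A minimal minimax player for tic-tac-toe.
# The board is a list of 9 cells containing 'X', 'O', or ' ' (empty).
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, move) from X's perspective: +1 X wins, -1 O wins, 0 draw."""
    win = winner(board)
    if win == 'X':
        return 1, None
    if win == 'O':
        return -1, None
    if ' ' not in board:
        return 0, None  # draw

    results = []
    for i, cell in enumerate(board):
        if cell == ' ':
            board[i] = player
            score, _ = minimax(board, 'O' if player == 'X' else 'X')
            board[i] = ' '
            results.append((score, i))
    return max(results) if player == 'X' else min(results)

# X has two in a row at the top; minimax finds the immediate winning move (cell 2).
score, move = minimax(['X', 'X', ' ', 'O', 'O', ' ', ' ', ' ', ' '], 'X')
print(score, move)  # 1 2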

The Strawberry Problem

A surprisingly difficult challenge for Large Language Models (LLMs) is how they segment words, especially when counting specific letters or characters, such as the number of times the letter 'r' occurs in "strawberry".

In LLMs, words are split into smaller units called tokens, because some words have complex structures or consist of multiple parts. In this case, "strawberry" is made up of three tokens, meaning the model splits the word into three pieces (for example "st", "raw", and "berry", as one possible tokenization). This makes it harder for the model to identify the exact number of occurrences of a specific letter, such as the 'r', because the segmentation scatters that letter across different pieces.

Unlike humans, who see a word as a whole and can simply count its letters, an LLM works with these split tokens, which makes it surprisingly hard for the model to count a letter like the 'r' across the entire word.

This is how ChatGPT-4 “reads” the word strawberry.
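
You can inspect this token view yourself with a tokenizer library such as tiktoken. As an assumption for illustration, we use its cl100k_base encoding as a stand-in for the tokenizer behind GPT-4-class models; the exact pieces you get may differ from the "st / raw / berry" example above.

Python
# Sketch: how a GPT-style tokenizer segments "strawberry" (requires: pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("strawberry")
pieces = [enc.decode([t]) for t in token_ids]
print(pieces)  # a handful of sub-word pieces -- this is what the model "sees", not letters

# Counting letters is trivial once you operate on the string itself:
print("strawberry".count("r"))  # 3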

How can we solve this?

Instead of relying on the model's internal tokenization, we can play to its strengths: ask it to write a short script that iterates directly over the characters of a word. In this way, we can accurately count how often a specific letter occurs. This is a good example of how you can combine AI with traditional programming to work around the limitations of LLMs.

Write a Python script with two parameters: a letter and a word. The script should count how many times the letter occurs in the word. For example, count how many ‘r’s appear in the word ‘strawberry’. Also, state the result of the example.

The AI gives us the following response:

Python
# Here's the Python script that counts how many times a letter occurs in a word:
def count_letter_in_word(letter, word):
    return word.count(letter)

# Example: Counting how many 'r's appear in 'strawberry'
letter = 'r'
word = 'strawberry'
count = count_letter_in_word(letter, word)
print(count)

## Result 
3

Bingo! This example shows that, although LLMs are powerful for many complex language analyses, they are not always suitable for simple operations like counting individual letters in a word, due to their reliance on tokenization.

Pattern Recognition vs. Logical Reasoning in AI Models

Recent research by Farajtabar and his colleagues investigated whether Large Language Models (LLMs) truly exhibit reasoning skills, or whether they are merely performing advanced pattern recognition. They found that models have made significant progress in recent years on mathematical reasoning benchmarks such as GSM8K: smaller models achieved accuracy rates of over 85%, while larger models reached over 95%. However, according to the authors, these improvements may not reflect an actual improvement in the models' logical or symbolic reasoning abilities. The researchers suggest that this progress might be superficial and raise the question of whether LLMs can truly reason, or whether they are merely making predictions based on patterns in the data.

Their experiments show that LLMs display significant variations in performance, even on the same tasks, depending on small changes in the input. For instance, the performance of models such as Llama can vary significantly (by as much as 10-15%) when names or numbers in the questions are altered. This vulnerability suggests that the models do not fully understand the underlying structure of the problems but rely heavily on specific patterns in the data on which they have been trained.

This probabilistic pattern recognition hypothesis suggests that LLMs do not engage in genuine logical reasoning but instead depend primarily on matching patterns seen in their training data. This means that, while LLMs can appear to solve complex tasks like mathematical problems, they do so by identifying and replicating familiar patterns rather than understanding the underlying concepts. Evidence supporting this hypothesis comes from various observations, such as the significant performance variations when numerical values in questions are changed, their declining accuracy as problem complexity increases, and their struggle to distinguish between relevant and irrelevant information. For example, adding seemingly related but inconsequential details to a problem often leads to a sharp drop in performance, indicating that LLMs lack a deep understanding of the problem’s context.

Probabilistic pattern recognition: refers to the process by which LLMs make predictions by recognizing statistical relationships between inputs based on training data, rather than through symbolic logic.

The Experiment

The experiment makes use of two datasets, GSM8K and GSM-Symbolic, to evaluate the reasoning capabilities of large language models (LLMs).

  • GSM8K is an existing dataset with over 8,000 grade-school-level math questions, frequently used to test the mathematical abilities of LLMs.
  • GSM-Symbolic, derived from GSM8K, uses symbolic templates to generate various question variants. This allows researchers to adjust the difficulty of the questions by modifying elements like names and numerical values (see the sketch below).

Symbolic templates: refer to question templates with variable elements, such as names and numbers, allowing researchers to generate different versions of the same question and adjust complexity.
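
To make the idea of symbolic templates concrete, here is a minimal sketch of how one question template can be instantiated with different names and numbers. This is our own illustration of the principle, not the authors' code or their actual templates.

Python
import random

# Illustrative GSM-Symbolic-style template: the wording is fixed,
# while the name and the numbers are variable slots.
TEMPLATE = ("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
            "How many apples does {name} have in total?")

def instantiate(seed):
    rng = random.Random(seed)
    name = rng.choice(["Sophie", "Liam", "Ava", "Noah"])
    x, y = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(name=name, x=x, y=y)
    answer = x + y  # ground truth is known because we generated the numbers ourselves
    return question, answer

for seed in range(3):
    question, answer = instantiate(seed)
    print(question, "->", answer)

Because the correct answer is computed from the sampled values, a model can be scored on many variants of the "same" question; it is exactly across such variants that the paper reports noticeable swings in accuracy.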

In the experiments, the researchers evaluated the performance of several LLMs on both GSM8K and GSM-Symbolic. The results revealed that performance on GSM-Symbolic was significantly lower than on GSM8K, suggesting that the models rely mainly on patterns found in their training data rather than on a deeper understanding of the underlying mathematical concepts. Furthermore, model performance declined as the complexity of the GSM-Symbolic questions increased, which the researchers achieved by adding more clauses to the questions.

In addition to GSM-Symbolic, the researchers introduced a third dataset, GSM-NoOp, which features questions containing seemingly relevant but ultimately irrelevant information. The performance of all tested LLMs dropped, indicating that these models do not possess a deep understanding of the problem context and tend to apply learned rules blindly.

Example

Let's look at one example from the third dataset, GSM-NoOp, in which irrelevant information is added to the prompt. The prompt asks how much Liam pays for school supplies, where the total cost has to be calculated while accounting for inflation. The question contains multiple variables, such as the prices of erasers, bond paper, and notebooks. To solve the problem step by step, few-shot prompting is combined with chain-of-thought prompting. Few-shot prompting provides worked question-and-answer examples to achieve consistency across different mathematical problems; combining this with chain-of-thought prompting improves the output by encouraging the model to work out its reasoning step by step.

[Image: the GSM-NoOp example prompt and the model's step-by-step response]

Irrelevant information and model behavior

The goal of chain-of-thought prompting is to force models to make the intermediate steps of their reasoning explicit. However, as shown in the image, a model can still drag irrelevant information into its reasoning. In this specific case, the model does not properly grasp what "now" and "last year" mean, probably because it relies on patterns from its training data instead of actual reasoning. This highlights the vulnerability of pattern recognition, where even small changes, such as swapping a name or adding an extra detail, can drastically affect the result.
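
To see why an irrelevant clause is so treacherous, here is a simplified, hypothetical example of the same failure mode (not the prompt from the image; the question and numbers are invented for illustration): the extra detail should not change the answer, yet a pattern-matching model often "uses" it anyway.

Python
# Hypothetical GSM-NoOp-style question, with invented numbers:
# "Oliver picks 44 kiwis on Friday and 58 on Saturday. On Sunday he picks twice as many
#  as on Friday, but five of them are a bit smaller than average. How many kiwis does he have?"
friday, saturday = 44, 58
sunday = 2 * friday            # "twice as many as on Friday"
smaller_than_average = 5       # irrelevant detail: smaller kiwis are still kiwis

correct_total = friday + saturday + sunday
distracted_total = correct_total - smaller_than_average  # what a pattern-matcher tends to do

print(correct_total)     # 190
print(distracted_total)  # 185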

In short, the researchers have tried to demonstrate that these models use pattern recognition instead of true reasoning. Although they sometimes seem to understand what they are doing, for example by manipulating symbols or following steps in a calculation, they actually do this by guessing based on what they have learned before. This becomes apparent especially when you make small changes to the question, such as adjusting a name or number, which often leads to a completely different answer. This is because the model relies on the exact words and numbers it has seen during its training, rather than reasoning logically. All this indicates that LLMs, despite their impressive performances, are mainly good at recognizing patterns in data and less at truly understanding and reasoning as humans do.

The Sensitivity of LLMs to Irrelevant Information and Its Implications

The experiments involving the GSM-NoOp dataset, as described in the paper, demonstrate that LLMs are highly sensitive to irrelevant information. This is particularly useful to keep in the back of your mind when you use LLMs to explore unfamiliar topics.

While LLMs can perform complex tasks, according to the paper they primarily rely on pattern recognition and lack true understanding or logical reasoning abilities. This means an LLM can present convincingly incorrect information if it has mistakenly incorporated irrelevant details into its analysis. That is a critical issue when you use LLMs for assignments, dissertations, or other academic work: someone who is new to a field, or not familiar with the subject, might accept an LLM's output without sufficient scrutiny, simply because it appears confident and logical.

The Problem with New Knowledge

When investigating a new topic, you are less familiar with its nuances and details, making it easier to take an LLM’s output at face value. The LLM might sound highly confident and present what seems like the “correct” steps and reasoning, even though it has used irrelevant information inappropriately.

Practical Solutions to Minimize AI Reasoning Errors

  • Chain-of-Thought Prompting: Always request "chain-of-thought" prompting when using an LLM to solve complex problems or derive formulas. This means asking the LLM to reason step by step and display all the steps leading to the conclusion. It makes it easier to trace the LLM's reasoning and verify whether irrelevant information has influenced the output (see the sketch below).
  • Verification and Control: Never blindly accept the output of an LLM. Always verify whether the information used is relevant and whether the reasoning makes sense. Use reliable sources and your own expertise for this purpose.
  • Stay Critical: LLMs are powerful tools, but they are not perfect. Keep thinking critically and do not rely too heavily on the output of an LLM, particularly when dealing with new or complex information.

Chain-of-Thought Prompting: A method of prompting where the LLM is instructed to break down its reasoning process into individual steps. This helps users follow the logic and detect potential issues with the reasoning.
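
As a concrete illustration of the first two tips, the sketch below shows one way to phrase a chain-of-thought request and then verify the final number yourself with plain Python instead of trusting the model's arithmetic. The prompt wording and the example question are our own; adapt them to whichever LLM interface you use.

Python
# Sketch: a chain-of-thought style prompt plus an independent check of the result.
QUESTION = ("A notebook costs 3.50 euros and a pen costs 1.20 euros. "
            "What do 4 notebooks and 2 pens cost in total?")

cot_prompt = (
    f"{QUESTION}\n\n"
    "Reason step by step. Put every intermediate calculation on its own line, "
    "state which given numbers you used (and which you ignored), "
    "and end with a line of the form 'Answer: <number>'."
)

def verify(answer_from_llm: float) -> bool:
    """Recompute the result independently instead of trusting the model's arithmetic."""
    expected = 4 * 3.50 + 2 * 1.20
    return abs(answer_from_llm - expected) < 1e-6

print(cot_prompt)
print(verify(16.40))  # True  -> 4 * 3.50 + 2 * 1.20 = 16.40
print(verify(15.20))  # False -> a confident-sounding but wrong answer would be caught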


Personas of AI Models and Their Influence on Reasoning

AI models like GPT-4 are designed to assist users by generating responses based on the prompts you provide. This means you can ask the AI to respond in a specific tone, such as uplifting or critical, and the model will adapt to that role to "please" you. For example, an uplifting persona might provide positive feedback, while a critical persona could offer a more stringent assessment of your work. However, it is crucial to understand that the AI, regardless of its response style, has no conscious understanding of the actual quality of your work.
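
In practice, a persona is usually assigned through a system-style instruction that precedes your actual question, as in the common chat "messages" format sketched below. The wording is our own, and the underlying model is identical in both cases.

Python
# Sketch: two personas applied to the same draft via a system-style instruction.
draft = "My thesis introduction (first draft): ..."

uplifting_feedback_request = [
    {"role": "system", "content": "You are an encouraging writing coach. Highlight strengths first."},
    {"role": "user", "content": f"Give feedback on this draft:\n{draft}"},
]

critical_feedback_request = [
    {"role": "system", "content": "You are a strict thesis supervisor. Point out every weakness."},
    {"role": "user", "content": f"Give feedback on this draft:\n{draft}"},
]

print(uplifting_feedback_request[0]["content"])
print(critical_feedback_request[0]["content"])
# Same model, same draft: only the tone of the feedback changes,
# not the quality of the underlying assessment.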

Assigning personas to AI models presents several risks that make users more vulnerable to misleading or incorrect output:

Influence on Pattern Recognition: LLMs are trained on vast amounts of text data and learn to recognize patterns in order to generate coherent text. Assigning a persona steers the LLM towards the specific patterns associated with that persona. This can lead to stylistically consistent output that aligns with the chosen persona, but it does not change the model's reasoning capabilities. Instead, the LLM continues, as we have seen above, to operate on probabilistic pattern recognition rather than on a deeper logical understanding of the content.

While persona prompting can create a specific context for the LLM, this context is limited to the information provided in the prompt. The LLM does not truly understand the persona or the broader implications it might have for reasoning tasks. Therefore, the use of a persona can create an illusion of depth or expertise that the model does not actually possess.

This is why AI answers often sound convincing and confident: the model is trained to recognize patterns and generate text fluently. However, this does not mean that the answers are factually correct. AI is built to provide answers that fit the context of the question, regardless of whether those answers are always true or logical. The model has no intrinsic understanding of what is "correct" but uses previously seen data and patterns to give the most fitting answer possible. It is therefore essential to always remain critical, particularly when using personas, as the model's persuasive tone can mask underlying errors in logic or factual accuracy.

Conclusion

In short, AI models have achieved impressive results, ranging from passing complex exams to solving scientific problems such as unraveling protein structures. Although they often appear confident and convincing during use, this does not always mean that their answers are factually correct. AI models are trained on pattern recognition and generate answers that seem logical, without actually understanding what they are saying.

However, this does not mean that they are not enormously helpful. “There are hundreds of billions of connections between these artificial neurons, some of which are activated many times during the processing of a single piece of text, in such a way that any attempt to accurately explain the behavior of an LLM is doomed to be too complex for any human understanding.” Despite this complexity, AI offers valuable opportunities in numerous fields.

References

Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., & Farajtabar, M. (2024). GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. arXiv preprint arXiv:2410.05229.