In Short:
A new study by Apple engineers finds that the mathematical reasoning of advanced AI models, including those from OpenAI and Google, is unreliable. By tweaking standard math problems, the researchers showed that these models struggle, with accuracy dropping by as much as 65 percent when inconsequential details were added. This suggests current models rely on pattern matching rather than genuine understanding, calling their reasoning abilities into question.
Companies such as OpenAI and Google have recently been promoting advanced “reasoning” capabilities in their latest artificial intelligence models as a significant advancement in the field. However, a new study conducted by a team of six engineers from Apple reveals that the mathematical reasoning exhibited by sophisticated large language models (LLMs) is often fragile and can be unreliable when faced with minor alterations to standard benchmark problems.
Research Findings
The fragility outlined in this new research supports earlier studies suggesting that LLMs rely on probabilistic pattern matching and lack the formal understanding of fundamental concepts necessary for robust mathematical reasoning. "Current LLMs are not capable of genuine logical reasoning," the researchers hypothesize; "instead, they attempt to replicate the reasoning steps observed in their training data."
Experiment Methodology
In the paper titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," currently available as a preprint, the Apple researchers started from GSM8K, a standardized collection of more than 8,000 grade-school-level mathematical word problems that is commonly used as a benchmark for the complex reasoning abilities of modern LLMs. They then took a novel approach, dynamically replacing certain names and numbers in this test set to generate fresh variants of each question. For example, a question about Sophie receiving 31 building blocks could become one about Bill receiving 19 building blocks.
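To make the templating idea concrete, here is a minimal Python sketch of that kind of name-and-number substitution. The template text, the name list, and the number ranges are illustrative assumptions for this article, not the paper's actual GSM-Symbolic templates.

```python
import random

# Illustrative sketch of the GSM-Symbolic idea: turn a static
# GSM8K-style question into a template, then sample fresh names
# and numbers to produce a new variant of the same problem.
TEMPLATE = (
    "{name} receives {blocks} building blocks and gives away {given}. "
    "How many building blocks does {name} have left?"
)

NAMES = ["Sophie", "Bill", "Maria", "Ahmed"]  # hypothetical name pool

def generate_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate the template and return (question, ground-truth answer)."""
    name = rng.choice(NAMES)
    blocks = rng.randint(10, 40)
    given = rng.randint(1, blocks - 1)  # keep the answer non-negative
    question = TEMPLATE.format(name=name, blocks=blocks, given=given)
    return question, blocks - given

rng = random.Random(0)
for _ in range(3):
    question, answer = generate_variant(rng)
    print(question, "->", answer)
```

Because each draw produces a fresh but structurally identical question, a model that genuinely reasons through the problem should score about the same on any draw.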
This method mitigates potential "data contamination" that can arise when static GSM8K questions have been included in an AI model's training data. Importantly, these surface-level changes do not alter the actual difficulty of the underlying math, which means that models should theoretically perform comparably on both GSM-Symbolic and GSM8K tests.
Performance Analysis
However, when the researchers evaluated more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy dropped across the board compared to GSM8K, with declines of between 0.3 percent and 9.2 percent depending on the model. They also observed high variance across 50 separate runs of GSM-Symbolic, with gaps of up to 15 percent in accuracy within a single model; changing the numbers in a question also tended to hurt accuracy more than changing the names.
The high variability present in GSM-Symbolic outcomes was unexpected, as the researchers emphasized that “the overall reasoning steps needed to solve a question remain the same.” Such variability suggests that these models rely on pattern matching rather than engaging in formal reasoning.
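To see what that kind of spread looks like in practice, the following sketch computes the mean accuracy and the max-min gap over a set of per-trial scores. The accuracy figures here are invented for illustration and are not results from the paper.

```python
import statistics

# Hypothetical per-trial accuracies (percent) for one model across
# repeated GSM-Symbolic draws; the values are made up to illustrate
# the kind of run-to-run spread the paper reports.
trial_accuracies = [88.1, 79.4, 85.0, 91.2, 76.8, 83.5]

mean_acc = statistics.mean(trial_accuracies)
spread = max(trial_accuracies) - min(trial_accuracies)

print(f"mean accuracy: {mean_acc:.1f}%")
print(f"max-min spread across trials: {spread:.1f} points")
```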
Implications of Findings
Despite these inconsistencies, overall accuracy on the GSM-Symbolic tests remained relatively high. OpenAI's ChatGPT-4o, for instance, slipped from 95.2 percent accuracy on GSM8K to a still-strong 94.9 percent on GSM-Symbolic. That is a high success rate on both benchmarks, regardless of questions about the model's use of formal reasoning, though accuracy for many models dropped sharply when the researchers added just one or two extra logical steps to the problems.
However, the LLMs performed notably worse when the researchers altered the GSM-Symbolic benchmark by adding "seemingly relevant but ultimately inconsequential details" to the questions. In this new set, referred to as GSM-NoOp (short for "no operation"), a question might include extraneous information, such as a note that "five of the kiwis were smaller than average."
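The sketch below shows roughly how such a "no-op" clause can be spliced into a question, using a paraphrase of the kiwi example discussed in the paper; the exact wording and the naive failure mode shown are illustrative assumptions.

```python
# Illustrative GSM-NoOp-style augmentation: insert a clause that looks
# relevant but does not change the arithmetic (paraphrased kiwi example).
base_question = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "On Sunday he picks double the number he picked on Friday. "
    "How many kiwis does Oliver have?"
)

noop_clause = "Five of the kiwis picked on Sunday were smaller than average. "

# Splice the inconsequential clause in just before the final question.
noop_question = base_question.replace("How many", noop_clause + "How many")

# The correct answer ignores the size detail entirely:
correct = 44 + 58 + 2 * 44       # 190
# A pattern-matching failure mode: subtracting the "smaller" kiwis:
pattern_matched = correct - 5    # 185 (wrong)

print(noop_question)
print("correct:", correct, "| naive subtraction:", pattern_matched)
```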
The inclusion of these misleading details led to what the researchers described as "catastrophic performance drops" in accuracy relative to GSM8K, with declines ranging from 17.5 percent to a remarkable 65.7 percent depending on the tested model. Such dramatic reductions in accuracy underscore the inherent limits of using simple pattern matching to convert statements into operations without truly understanding their meaning.