The Limitations of AI’s Causal Reasoning: A Multilingual Evaluation of LLMs

By Abigail Thornton, Research Labs Lead at Welo Data, a Welocalize, Inc. company. Welocalize, Inc. were finalists in the ‘Best Use of AI in NLP and Translation’, ‘Best Use of AI in Entertainment’, ‘Best Use of AI for Healthcare’ and ‘Best Use of AI in Legal Tech’ categories at The 2024 A.I. Awards.

Large language models (LLMs) are at the forefront of the current era of artificial intelligence (AI) advancements.

These models excel at generating human-like text and supporting complex decision making, but a key question remains: Can LLMs truly identify cause-and-effect relationships, or do they just demonstrate advanced pattern recognition capabilities?

Causal reasoning—the ability to understand how actions lead to specific outcomes—is essential in various fields where understanding the consequences of actions is paramount. Despite their impressive capabilities, LLMs still struggle with this task, especially across different languages. Welo Data’s Research Lab explores these limitations, highlighting gaps in how these models handle causal reasoning.

The Key Distinction: Causal Reasoning vs Pattern Recognition

To understand why LLMs fall short in causal reasoning, we must first understand the distinction between causal reasoning and pattern recognition.

Causal reasoning refers to the ability to identify and understand cause-and-effect relationships. For instance, consider the statement: “Over the summer, both drownings and ice cream sales increased.” A model with strong causal reasoning should recognize that the underlying cause is the season—summer—not that one event directly caused the other. It must avoid drawing a false causal link between increased ice cream sales and the rise in drownings.
Pattern recognition involves detecting and replicating recurring structures or trends in data. While LLMs are highly effective at identifying these patterns, they often struggle to grasp the underlying causal relationships—especially when confronted with novel or complex scenarios that go beyond surface-level correlations.
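The ice cream and drowning example can be made concrete with a small simulation. The sketch below is illustrative only: the numbers are invented, the units are arbitrary, and names such as `simulate` and `pearson` are hypothetical helpers rather than part of any Welo Data tooling. Both quantities are generated from the season alone, so any correlation between them comes from the shared cause.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(days=2000, seed=0):
    """Generate synthetic daily records as (is_summer, ice_cream, drownings).

    Assumption: all values are made up for illustration. Each quantity
    depends only on the season, never on the other quantity.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(days):
        is_summer = rng.random() < 0.5
        ice_cream = rng.gauss(200 if is_summer else 50, 20)
        drownings = rng.gauss(8 if is_summer else 2, 2)
        rows.append((is_summer, ice_cream, drownings))
    return rows


rows = simulate()
summer = [r for r in rows if r[0]]
other = [r for r in rows if not r[0]]

# Pooled over the whole year, the two series move together.
overall = pearson([r[1] for r in rows], [r[2] for r in rows])

# Holding the season fixed removes the shared cause.
within_summer = pearson([r[1] for r in summer], [r[2] for r in summer])
within_other = pearson([r[1] for r in other], [r[2] for r in other])

print(f"overall: {overall:.2f}")
print(f"within summer: {within_summer:.2f}")
print(f"within other seasons: {within_other:.2f}")
```

The pooled correlation is strong, while the correlation within each season is close to zero. A system that only picks up the pooled pattern would link ice cream to drownings; one that reasons causally would attribute both to the season.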

While LLMs are adept at mimicking human language and producing text based on patterns, they are not reasoning through the relationships between events. In many cases, they simply recall patterns from their training data.

Limitations in Current AI Benchmarking and Testing Methods

The current benchmarks for testing AI’s reasoning capabilities are far from perfect, as they often rely on pre-existing datasets or well-known facts. While these tests may work for basic cause-and-effect relationships, they fall short on real-world or more complex scenarios. Because the models can draw heavily on domain knowledge from their training data, it is unclear whether they are genuinely reasoning or just recalling patterns they have encountered before.

Moreover, most of these tests are conducted in English, overlooking the diverse linguistic structures found in languages around the world. As the role of AI systems is expected to grow globally, the need for multilingual testing frameworks becomes increasingly important.

Welo Data’s Multilingual Framework for Evaluating Causal Reasoning

Recognizing these gaps, Welo Data developed a new, comprehensive multilingual framework to evaluate LLMs’ causal reasoning abilities. The approach included:

Multilingual testing: Tests in English, Spanish, Japanese, Korean, Turkish, and Arabic allowed for a thorough assessment of LLMs’ abilities to reason across different grammatical features, word orders, and domains.
Human-crafted, complex scenarios: Original, specific scenarios that required causal inference tested whether the LLM could understand the underlying relationships, rather than just simple associations.
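A minimal sketch of how such a test set might be represented and scored per language follows. It is not Welo Data’s implementation: the `CausalItem` structure, the `accuracy_by_language` function, the multiple-choice format, and the stub model are all assumptions made for illustration, and the prompts are placeholders rather than real test items.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class CausalItem:
    """One scenario in one language (hypothetical structure)."""
    item_id: str              # shared by all translations of a scenario
    language: str             # e.g. "en", "es", "ja", "ko", "tr", "ar"
    prompt: str               # human-written scenario plus question
    options: tuple[str, ...]  # candidate answers shown to the model
    answer: str               # the causally correct option


def accuracy_by_language(
    items: Iterable[CausalItem],
    ask_model: Callable[[CausalItem], str],
) -> dict[str, float]:
    """Return the fraction of correctly answered items for each language."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item in items:
        total[item.language] += 1
        if ask_model(item) == item.answer:
            correct[item.language] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def first_option_model(item: CausalItem) -> str:
    """Stand-in for a real LLM call: always picks the first option."""
    return item.options[0]


# Placeholder items; real prompts would be written by human experts.
items = [
    CausalItem("s1", "en", "<scenario 1 in English>", ("A", "B"), "A"),
    CausalItem("s1", "tr", "<scenario 1 in Turkish>", ("A", "B"), "A"),
    CausalItem("s2", "en", "<scenario 2 in English>", ("A", "B"), "B"),
    CausalItem("s2", "tr", "<scenario 2 in Turkish>", ("A", "B"), "B"),
]

scores = accuracy_by_language(items, first_option_model)
print(scores)
```

Keeping a shared `item_id` across translations lets the same scenario be compared directly between languages, so a drop in accuracy can be traced to the language rather than to a difference in content.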

Key Findings: AI’s Struggle with Multilingual Causal Reasoning

One of the more striking findings in the research was that LLMs struggled with complex causal reasoning tasks, particularly in scenarios that required multi-step thinking. These models often failed to grasp causal relationships in story-based, contextual reasoning problems, revealing a significant gap in their reasoning abilities.

Another discovery was the inconsistency of AI responses when presented with the same prompt in different languages. The models’ performance varied significantly depending on the language in which they were tested. For instance, models performed better in English and Spanish than in Turkish and Arabic, indicating that LLMs reason better in the languages they are most familiar with.
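One simple way to quantify this kind of inconsistency is to check, for each scenario, whether the model gives the same answer in every language. The sketch below is an assumption about how that could be measured, not the metric used in the research; `consistency_rate` is a hypothetical helper and the responses are toy data.

```python
from collections import defaultdict


def consistency_rate(responses):
    """Fraction of scenarios answered identically in every language.

    `responses` is an iterable of (item_id, language, answer) tuples,
    where item_id is shared by all translations of one scenario.
    """
    by_item = defaultdict(dict)
    for item_id, language, answer in responses:
        by_item[item_id][language] = answer
    if not by_item:
        return 0.0
    consistent = sum(
        1 for answers in by_item.values() if len(set(answers.values())) == 1
    )
    return consistent / len(by_item)


# Toy data: scenario s1 is answered the same way in all three languages,
# scenario s2 gets a different answer in Turkish than in English.
toy_responses = [
    ("s1", "en", "A"),
    ("s1", "es", "A"),
    ("s1", "tr", "A"),
    ("s2", "en", "B"),
    ("s2", "tr", "C"),
]

rate = consistency_rate(toy_responses)
print(rate)  # 0.5
```

Consistency is independent of correctness: a model can be consistently wrong. Reporting it alongside per-language accuracy separates the two failure modes.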

Implications for AI Development and Global Applications

The research highlights the urgent need for AI systems to reason causally across languages, especially as AI becomes integral to global industries such as healthcare, finance, and customer support. If LLMs are to be relied upon for decision-making in diverse global contexts, they must be able to accurately understand cause-and-effect relationships, regardless of the language they are processing.

To achieve this, AI developers must focus on creating more robust multilingual frameworks that can assess causal reasoning across a variety of languages. This means moving beyond English-centric training datasets and incorporating a wider range of languages to better reflect the complexities that exist in the real world. Doing so will help make AI more accurate, equitable, and effective in global applications.

What Next for LLMs and Causal Reasoning?

Welo Data’s research is just the beginning. The next step is to refine and expand the multilingual framework, adding more languages and increasingly complex reasoning scenarios. By continuing to explore how LLMs handle causal reasoning in different linguistic and cultural contexts, researchers can help improve the models’ performance and reliability.

Welo Data plans to collaborate with AI researchers, linguists, and data scientists to advance the development of AI systems that can reason effectively in any language and context. The goal is to make AI smarter and to ensure that it is truly capable of understanding cause-and-effect relationships worldwide.

Looking Ahead: Unlocking True AI Intelligence

As we push the boundaries of AI’s reasoning capabilities, the future lies in developing systems that can effortlessly operate across diverse languages and contexts, empowering AI to make intelligent, informed decisions globally. The journey ahead is one of ongoing innovation, where genuine understanding — rather than mere pattern recognition — will define the next generation of AI.

About the Author: Abigail Thornton

Abigail Thornton is the Research Labs Lead at Welo Data, leveraging over a decade of experience in linguistics to spearhead cutting-edge projects in natural language processing (NLP). With a Ph.D. in Linguistics from the University of Connecticut, she has dedicated her career to advancing computational linguistics and enhancing language technologies. Abigail focuses on exploring the potential of large language models and improving natural language understanding. Her work aims to develop innovative solutions that bridge theoretical linguistics and practical applications, contributing to AI advancements that support diverse linguistic and cultural contexts.

welodata.ai