Exploring the Strategic Capabilities of LLMs in a Risk Game Setting

AI DEEP DIVE

In a simulated Risk environment, large language models from Anthropic, OpenAI, and Meta showcase distinct strategic behaviors, with Claude Sonnet 3.5 edging out a narrow lead

Image generated by the author using DALL-E

Introduction

In recent years, large language models (LLMs) have rapidly become a part of our everyday lives. Since OpenAI blew our minds with GPT-3, we have witnessed a profound increase in the capabilities of these models, which now excel at a myriad of different tests, from language comprehension to reasoning and problem-solving tasks.

One topic that I find particularly compelling — and perhaps under-explored — is the ability of LLMs to reason strategically. That is, how the models will act if you insert them into a situation where the outcome of their decisions depends not only on their own actions but also on the actions of others, who are also making decisions based on their own goals. The LLMs’ ability to think and act strategically is increasingly important as we weave them into our products and services, and especially considering the emerging risks associated with powerful AIs.

A decade ago, philosopher and author Nick Bostrom brought AI risk into the spotlight with his influential book Superintelligence. He started a global conversation about AI, and it brought AI as an existential risk into the popular debate. Although the LLMs are still far from Bostrom’s superintelligence, it’s important to keep an eye on their strategic capabilities as we integrate them tighter into our daily lives.

When I was a child, I used to love playing board games, and Risk was one of my favorites. The game requires a great deal of strategy, and if you don’t think through your moves, you will likely be decimated by your opponents. Risk serves as a good proxy for evaluating strategic behavior because making strategic decisions involves weighing potential gains against uncertain outcomes. While luck clearly plays a big part when troop counts are small, given enough time and larger army sizes the luck component becomes less pronounced and the most skillful players emerge. So, what better arena to test the LLMs’ strategic behavior than Risk!

In this article I explore two main topics related to LLMs and strategy. Firstly, which of the top LLM models is the most strategic Risk player and how strategic is the best model in its actions? Secondly, how have the strategic capabilities of the models developed through the model iterations?

To answer these questions, I built a virtual Risk game engine and let the LLMs battle it out. The first part of this article will explore some of the details of the game implementation before we move on to analyzing the results. We then discuss how the LLMs approached the game and their strategic abilities and shortcomings, before we end with a section on what these results mean and what we can expect from future model generations.

Setting the Stage

Image generated by author using DALL-E

Why Risk?

My own experience playing Risk obviously played a part in choosing this game as a testbed for the LLMs. The game requires players to understand how their territories are linked and to balance offense with defense, all while planning long-term strategies. Elements of uncertainty are also introduced through dice rolls and unpredictable opponent behavior, challenging AI models to manage risk and adapt to changing conditions.

Risk simulates real-world strategic challenges, such as resource allocation, adaptability, and pursuing long-term goals amid immediate obstacles, making it a valuable proxy for evaluating AI’s strategic capabilities. By placing LLMs in this environment, we can observe how well they handle these complexities compared to human players.

The Modelling Environment

To conduct the experiments, I created a small Python package creatively named risk_game. (See the appendix for how to get started running this on your own machine.) The package is a Risk game engine, and it allows for the simulation of games played by LLMs. (The non-technical reader can safely skip this part and continue to the section “The Flow of the Game”.)

To make it easier to conceptually keep track of the moving parts, I followed an object-oriented approach to the package development, where I developed a few different key classes to run the simulation. This includes a game master class to control the flow of the game, a player class to control prompts sent to the LLMs and a game state class to control the state of the game, including which player controls which territories and how many troops they hold at any given time.

I tried to make it a flexible and extensible solution for AI-driven strategy simulations, and the package could potentially be modified to study strategic behavior of LLMs in other settings as well. See below for a full overview of the package structure:

risk_game/

├── llm_clients/
│ ├── __init__.py
│ ├── anthropic_client.py
│ ├── bedrock_client.py
│ ├── groq_client.py
│ ├── llm_base.py
│ ├── llm_client.py
│ └── openai_client.py

├── utils/
│ ├── __init__.py
│ ├── decorators.py
│ └── game_admin.py

├── card_deck.py
├── experiments.py
├── game_config.py
├── game_constants.py
├── game_master.py
├── game_state.py
├── main.py
├── player_agent.py
├── rules.py

├── scripts/
│ ├── example_run.py

└── tests/

To run an experiment, I would first instantiate a GameConfig object. This config object holds all the game configuration settings, like whether we played with progressive cards, whether capitals mode was active, and what percentage of the territories needed to be controlled to win, in addition to multiple other game settings. I would then use that config to create an instance of the Experiment class and call its run_experiment method.
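Below is a minimal usage sketch. The GameConfig keyword arguments and the import path for Experiment are assumptions based on the attributes referenced in the Experiment class below; check game_config.py and experiments.py in the repo for the real signatures.

from risk_game.game_config import GameConfig
from risk_game.experiments import Experiment

# Assumed GameConfig fields, mirroring the attributes used in Experiment.__repr__
config = GameConfig(
    progressive=True,                   # progressive card bonuses
    capitals=False,                     # capitals mode off
    territory_control_percentage=0.65,  # first to 65% of territories wins
    required_continents=None,
    key_areas=None,
    max_rounds=17,
)

experiment = Experiment(config, agent_mix=1, num_games=10)
experiment.run_experiment()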

Diving deeper behind the scenes, we can see how the Experiment class is set up.

from risk_game.llm_clients import llm_client
import risk_game.game_master as gm
from risk_game.rules import Rules
from typing import List
from risk_game.game_config import GameConfig

class Experiment:
    def __init__(self, config: GameConfig, agent_mix: int = 1,
                 num_games: int = 10) -> None:
        """
        Initialize the experiment with default options.

        Args:
        - num_games (int): The number of games to run in the experiment.
        - agent_mix (int): The type of agent mix to use in the experiment.
        - config (GameConfig): The configuration for the game.
        """
        self.config = config
        self.num_games = num_games
        self.agent_mix = agent_mix

    def __repr__(self) -> str:

        if self.config.key_areas:
            key_areas = ', '.join(self.config.key_areas)
        else:
            key_areas = 'None'

        return (f"Experiment Configuration:\n"
                f"Agent Mix: {self.agent_mix}\n"
                f"Number of Games: {self.num_games}\n"
                f"Progressive: {self.config.progressive}\n"
                f"Capitals: {self.config.capitals}\n"
                f"Territory Control Percentage: "
                f"{self.config.territory_control_percentage:.2f}\n"
                f"Required Continents: {self.config.required_continents}\n"
                f"Key Areas: {key_areas}\n"
                f"Max Rounds: {self.config.max_rounds}\n")

    def initialize_game(self) -> gm.GameMaster:
        """
        Initializes a single game with default rules and players.

        Returns:
        - game: An instance of the initialized GameMaster class.
        """
        # Initialize the rules
        rules = Rules(self.config)
        game = gm.GameMaster(rules)

        if self.agent_mix == 1:
            # Add strong AI players
            game.add_player(name="llama3.1_70",
                llm_client=llm_client.create_llm_client("Groq", 1))
            game.add_player(name="Claude_Sonnet_3_5",
                llm_client=llm_client.create_llm_client("Anthropic", 1))
            game.add_player(name="gpt-4o",
                llm_client=llm_client.create_llm_client("OpenAI", 1))

        elif self.agent_mix == 3:
            # Add a mix of strong and weaker AI players from OpenAI
            game.add_player(name="Strong(gpt-4o)",
                llm_client=llm_client.create_llm_client("OpenAI", 1))
            game.add_player(name="Medium(gpt-4o-mini)",
                llm_client=llm_client.create_llm_client("OpenAI", 2))
            game.add_player(name="Weak(gpt-3.5-turbo)",
                llm_client=llm_client.create_llm_client("OpenAI", 3))

        elif self.agent_mix == 5:
            # Add a mix of extra strong AI players
            game.add_player(name="Big_llama3.1_400",
                llm_client=llm_client.create_llm_client("Bedrock", 1))
            game.add_player(name="Claude_Sonnet_3_5",
                llm_client=llm_client.create_llm_client("Anthropic", 1))
            game.add_player(name="gpt-4o",
                llm_client=llm_client.create_llm_client("OpenAI", 1))

        return game

    def run_experiment(self) -> None:
        """
        Runs the experiment by playing multiple games and saving results.
        """
        for i in range(1, self.num_games + 1):
            print(f"Starting game {i}...")
            game = self.initialize_game()
            game.play_game(include_initial_troop_placement=True)

From the code above, we see that the run_experiment() method plays the number of games specified when the Experiment object is initialized. Each game starts with initialize_game(), which creates the rules and instantiates a game via the GameMaster class. Subsequently, the chosen mix of LLM player agents is added to the game. This concludes the pre-game setup, and the game’s play_game() method kicks off the actual play.

To avoid becoming too technical I will skip over most of the code details for now, and rather refer the interested reader to the Github repo below. Check out the README to get started:

The Flow of the Game

Once the game begins, the LLM player agents are prompted to do initial troop placement. The agents take turns placing their troops on their territories until all their initial troops have been exhausted.

After initial troop placement, the first player starts its turn. In Risk, a turn comprises the following three phases:

  • Phase 1: Card trading and troop placement. If a player agent wins an attack during its turn, it gains a card. Once it has three cards, it can trade them in for troops if it holds the correct combination of infantry, cavalry, artillery, or wildcards. The player also receives troops based on how many territories it controls and whether it holds any complete continents.
  • Phase 2: Attack. In this phase the player agent can attack other players and take over their territories. Attacking is generally worthwhile because winning at least one battle earns the player a card for that turn and expands its territory. The player agent can attack as many times as it wishes during a turn.
  • Phase 3: Fortify. In the last phase, the fortify phase, the player is allowed to move troops from one of its territories to another. However, the two territories must be connected by territories the player controls, and only one such fortify move is allowed. Once the fortify phase is finished, the next player starts its turn.

At the beginning of each turn, the LLM agents receive dynamically generated prompts to formulate their strategy. This strategy-setting prompt provides the agent with the current game rules, the state of the board, and possible attack vectors. The agent’s response to this prompt guides its decisions throughout the turn, ensuring that its actions align with an overall strategic plan.

The strategy-request prompt is given below:

prompt = """
We are playing Risk and you are about to start your turn, but first
you need to define your strategy for this turn.
You, are {self.name}, and these are the current rules we are
playing with:

{rules}

{current_game_state}

{formatted_attack_vectors}

Your task is to formulate an overall strategy for your turn,
considering the territories you control, the other players, and the
potential for continent bonuses.

Since the victory conditions only requires you to control
{game_state.territories_required_to_win} territories, and you already
control {number_of_territories} territories,
you only need to win an extra {extra_territories_required_to_win}
to win the game outright. Can you do that this turn?? If so lay
your strategy out accordingly.

**Objective:**

Your goal is to win the game by one of the victory conditions given
in the rules. Focus on decisive attacks that reduce
your opponents' ability to fight back. When possible, eliminate
opponents to gain their cards, which will allow you to trade them
in for more troops and accelerate your conquest.

**Strategic Considerations:**

1. **Attack Strategy:**
- Identify the most advantageous territories to attack.
- Prioritize attacks that will help you secure continent bonuses or
weaken your strongest opponents.
- Look for opportunities to eliminate other players. If an opponent
has few territories left, eliminating them could allow you to gain
their cards, which can be especially powerful if you’re playing with
progressive card bonuses.
- Weigh the risks of attacking versus the potential rewards.

2. **Defense Strategy:**
- Identify your most vulnerable territories and consider fortifying
them.
- Consider the potential moves of your opponents and plan your defense
accordingly.

Multi-Turn Planning: Think about how you can win the game within
the next 2-3 turns. What moves will set you up for a decisive victory?
Don't just focus on this turn; consider how your actions this turn
will help you dominate in the next few turns.

**Instructions:**

- **Limit your response to a maximum of 300 words.**
- **Be concise and direct. Avoid unnecessary elaboration.**
- **Provide your strategy in two bullet points, each with a
maximum of four sentences.**

**Output Format:**

Provide a high-level strategy for your turn, including:
1. **Attack Strategy:** Which territories will you target, and why?
How many troops will you commit to each attack? If you plan to
eliminate an opponent, explain how you will accomplish this.
2. **Defense Strategy:** Which territories will you fortify, and
how will you allocate your remaining troops?

Example Strategy:
- **Attack Strategy:** Attack {Territory B} from {Territory C} with
10 troops to weaken Player 1 and prevent them from securing the
continent bonus for {Continent Y}. Eliminate Player 2 by attacking
their last remaining territory, {Territory D}, to gain their cards.
- **Defense Strategy:** Fortify {Territory E} with 3 troops to
protect against a potential counter-attack from Player 3.

Remember, your goal is to make the best strategic decisions that
will maximize your chances of winning the game. Consider the
potential moves of your opponents and how you can position
yourself to counter them effectively.

What is your strategy for this turn?
"""

As you can see from the prompt above, there are multiple dynamically generated elements that help the player agent better understand the game context and make more informed strategic decisions.

These dynamically produced elements include:

  • Rules: The rules of the game, such as whether capitals mode is activated, what percentage of the territories is needed to secure a win, etc.
  • Current game state: This is presented to the agent as the different continents and the territories each player controls within them, along with the troop counts on each territory.
  • Formatted Attack Vectors: A collection of the territories the agent can launch an attack from, the territories it can attack, and the maximum number of troops it can attack with (a small sketch of how such vectors can be derived is shown after this list).
  • The extra territories needed to win the game: This represents the remaining territories the agent needs to capture to win the game. For example, if 28 territories are required to win and the agent already holds 25, this number would be 3 and might encourage the agent to adopt a more aggressive strategy for that turn.
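To make the attack-vector idea concrete, here is a small, self-contained sketch of how such vectors could be derived from a game state. The data structures are illustrative and not the actual risk_game internals:

# Illustrative game state: adjacency, ownership and troop counts (not the real risk_game structures)
adjacent = {"Brazil": ["Peru", "Argentina", "Venezuela", "North Africa"]}
owner = {"Brazil": "gpt-4o", "Peru": "Claude_Sonnet_3_5",
         "Argentina": "gpt-4o", "Venezuela": "llama3.1_70",
         "North Africa": "llama3.1_70"}
troops = {"Brazil": 7, "Peru": 2, "Argentina": 3, "Venezuela": 1, "North Africa": 4}

def attack_vectors(player: str) -> list[tuple[str, str, int]]:
    """(from, to, max attacking troops) for every owned territory bordering an enemy."""
    vectors = []
    for src, neighbours in adjacent.items():
        if owner[src] != player or troops[src] < 2:
            continue  # at least 2 troops are needed to attack; 1 must stay behind
        for dst in neighbours:
            if owner[dst] != player:
                vectors.append((src, dst, troops[src] - 1))
    return vectors

print(attack_vectors("gpt-4o"))
# [('Brazil', 'Peru', 6), ('Brazil', 'Venezuela', 6), ('Brazil', 'North Africa', 6)]

In the engine, a structure like this would then be formatted into the {formatted_attack_vectors} block of the prompt shown above.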

For each specific action during the turn — whether it’s placing troops, attacking, or fortifying — the agent is given tailored prompts that reflect the current game situation. Thankfully, Risk’s gameplay can be simplified because it adheres to the Markov property, meaning that optimal moves depend only on the current game state, not on the history of moves. This allows for streamlined prompts that focus on the present conditions.

The Experimental Setup

To explore the strategic capabilities of LLMs, I designed two main experiments. These experiments were crafted to address two key questions:

  1. What is the top performing LLM, and how strategic is it in its actions?
  2. Is there a progression in the strategic capabilities of the LLMs through model iterations?

Both of these questions can be answered by running two different experiments, with a slightly different mix of AI agents.

Experiment-1: Evaluating the Top Models

For the first question, I created an experiment using the following top LLM models as players:

  • OpenAI’s GPT-4o running off the OpenAI API endpoint
  • Anthropic’s claude-3–5-sonnet-20240620 running off the Anthropic API endpoint
  • Meta’s llama-3.1–70b-versatile running off the Groq API endpoint

I obviously wanted to try Meta’s meta.llama3–1–405b-instruct-v1:0 and configured it to run off AWS Bedrock; however, the response time was painfully slow and made simulating games take forever. This is why we run Meta’s 70b model on Groq, which is much faster than AWS Bedrock. (If anyone knows how to speed up llama3.1 405b on AWS, please let me know!)

And we formulate our null and alternative hypotheses as follows:

Experiment-1, H0 : There is no difference in performance among the models; each model has an equal probability of winning.

Experiment-1, H1​: At least one model performs better (or worse) than the others, indicating that the models do not have equal performance.

Experiment-2: Analyzing the Model Generations

The second experiment aimed to evaluate how strategic capabilities have progressed through different iterations of OpenAI’s models. For this, I selected three models:

  • GPT-4o
  • GPT-4o-mini
  • GPT-3.5-turbo-0125

Experiment-2 allows us to see how the strategic capabilities of the models have developed across model generations, and also allows us to analyze the difference between different size models in the same model generation (GPT-4o vs GPT-4o-mini). I chose OpenAI’s solutions because they didn’t have the same restrictive rate limits as the other providers.

As with Experiment-1, we can formulate null and alternative hypotheses for this experiment:

Experiment-2, H0: There is no difference in performance among the models; each model has an equal probability of winning

Experiment-2, H1A​: GPT-4o is better than GPT-4o-mini

Experiment-2, H1B: GPT-4o and GPT-4o-mini are better than GPT-3.5-turbo

Game Setup, Victory Conditions & Card Bonuses

Both experiments involved 10 games, each with the same victory conditions. There are multiple different victory conditions in Risk, and typical victory conditions that players can agree upon are:

  1. Number of controlled territories required for the winner. “World domination” is a subset of this, where one player needs to control all the territories. Another typical territory condition is 70% territory control.
  2. Number of controlled continent(s) required for the winner
  3. Control / possession of key areas required for the winner
  4. Preset time / turn count: whoever controls the most territories after x hours or x turns wins.

In the end I settled on a more pragmatic approach: a combination of victory conditions that would be easier to fulfill, together with progressive cards. The victory conditions for the games in the experiments were:

  1. First agent to reach 65% territory dominance or
  2. The agent with the most territories after 17 game rounds of play (meaning a full game concludes after at most 51 turns distributed across the three players).

For those of you unfamiliar with Risk, progressive cards mean that the value of traded card sets increases as the game goes on. This contrasts with fixed cards, where the troop value of each combination stays the same throughout the game (4, 6, 8, and 10 for the different combinations). Progressive is generally accepted to be a faster game mode.
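To make the difference concrete, here is a small sketch of a commonly played progressive trade-in schedule (4, 6, 8, 10, 12, 15, then +5 per additional set). The exact values used in risk_game, presumably defined in game_constants.py, may differ.

def progressive_card_bonus(sets_traded_so_far: int) -> int:
    """Troops awarded for the next card set under progressive rules."""
    schedule = [4, 6, 8, 10, 12, 15]
    if sets_traded_so_far < len(schedule):
        return schedule[sets_traded_so_far]
    # After the fixed schedule, each additional set is worth 5 more troops
    return 15 + 5 * (sets_traded_so_far - len(schedule) + 1)

print([progressive_card_bonus(n) for n in range(8)])  # [4, 6, 8, 10, 12, 15, 20, 25]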

The Results — Who Conquered the World?

Image generated by the author using DALL-E

Experiment-1: The Top Models

The results were actually quite astounding — for both experiments. Starting with the first, below we show the distribution of wins amongst the three agents. Anthropic’s Claude is the winner with 5 wins in total, second place goes to OpenAI’s GPT-4o with 3 wins and last place to Meta’s llama3.1 with 2 wins.

Figure 3. Experiment-1 wins by player, grouped by victory condition / image by author

Because of their long history and early success with GPT-3, I was expecting OpenAI’s model to win, but it was Anthropic’s Claude that took the lead in overall games. Then again, given how well Claude performs on benchmark tests, it shouldn’t be too unexpected that it comes out ahead.

Territory Control and Game Flow

If we dive a little deeper in the overall flow of the game and evaluate the distribution of territories throughout the game, we find the following:

Figure 5. Experiment-1 territory control per turn / image by author

When we examine the distribution of territories throughout the games, a clearer picture emerges. On average, Claude managed to gain a lead in territory control midway through most games and maintained that lead until the end. Interestingly, there was only one instance where a player was eliminated from the game entirely — this happened in Game 8, where Llama 3.1 was knocked out around turn 27.

In our analysis, a “turn” refers to the full set of moves made by one player during their turn. Since we had three agents participating, each game round typically involved three turns, one for each player. As players were eliminated, the number of turns per round naturally decreased.

Looking at the evolution of troop strength and territory control we find the following:

Figure 6. Experiment-1 change in troop strength throughout the game / image by author

The troop strength seems to be relatively even, on average, for all the models, so that is clearly not the reason why Claude is able to pull off the most wins.

Statistical Analysis: Is Claude Really the Best?

In this experiment, I aimed to determine whether any of the three models demonstrated significantly better performance than the others based on the number of wins. Given that the outcome of interest was the frequency of wins across multiple categories (the three models), the chi-square goodness-of-fit test is a good statistical tool to use.

The test is often used to compare observed frequencies against expected frequencies under the null hypothesis, which in this case was that all models would have an equal probability of winning. By applying the chi-square test, I could assess whether the distribution of wins across the models deviated significantly from the expected distribution, thereby helping to identify if any model performed substantially better.

from scipy.stats import chisquare

# Observed wins for the three models
observed = [5, 3, 2]

# Expected wins under the null hypothesis (equal probability)
expected = [10 / 3] * 3

# Perform the chi-square goodness-of-fit test
chi2_statistic, p_value = chisquare(f_obs=observed, f_exp=expected)

chi2_statistic, p_value

(1.4, 0.4965853037914095)

The chi-square goodness-of-fit test was conducted based on the observed wins for the three models: 5 wins for Claude, 3 wins for GPT-4o, and 2 wins for llama3.1. Under the null hypothesis:

Experiment-1, H0 : There is no difference in performance among the models; each model has an equal probability of winning.

each model was expected to win approximately 3.33 games out of the 10 trials. The chi-square test yielded a statistic of 1.4 with a corresponding p-value of 0.497. Since this p-value is much larger than the conventional significance level of 0.05, we can’t really say with any statistical rigor that Claude is better than the others.

We can interpret the p-value such that there is a 49.7% chance that we would observe an outcome as extreme as (5,3,2) under the null hypothesis, which assumes each model has the same probability of winning. So this is actually quite a likely scenario to observe.

To make a definitive conclusion, we would need to run more experiments with a larger sample size. Unfortunately, rate limits — particularly with Llama 3.1 hosted on Groq — made this impractical. I invite the eager reader to follow up and test themselves. See the appendix for how to run the experiments on your own machine.

Experiment-2: Model Generations

The results of Experiment-2 were equally surprising. Contrary to expectations, GPT-4o-mini outperformed both GPT-4o and GPT-3.5-turbo. GPT-4o-mini secured 7 wins, while GPT-4o managed 3 wins, and GPT-3.5-turbo failed to win any games.

Figure 8. Number of wins by player and victory condition / image by author

GPT-4o-mini walked off with the overall victory, and by a rather substantial margin: 7 wins to GPT-4o’s 3 and GPT-3.5-turbo’s 0. While GPT-4o on average had more troops, GPT-4o-mini won most of the games.

Territory Control and Troop Strength

Again, diving deeper and looking at performance in individual games, we find the following:

Figure 9. Experiment-2 Average territory control per turn, for all games / image by author

The charts above show territory control per turn, on average, as well as for all the games. These plots show a confirmation of what we saw in the overall win statistics, namely that GPT-4o-mini is on average coming out with the lead in territory control by the end of the games. GPT-4o-mini is beating its big brother when it actually counts, close to the end of the game!

Turning around and examining troop strength, a slightly different picture emerges:

Figure 10. Experiment-2 Average total troop strength per turn, for all games / image by author

The above chart shows that on average, the assumed strongest player, GPT-4o manages to keep the highest troop strength throughout most of the games. Surprisingly it fails to use this troop strength to its advantage! Also, there is a clear trend between troop strength and model size and model generation.

To gain more insight, we can also evaluate a few games in more detail and look at heatmaps of controlled territories across the turns.

Figure 11. Experiment 2, heatmap of territory control, game 2 / image by author
Figure 12. Experiment 2, heatmap of territory control, game 7 / image by author

From the heatmaps we see how the models trade blows and grab territories from one another. We have selected two games that seemed reasonably representative of the 10 games in the experiment.

Regarding specific territory ownership, a trend we saw play out frequently was GPT-4o trying to hold North America while GPT-4o-mini often tried to get Asia.

Statistical Analysis: Generational Differences

With the above results, let’s again revisit our initial hypotheses:

Experiment-2, H0 : There is no difference in performance among the models; each model has an equal probability of winning.

Experiment-2, H1A​: GPT-4o is better than GPT-4o-mini

Experiment-2, H1B: GPT-4o and GPT-4o-mini are better than GPT-3.5-turbo

Let’s start with the easy one, H1B, namely that GPT-4o and GPT-4o-mini are better than GPT-3.5-turbo. This is quite easy to see, and we can do a chi-squared test again, based on equal probabilities of winning for each model.

from scipy.stats import chisquare

# Observed wins for the three models
observed = [7, 3, 0]

# total observations
total_observations = sum(observed)

# Expected wins under the null hypothesis (equal probability)
expected_probabilities = [1/3] * 3

expected_wins = [total_observations * p for p in expected_probabilities]

# Perform the chi-square goodness-of-fit test
chi2_statistic, p_value = chisquare(f_obs=observed, f_exp=expected_wins)

chi2_statistic, p_value

(7.4, 0.0247235265)

This suggests that the observed distribution of wins is unlikely to have occurred if every model had the same probability of winning (33.3%). In fact, a result as extreme as this would be expected in only about 2.5% of cases, so we can reject the null hypothesis at the 5% significance level.

To then evaluate our H1A hypothesis we should first update our null hypothesis adjusting for unequal probabilities of winning. For example, we can now assume that:

  • GPT-4o-mini: Higher probability
  • GPT-4o: Higher probability
  • GPT-3.5-turbo: Lower probability

Putting some numbers on these, and given the results we just observed, let’s assume:

  • GPT-4o-mini: 45% chance of winning each game
  • GPT-4o: 45% chance of winning each game
  • GPT-3.5-turbo: 10% chance of winning each game

Then, for 10 games, the expected wins would be:

  • GPT-4o-mini: 0.45×10=4.5
  • GPT-4o: 0.45 ×10=4.5
  • GPT-3.5-turbo: 0.1×10=1

In addition, given the fact that GPT-4o-mini won 7 out of the 10 games, we also revise our alternative hypothesis:

Experiment-2 Revised Hypothesis, H1AR: GPT-4o-mini is better than GPT-4o.

Using Python to calculate the chi-squared test, we get:

from scipy.stats import chisquare

# Observed wins for the three models
observed = [7, 3, 0]

# Expected wins under the updated null hypothesis (unequal probabilities)
expected_wins = [0.45 * 10, 0.45 * 10, 0.1 * 10]

# Perform the chi-square goodness-of-fit test
chi2_statistic, p_value = chisquare(f_obs=observed,
f_exp=expected_wins)

chi2_statistic, p_value

(2.888888888888889, 0.23587708298570023)

With our updated probabilities, we see from the code above that a result as extreme as (7, 3, 0) is in fact not very unlikely. Interpreting the p-value tells us that a result at least as extreme as what we observed would be expected about 23% of the time. So, we cannot conclude with any statistical significance that there is a difference between GPT-4o-mini and GPT-4o, and we fail to find support for the revised alternative hypothesis, H1AR.

Key Takeaways

Although there is only limited evidence to suggest Claude is the more strategic model, we can state with reasonably high confidence that there is a difference in performance across model generations. GPT-3.5-turbo is significantly less strategic than its newer iterations. Put the other way around, strategic ability is increasing as the models improve through the generations, and this is likely to profoundly impact how these models will be used in the future.

Analyzing the Strategic Behavior of LLMs

Image generated by the author using DALL-E

One of the first things I noticed after running some initial tests was how differently the LLMs play compared to humans. The LLM games tend to have more turns than human games, even after I prompted the agents to be more aggressive and go after weak opponents.

While many of the observations about player strategy can be made just from looking at plots of territory control and troop strength, some of the more detailed observations below first became clear as I watched the LLMs play turn-by-turn. This is slightly hard to replicate in an article format; however, all the data from both experiments are stored in .csv files in the Github repo and loaded into pandas dataframes in the Jupyter notebooks used for analysis. The interested reader can find them in the repo here: /game_analysis/experiment1_analysis_notebook.ipynb. The dataframe experiment1_game_data_df holds all relevant game data for Experiment-1. By looking at territory ownership and troop control turn-by-turn, more details about the playstyles emerge.
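As a hedged illustration, exploring the data turn-by-turn might look something like the following; the file name and column names here are hypothetical, so check the notebook in the repo for the actual layout.

import pandas as pd

# Hypothetical file and column names -- see experiment1_analysis_notebook.ipynb
# in the repo for the actual layout.
experiment1_game_data_df = pd.read_csv("game_analysis/experiment1_game_data.csv")

# Territory counts per player and turn for a single game
game_8 = experiment1_game_data_df[experiment1_game_data_df["game"] == 8]
print(game_8.pivot_table(index="turn", columns="player_name",
                         values="territories_controlled", aggfunc="sum"))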

Distinctive Winning Play Styles

What seemed to distinguish Anthropic’s model was its ability to claim a lot of territory in one move. This can be seen in some of the plots of territory control, when you look at individual games. But even though Claude had the most wins, how strategic was it really? Based on what I observed in the experiments, it seems that the LLMs are still rather immature when it comes to strategy. Below we discuss some of the typical behavior observed through the games.

Poor Fortifying Strategies

A common issue across all models was a failure to adequately fortify their borders. Quite frequently, the agents were left with large numbers of troops stranded in interior territories instead of guarding their borders. This made it easier for neighbors to attack their territories and steal continent bonuses. It also made it harder for the agents to pull off a larger land grab, since their territories with large troop counts were often surrounded by other territories they already controlled.

Failure to See Winning Moves

Another noticeable shortcoming was the models’ failure to recognize winning moves. They often don’t seem to realize that they can win within a single turn if they play correctly. This was less pronounced with the stronger models, but still present.

For example, all the games in the simulations were played with 65% territory control as the victory condition, which means you only need to acquire 28 territories. In one instance during Experiment-2 Game 2, OpenAI’s GPT-4o had 24 territories and 19 troops in Greenland. It could easily have pushed into Europe, which held several territories with just 1 troop each, but it failed to see the move. This is a move that even a relatively inexperienced human player would likely recognize.

Failure to Eliminate Other Players

The models also frequently failed to eliminate opponents with only a few troops remaining, even when it would have been strategically advantageous. More specifically, they fail to remove players with only a few troops left and more than 2 cards. This would be considered an easy move for most human players, especially when playing with progressive cards. The card bonuses quickly escalate, and if an opponent only has 10 troops left but 3 or more cards, taking them down for the cards is almost always the right move.

GPT-4o Likes North America

One of the most typical strategies I saw GPT-4o pursue was to grab early control of North America. The strong continent bonus, combined with the fact that the continent only needs to be defended at three entry points, makes it a strategically good starting position. I suspect GPT-4o does this because its training data describes North America as a strategically good location.

Top Models Finish More Games

Overall, the top models tended to finish more of the games by achieving the victory conditions than the weaker models did. Of the games played with the top models, only 2 ran to the maximum round limit, while this happened 6 times with the weaker models.

Limitations of Pre-Trained Knowledge

A limitation of classic Risk is of course that the LLMs have read about strategies for playing Risk online, and then the top models are simply the best ones at executing on this. I think the tendency to quickly try to dominate North America highlights this. This limitation could be mitigated if we played with randomly generated maps instead. This would increase the difficulty level and would provide a higher bar for the models. However, given their performance on the current maps I don’t think we need to increase the difficulty for the current model generations.

General observations

Even the strongest LLMs are still far from mastering strategic gameplay. None of the models exhibited behavior that could challenge even an average human player. I think we have to wait at least one or two model generations before we can start to see a substantial increase in strategic behavior.

That said, dynamically adjusting prompts to handle specific scenarios — such as eliminating weak opponents for card bonuses — could improve performance. With different and more enhanced prompting the models might be able to put up more of a fight. To get that to work though, you would need to manually program in a range of possible scenarios that typically occur and offer specialized prompts for each scenario.

Consider a concrete example where we could see this come into play: Player B is weak and only has 4 territories and 10 troops; however, Player B holds 3 Risk cards, you are playing with progressive cards, and the reward for trading in cards is currently 20 troops.
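A scenario-specific prompt fragment for exactly this situation might look something like the sketch below; this helper is hypothetical and not part of risk_game.

def elimination_hint(player: str, territories: int, troops: int,
                     cards: int, card_bonus: int) -> str:
    # Hypothetical helper: returns an extra prompt fragment nudging the agent
    # to consider eliminating a weak opponent who is sitting on cards.
    return (
        f"{player} holds only {territories} territories with {troops} troops, "
        f"but has {cards} Risk cards. The current progressive trade-in is worth "
        f"{card_bonus} troops. Eliminating {player} this turn would give you "
        f"their cards. Weigh an elimination attack accordingly."
    )

print(elimination_hint("Player B", 4, 10, 3, 20))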

For the sake of this experiment, I didn’t want to make the prompts too specialized, because the goal wasn’t to optimize agent behavior in Risk, but rather to test their ability to do that themselves, given the game state.

What These Results Mean for the Future of AI and Strategy

Image generated by the author using DALL-E

The results from these experiments highlight a few key considerations for the future of AI and its strategic applications. While LLMs have shown remarkable improvements in language comprehension and problem-solving, their ability to reason and act strategically is still in its early stages.

Strategic Awareness and AI Evolution

As seen in the simulations, the current generation of LLMs struggles with basic strategic concepts like fortification and recognizing winning moves. This indicates that even though AI models have improved in many areas, the sophistication required for high-level strategic thinking remains underdeveloped.

However, as we clearly saw in Experiment-2, there is a trend towards improved strategic thinking, and if this trend continues for future generations, we probably won’t have to wait too long until the models are much more capable. There are people claiming that LLMs have already plateaued; however, I would be very careful about assuming that.

Implications for Real-World Applications

The real-world applications of a strategically aware and capable AI agent are obviously enormous and cannot be overstated. They could be used in anything from business strategy to military planning and complex human interaction. A strategic AI that can anticipate and react to the actions of others could be incredibly valuable — and of course also very dangerous. Below we present three possible applications.

If we consider a more positive application first, we could imagine everyone having a helpful strategic agent guiding them through their daily lives, helping make important decisions. This agent could help with anything from financial planning, planning daily tasks, to optimizing social interactions and behavior that involves the actions of other humans. It could act on your behalf and be goal oriented to optimize your interests and well-being.

There are obviously many insidious application areas as well. Think: autonomous fighter drones with on-board strategic capabilities. This might not be too far-fetched, especially when we consider the relative strength of the smaller models compared to their big-brother counterparts, for example GPT-4o-mini compared to GPT-4o. Smaller models are much easier to deploy on edge devices like drones, and given how prominent drones have become in the Russia-Ukraine war, taking the step from first-person-view (FPV) drones to unmanned AI-driven drones might be considered feasible, perhaps even as a back-up option if the drone operator loses contact with the drone.

Detailed simulation of social interaction is a third way to use strategically aware agents. We could, for example, create simulations to model specific economic or other social phenomena, blending classic agent-based methods with LLMs. Agent-based modelling (ABM) as a field of research and toolkit for understanding complex adaptive systems has existed for decades (I used it in my Master’s thesis back in 2012), but coupled with much smarter and more strategic agents it could potentially be game changing.

The Importance of Dynamic Prompting

Detailed dynamic prompting is probably one of the best ways to use and interact with LLMs in the near future — and perhaps also for the next few model generations (GPT-5, Claude 4, etc.). By providing more dynamic scenario-specific prompts and letting LLM agents execute specific plans, we might see more sophisticated strategic behavior in the next generation of models.

This type of “handholding” requires a lot more work from human programmers than simply prompting the agents directly; however, it could be a crucial stepping stone until the models become more capable of independent strategic thinking.

One could of course argue that if we provide overly detailed and specific prompts, we are working against the generalized nature of these models, and at that point we might as well introduce a different type of optimization algorithm. However, I think there are many problems where the more open-ended problem-solving abilities of the LLMs could be paired with some form of dynamic prompting.

The Need for New Benchmarks

As the LLMs continue to improve, it will also be necessary to develop new benchmarks to study them. The traditional benchmarks and tests are well suited to studying problem-solving in isolated environments, but moving forward we might want to introduce more strategic tests that allow us to understand how agents behave in situations where they need to consider how their actions influence others over time. Games like Risk provide a reasonable starting point because of their strategic nature and elements of uncertainty.

Future Considerations

Looking ahead, as AI models continue to evolve, it will be important to monitor their strategic capabilities closely. We need to ensure that as these models become more powerful, they are aligned with human values and ethical considerations. The risks associated with strategic AI — such as unintended consequences in high-stakes environments — must be carefully managed.

As smaller models like GPT-4o-mini have shown competitive performance in strategic tasks, there is potential for deploying highly capable AI on edge devices, such as drones or autonomous systems. This opens up new possibilities for decentralized AI applications that require real-time decision-making in dynamic environments.

Conclusion

I think it’s safe to say that while the strategic capabilities of LLMs are improving with each new generation, they still have a long way to go before they can rival even a moderately skilled human player. Models like Claude and GPT-4o are beginning to show some level of strategic thinking, but their shortcomings in areas such as fortification and recognizing winning moves highlight the current limitations of AI in complex, multi-agent environments. Nevertheless, the trend toward better performance across newer models shows promise for future advancements in AI strategy.

As we continue to integrate AI into more aspects of life, from business to military strategy, understanding and refining the strategic capabilities of these systems will become increasingly important. While we’re not there yet, the potential for AI to handle complex decision-making processes in dynamic environments is enormous. It will be fascinating to see how the capabilities of the LLMs evolve over time, and whether the generational improvements we observed continue through to GPT-5, GPT-6, Claude 4, Claude 5, and beyond. I think we are in for a wild ride!

If you are interested in developing your own AI driven tools, feel free to reach out! I am always happy to explore collaborative opportunities!

Appendix

Here I aim to provide some extra details that, while interesting for the more technically inclined reader, might not be necessary for the overall flow of the article. The first topic we touch on is rate limit issues. We then describe a more detailed analysis of errors, the accumulated turn time used by the agents, and the parsing of responses from the LLMs. Finally, I provide a short description of how to try out the code base by cloning the Github repo and getting started with the Docker setup.

Rate Limit Issues

There are many actions that need to be considered every turn, and this leads to quite a lot of back-and-forth interaction between the program and the LLM providers. One issue that turned out to be slightly problematic for running longer experiments was rate limiting.

Rate limits are something the LLM providers put in place to protect against spamming and other potentially disruptive behavior, so even if you have funds in your account, the providers still limit the number of tokens you can query. For example, Anthropic enforces a rate limit of 1M tokens per day for their best model.

Rate limits for Anthropic models, taken from Anthropic console

And when you hit your rate limit, your LLM queries are answered with

Rate Limit Error: Rate limit reached for model `llama-3.1-70b-versatile` 
in organization `org_01j440c04tfr3aas7qctr0ejtk`
on : Limit 1000000, Used 999496, Requested 1573.
Please try again in 1m32.2828s.
Visit https://console.groq.com/docs/rate-limits for more information.
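A common way to cope with these errors is to wrap each LLM call in a retry with exponential backoff. The sketch below is a minimal example of that pattern; risk_game’s utils/decorators.py may handle this differently.

import random
import time
from functools import wraps

def retry_with_backoff(max_retries: int = 5, base_delay: float = 2.0):
    """Retry a rate-limited call with exponential backoff and jitter."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as err:  # ideally, catch the provider's specific rate-limit error
                    if attempt == max_retries - 1:
                        raise
                    delay = base_delay * 2 ** attempt + random.uniform(0, 1)
                    print(f"Request failed ({err}); retrying in {delay:.1f}s")
                    time.sleep(delay)
        return wrapper
    return decorator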

For many application areas this might not be a problem; however, the simulation queries each provider multiple times per turn (for strategy evaluation, card choice, troop placement, multiple attacks, and fortification), so this quickly adds up, especially in long games spanning many rounds. I was initially planning on running 10 games with the victory condition set to World Domination (meaning the winner would need to control all 42 territories), but because of how the LLMs play the game, this wasn’t feasible in my time frame. The victory conditions had to be adjusted so a winner could be determined at an earlier stage.

Tracking Errors

Some of the LLMs in the experiments also struggled with a large number of errors when prompted for moves. These ranged from trying to place troops in territories they didn’t control to fortifying to territories that were not connected. I implemented a few variables to track these errors, and they were far more common with the weaker models, as the plots below suggest:

Experiment-1 attack and fortify errors / image by author
Experiment-2 attack and fortify errors / image by author

Accumulated Turn Time

The last thing I tracked during the experiments was how much time each of the LLMs used on their actions. As expected, the largest and most complex models used the most time.

Accumulated turn time by player / image by author

What is clear is that Claude really seems to take its time. For Experiment-1, GPT-4o comes out ahead of llama3.1 70b running on Groq, but this is likely because Groq had more issues with internal server errors, in addition to errors in the returned answers, which drove the turn time up. For pure inference, when it provided a correct response, Groq was marginally faster than OpenAI.

Trending Towards Fewer Mistakes and More Robust Output

As we could see across the model generations, the newer LLMs generate far less erroneous output than the older models. This is important as we continue to build data products with the models and integrate them into pipelines. There will likely still be a need for some post-response error handling, but less than before.

Parsing Responses

A key issue in interacting with the LLMs is parsing the output they produce. OpenAI recently revealed that GPT-4o can “now reliably adhere to developer-supplied JSON Schemas.” This is of course great news, but many of the other models, such as llama 3.1 70B, still struggled to consistently return JSON output in the right format.

The solution to the parsing problem ended up being to pack the output into special text strings such as ||| output 1 ||| and +++ output 2 +++ and then use regex to parse those strings. I simply prompt the LLM to format its output using the special text strings, and also provide examples of correctly formatted output. I suspect that because LLMs are inherently sequence based, this type of formatting is easier for them out of the box than, for example, returning a complex JSON object. For a concrete example, see below:

'''Your response should be in the following format:
Move:|||Territory, Number of troops|||
Reasoning:+++Reasoning for move+++

For example:
Move:|||Brazil, 1|||
Reasoning:+++Brazil is a key territory in South America.+++'''
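As a concrete illustration, here is a minimal regex sketch for pulling the move and the reasoning out of a response formatted this way (not necessarily the exact pattern used in risk_game):

import re

response = """Move:|||Brazil, 1|||
Reasoning:+++Brazil is a key territory in South America.+++"""

# Grab whatever sits between the ||| ... ||| and +++ ... +++ delimiters
move_match = re.search(r"\|\|\|(.*?)\|\|\|", response, re.DOTALL)
reasoning_match = re.search(r"\+\+\+(.*?)\+\+\+", response, re.DOTALL)

if move_match:
    territory, troops = [part.strip() for part in move_match.group(1).split(",")]
    print(territory, int(troops))        # Brazil 1
if reasoning_match:
    print(reasoning_match.group(1).strip())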

Trying Out the Code and Running Your Own Experiments

I developed the risk_game engine package, along with the analysis modules and Jupyter notebooks, inside a Docker container, and everything is self-contained. So for anyone interested in trying out the simulator and running their own experiments, all the code is available and should be very easy to run from the Github repo.

Clone the repo and follow the instructions in the README.md file. It should be pretty straightforward. The only thing you need to change to get everything running on your own machine is the .env_example file. You need to put in your own API keys for the relevant LLM providers and change the name of the file to .env.

Then run the start_container.sh script. This is just a bash script that initializes some environment variables and runs a Docker Compose YAML file. That file configures the appropriate settings for the Docker container, and everything should start up by itself. (The reason we feed these environment variables into the Docker container is that, when doing in-container development, you can run into file-permission issues on files created inside the container. This is fixed by changing the container user to your user, so that files created by the container have the same owner as the user on the host machine.)

If you enjoyed reading this article and would like to access more content from me please feel free to connect with me on LinkedIn at https://www.linkedin.com/in/hans-christian-ekne-1760a259/ or visit my webpage at https://www.ekneconsulting.com/ to explore some of the services I offer. Don’t hesitate to reach out via email at hce@ekneconsulting.com