Can frontier AI models actually play Gin Rummy?

Overview

Goal
Run Gin Rummy self-play matches between different AI models to compare strategic strength, style, stability, and true intelligence.
Key idea:
  • Multiple AI models play Gin Rummy head-to-head.
  • We run hundreds (eventually thousands) of games per matchup.
  • We analyze win rates, scoring patterns, and qualitative behaviors.
Artificial General Intelligence?
A truly intelligent machine should be able to understand, learn, and apply knowledge across a wide variety of tasks at a level equal to or exceeding that of a human being. Gin Rummy is a popular (and public) game, but can frontier AI models actually play the game and make smart decisions?
Why Gin Rummy
Gin Rummy is a strong test of AI intelligence because it combines strategy, memory, and reasoning under uncertainty in a compact, rule-based environment. Unlike perfect-information games such as Chess or Go, Gin Rummy involves hidden information. An AI must infer the opponent's hand, update beliefs after every move, and make decisions without knowing the full game state. This tests probabilistic reasoning rather than brute-force search. Don't know how Gin Rummy is played? You can find more info about it here.
Beyond memory, the game highlights decision-making under ambiguity. Players rarely have complete certainty. They act on partial clues, like what an opponent picks up, what they ignore, and how quickly they play. Success depends on forming and revising hypotheses about another agent's intentions, adjusting strategy dynamically, and avoiding predictability. This kind of adaptive, interactive reasoning is closer to real-world intelligence than solving a static puzzle.
Finally, Gin Rummy stresses risk management and long-term planning. Players must balance safe play against high-reward strategies, decide when to knock, and adjust tactics based on the score and the opponent's tendencies. In doing so, the game evaluates integrated cognition—reasoning, memory, planning, uncertainty management, and adaptation all at once—making it a meaningful benchmark for assessing deeper intelligence in AI systems.

Research Questions & Hypothesis

Primary research questions (RQ):
  • RQ1: Which model performs best in Gin Rummy over many games?
  • RQ2: Do models exhibit distinct styles (aggressive, defensive, risk-averse, etc.)?
  • RQ3: Do models actually understand the game?
Hypotheses
  • H1: Larger / more advanced models (e.g., GPT-5.2, claude-opus-4-6) will have higher win rates than others.
  • H2: Some models will show more conservative strategies (reflected in lower score variance).
  • H3: All models will occasionally make moves that are clearly irrational or illegal, revealing gaps in their understanding of the game.

Experimental Design

Matchup structure
The goal is to have every model play every other model for the same number of episodes under the same conditions. With 5 models, there are 10 distinct pairs, so 10 episodes per round. To keep conditions fair, we then run 10 more episodes in which everything (hands, stock pile, etc.) is held constant and only the seats are swapped. If the previous winner wins again from the other seat, that is a stronger signal that it is genuinely the better player. Repeating this procedure hundreds (or thousands) of times gives us enough games for statistical significance.
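The pairing and seat-swapping logic above can be sketched in a few lines (model names are the ones listed in this post; the episode-scheduling code here is illustrative, not the project's actual implementation):

```python
from itertools import combinations

models = ["claude-opus-4-6", "DeepSeek-V3.2", "gemini-3-pro", "GPT-5.2", "grok-4-1"]

# Every unordered pair plays twice per seed: once as dealt, once with seats swapped,
# so the same hands and stock pile are seen from both sides of the table.
pairs = list(combinations(models, 2))
episodes_per_seed = [(a, b) for a, b in pairs] + [(b, a) for a, b in pairs]

print(len(pairs))              # 10 distinct pairs for 5 models
print(len(episodes_per_seed))  # 20 episodes per seed after the seat swap
```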

Models & Configurations

At this moment, we're testing 5 frontier models:
  • claude-opus-4-6
  • DeepSeek-V3.2
  • gemini-3-pro
  • GPT-5.2
  • grok-4-1
We're using LiteLLM to facilitate the communication between the game and the AI models.

Example episode parameters (YAML):

player0:
  type: llm
  name: "Claude Opus 4.6"
  llm_config:
    model: claude-opus-4-6
    temperature: 1.0
    max_tokens: 16384
  harness_config:
    max_retries: 3
    turn_timeout_ms: 120000
    retry_timeout_ms: 120000
    total_turn_timeout_ms: 200000

player1:
  type: llm
  name: "Grok 4.1"
  llm_config:
    model: xai/grok-4-1-fast-non-reasoning
    temperature: 1.0
    max_tokens: 8192
  harness_config:
    max_retries: 3
    turn_timeout_ms: 120000
    retry_timeout_ms: 120000
    total_turn_timeout_ms: 200000

seeds:
  # Per-match seeds for reproducible shuffles (25 matches)
  - 201
  - 202
  - 203
  - 204
  - 205
  - 206
  - 207
  - 208
  - 209
  - 210
  - 211
  - 212
  - 213
  - 214
  - 215
  - 216
  - 217
  - 218
  - 219
  - 220
  - 221
  - 222
  - 223
  - 224
  - 225
history_level: last3
By reading the example above, you can conclude that:
  • All models use temperature = 1.0. A lower temperature might improve results, but our first goal is to measure how capable these models are with default values. We also avoided tuning temperature because the search space quickly becomes unbounded, and we might end up favoring one provider over another, which is not our goal.
  • harness_config controls how we interact with the models. Here, each model gets up to 3 attempts to respond on every turn, and we wait up to 120 seconds for each response.
  • history_level set to "last3" means we pass the players the last 3 discarded cards. We do this to approximate what a human player would remember before each move.
  • Any other configuration options these models accept are left untouched, meaning we use the defaults set by LiteLLM or by the model providers themselves.
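The harness_config values above amount to a retry loop with a per-call and a total time budget. A minimal sketch (the function names and error handling are ours, not the real harness):

```python
import time

def ask_with_retries(call_model, prompt, max_retries=3,
                     turn_timeout_s=120, total_turn_timeout_s=200):
    """Query the model, retrying on invalid responses within a total time budget."""
    deadline = time.monotonic() + total_turn_timeout_s
    last_error = None
    for _attempt in range(max_retries):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # total_turn_timeout_ms exhausted
        try:
            # Each individual call is capped at turn_timeout_s, or less if the
            # total budget is nearly spent.
            return call_model(prompt, timeout=min(turn_timeout_s, remaining))
        except ValueError as exc:  # malformed or illegal response
            last_error = exc
    raise RuntimeError(f"No valid move after {max_retries} attempts: {last_error}")
```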

Prompting & Decision Protocol

We used the role-based prompting technique, passing the role and the Gin Rummy rules via the system prompt, and passing relevant information about the cards in every turn via the user prompt.
Prompt format

System prompt:

You are playing Gin Rummy. Here are the rules:

## Objective
Form melds (sets or runs) to reduce your deadwood (unmelded cards) to 10 or less, which allows you to knock (if you want).

## Card Values (Deadwood)
- Ace = 1 point
- Number cards (2-9) = face value
- Ten (T) = 10 points
- Face cards (J, Q, K) = 10 points

## Card Notation
Cards are represented by rank and suit:
- Ranks: A, 2, 3, 4, 5, 6, 7, 8, 9, T (ten), J, Q, K
- Suits: hearts (♥), diamonds (♦), clubs (♣), spades (♠)

## Melds
- **Set**: 3-4 cards of the same rank (e.g., 7♥ 7♦ 7♣)
- **Run**: 3+ consecutive cards of the same suit (e.g., 4♠ 5♠ 6♠)
- Ace is LOW only (A-2-3 is valid, Q-K-A is NOT valid)

## Game Phases
Each turn progresses through phases:
1. **Opening Draw**: At match start, non-dealer may accept or pass the upcard
2. **Draw**: Draw a card from stock pile OR discard pile
3. **Knock Decision**: If deadwood ≤ 10, you can knock (or gin if deadwood = 0)
4. **Discard**: Discard one card to complete your turn
5. **Layoff**: After a knock, defender may lay off cards on knocker's melds

## First Turn Special Rule
On the first turn, non-dealer is offered the upcard:
- Non-dealer may accept or pass the upcard
- If non-dealer passes, dealer may accept or pass
- If both pass, non-dealer draws from stock

## What you cannot do:
Immediately discard a card that you just picked up from the discard pile.

## Ending the Hand
- **Knock**: You can knock when Deadwood ≤ 10. But the Opponent can lay off cards on your melds in order to reduce their Deadwood. If your deadwood is smaller, you gain x points (x = their deadwood - your deadwood).
- **Gin**: Gin is achieved when Deadwood = 0. You gain bonus points (25) + the opponent's Deadwood. The opponent cannot lay off cards on your melds.
- **Undercut**: If opponent's deadwood ≤ knocker's, opponent wins x points (x = knocker's deadwood - opponent's) + bonus points for undercutting (25).
After someone knocks, the Hand is over and, in case nobody reached 100 points yet, a new Hand starts from scratch.

## End goal:
The first player to achieve 100 points or more wins the game.

## Response Format
Always respond with a JSON object specifying your action. Examples:

For Opening Draw phase:
{"type": "accept"} (to take the upcard)
{"type": "pass"} (to decline the upcard)

For Draw phase:
{"type": "draw", "source": "stock"} or {"type": "draw", "source": "discard"}

For Knock Decision phase:
{"type": "knock"} (server calculates optimal melds)
{"type": "gin"} (server verifies gin condition)
{"type": "pass"} (to skip knocking)

For Discard phase:
{"type": "discard", "card": {"suit": "hearts", "rank": "7"}}

For Layoff phase:
{"type": "layoff", "card": {"suit": "hearts", "rank": "7"}} (one card at a time)
{"type": "pass"} (when done laying off)

Respond with ONLY the JSON object, no explanation.

User prompt (per turn):

## Turn {turn_number} - Phase: {phase}

**Your Hand ({hand_size} cards):** {hand}
**Your Deadwood:** {deadwood} points

**Discard Pile Top:** {discard_top}
**Stock Pile:** {stock_size} cards remaining
**Opponent Hand Size:** {opponent_hand_size} cards

{history_section}
{first_turn_note}

{game_scores_section}

**Available Actions:**
{available_actions}

Choose your action (respond with JSON only):

Feedback prompt (used when the model makes a mistake). When our harness finds a malformed response, an illegal move, or ambiguous text, we try again using this prompt:

## Error - Please Try Again

**Issue:** {error_message}

**Your response was:**
```
{raw_response}
```

**Available Actions:**
{available_actions}

Please respond with ONLY the JSON object for your chosen action.
All models receive those exact same prompts.
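The response-handling side of this protocol can be sketched as a small parser/validator that either returns a usable action or raises the error that feeds the retry prompt (simplified; the real harness may differ):

```python
import json

def parse_action(raw_response, available_actions):
    """Parse a model reply and check that it names an allowed action type."""
    try:
        action = json.loads(raw_response.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"Malformed JSON: {exc}")
    if action.get("type") not in available_actions:
        raise ValueError(f"Illegal action type: {action.get('type')!r}")
    return action
```

On failure, the raised message becomes the `{error_message}` slot of the feedback prompt, together with the raw response and the list of available actions.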

Game Logic Implementation

We developed a backend that lets the models play the game just as a human would. The only limitations are:
  • Automatic layoff hasn't been implemented yet.
  • There is no hand rearrangement at the end.
  • Big Gin isn't implemented yet.
We are still adding features to the game, so these may already be available in the next version of the experiment.
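The core scoring the backend performs follows directly from the card values in the rules above. A minimal sketch of deadwood counting, assuming the melds are already identified (finding the *optimal* meld split is a separate, harder problem that the engine handles; this is not the engine's actual code):

```python
def card_points(rank):
    """Deadwood value of a single card: A=1, 2-9 face value, T/J/Q/K=10."""
    if rank == "A":
        return 1
    if rank in ("T", "J", "Q", "K"):
        return 10
    return int(rank)

def deadwood(hand, melds):
    """Sum the points of cards not covered by any meld.

    `hand` is a list of (rank, suit) tuples; `melds` is a list of card lists,
    assumed valid and disjoint.
    """
    melded = {card for meld in melds for card in meld}
    return sum(card_points(rank) for rank, suit in hand if (rank, suit) not in melded)
```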

Data Collection

We are collecting many data points during the game. You can find some examples below.
Per-move logging:
  • Move ID, active player.
  • Game state (e.g., available actions).
  • Raw LLM response and parsed action.
  • Latency, token usage.
Per-episode summary:
  • Episode ID
  • Final scores for Player 0 and Player 1.
  • Winner, loser, and their scores.
Per-match summary (every episode can contain multiple matches until the winner reaches 100 points):
  • Number of turns.
  • How the match ended (Gin, knock, undercut, timeout, error).
  • Number of points gained.
  • Deadwood values.
Metadata:
  • Model id and config (retries allowed, temperature, etc.).
  • Date/time of game.
  • Seed values used (to make sure we can reproduce the same matches when needed).
  • Software version (engine version, prompt version, etc.).
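Putting the per-move fields above together, a single logged record might look like this (all field names here are illustrative, not the project's actual schema):

```python
import json

move_record = {
    "move_id": 42,
    "episode_id": "ep-0017",
    "active_player": "player0",
    "phase": "discard",
    "available_actions": ["discard"],
    "raw_response": '{"type": "discard", "card": {"suit": "hearts", "rank": "7"}}',
    "parsed_action": {"type": "discard", "card": {"suit": "hearts", "rank": "7"}},
    "latency_ms": 1843,
    "tokens": {"prompt": 912, "completion": 24},
}

# One JSON line per move makes episodes easy to replay and aggregate later.
print(json.dumps(move_record))
```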

Metrics & Analysis Plan

Core quantitative metrics:
  • Total games played per model.
  • Win rate per model (overall).
  • Average points per episode.
  • Average cost per episode.
  • Frequency of Gin/knock/undercut events per model.
Statistical analysis:
  • Confidence intervals: Method for win rate CIs (e.g., Wilson interval).
  • Significance tests: e.g., binomial tests for 1 vs 1 checks and chi-square across matchups.
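The Wilson interval mentioned above is a standard closed-form formula; a self-contained implementation (generic statistics code, not project-specific):

```python
from math import sqrt

def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a win rate (z=1.96 gives ~95% confidence)."""
    if games == 0:
        return (0.0, 1.0)
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / games + z**2 / (4 * games**2))
    return (center - half, center + half)
```

For example, 60 wins in 100 games gives an interval of roughly (0.50, 0.69), which is why hundreds of games per matchup are needed before small win-rate gaps become meaningful.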
Stability/robustness checks:
  • Performance across seeds: we can dive deep and evaluate variability across seeds.
  • Temporal stability: we can track whether model behavior drifts over time.
  • Sensitivity analyses: in the future, we can do experiments with different temperatures/prompt variants.
Qualitative analysis:
  • Manual review of selected games: outlier games where a weaker model beats a stronger one; games with obviously suboptimal or surprising decisions.
  • Strategy patterns: aggressiveness of knocking; tendencies in holding/discarding high cards; patterns in building specific melds.

Results

If you look at our leaderboard, you'll see that Claude Opus 4.6 is the clear winner after our first experiment, followed by Gemini 3 Pro and GPT 5.2. Claude is a more conservative player, managing its deadwood and knocking faster than the others, while models like GPT 5.2 take more risks and wait to see if they can win by Gin instead of knocking. All of the models have one thing in common: none of them understands the game perfectly. Every model attempted an illegal move in at least some episodes. For example, each one tried, at least once, to discard a card it had just picked up from the discard pile, even though the system prompt explicitly states that this is against the rules of the game.

Future Work

Experiment extensions:
  • Add human-vs-AI play to see how these models fare against human opponents.
  • Add more models or updated versions.
  • Run more episodes for higher statistical power.
  • Include other types of games, allowing us to assess other aspects of intelligence.
Methodological improvements:
  • Incorporate reasoning/explanation prompts (“Explain your move”) and evaluate quality.
  • Fine-tune the models using human gameplay data and measure potential improvements.
  • Better handling of illegal moves and robust parsing.
Generalization & deployment:
  • Building an ensemble agent based on the strengths of different models.
  • Applying this framework to non-game domains (planning, negotiation, etc.).

CONTACT US

Do you want to know more about the project?