19.06.26

Most AI models predicted a Reform win in Makerfield. The voters delivered a Labour landslide.

Categories: Salford Business School

An experiment led by the University of Salford’s Professor Richard Whittle and University College London’s Dr James Ransom ran six leading AI (artificial intelligence) models as a synthetic electorate of 42,265 voters on the day of the Makerfield by-election. Four of the six called the seat for Reform, whereas in reality Andy Burnham won it for Labour with a majority of 9,231.

On the day Makerfield went to the polls, researchers at the University of Salford asked six of the most capable AI models to vote in the by-election, one simulated ballot at a time, across a synthetic electorate of 42,265 voters built from census data.

The results…

Four of the six predicted that Reform UK’s Robert Kenyon would take the seat. The result declared overnight was a Labour victory for Andy Burnham on 54.8%, a majority of 9,231, with Reform on 34.5%.

The most notable finding was not whether the models were right (they mostly were not), but how far they disagreed. Given the same briefing, the same instruction and the same voters, the six returned Reform vote shares ranging from 38.5% to 80.5%, a spread of more than 40 points. One, GPT-5.4-mini, came closest to the real outcome, placing Burnham on 60.7% against an actual 54.8%. Others predicted a Reform landslide of up to 80%.


Figure 1. The declared result against two full-electorate simulations. GPT-5.4-mini tracked the real outcome closely, while Grok-4.3 predicted a Reform landslide.

Every model reproduced familiar demographic patterns, with older voters and men more likely to back Reform. The predicted levels, however, were often far from reality, and most models missed what decided the contest: Burnham’s personal standing, the tactical consolidation of anti-Reform voters behind Labour, and the new Restore Britain party, which took 6.8%. The models compressed a 14-name ballot to two or three contenders, writing off the Greens, Liberal Democrats and Conservatives, each of which lost its deposit.

Approaching with caution…

Professor Richard Whittle described the exercise less as a forecasting tool than as a caution for anyone tempted to use these systems to read public opinion. He said: “The headline for anyone thinking of using these models to gauge public opinion is a simple one. Which model you ask matters more than how you ask it.

“We gave six systems an identical electorate and an identical briefing, and they produced everything from a large Labour win to a Reform supermajority.”

“What the models are good at is the texture: the age gap and the gender gap turn up every time. What they are poor at is the politics, the things that swing a by-election: a strong local candidate, tactical voting, a new party entering the field. A synthetic electorate built from demographics alone votes its stereotypes, and Makerfield did not,” continued University College London’s Dr James Ransom, lead author of the study.

The researchers note that one model performed well. “GPT-5.4-mini came close to the real result.

“But you only learn which model was right after the votes are counted and, on the day, you would have had no principled way of choosing between them. A one-in-three chance of calling the winner is not a method anyone would want to brief a campaign on,” concluded Richard.

______________________________________________________________________________________________

The paper. Makerfield By-Election: LLM Voting Simulation, a briefing note by James Ransom (UCL) and Richard Whittle (Salford Business School, University of Salford), 18 June 2026. Available on SSRN (paper 6963681).

Method in brief. A synthetic electorate of 42,265 voters (assumed 55% turnout) was built once from 2021 Census and ONS marginals for sex, age, ethnicity and household deprivation. Each voter was presented to every model as a one-line persona alongside a constant briefing and the full fourteen-candidate ballot, with candidate order randomised on every call to neutralise position effects. Sampling used a temperature of 1.0 at minimal reasoning effort.
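Illustrative sketch. The steps described above (build the electorate once, give each voter a one-line persona, and reshuffle the ballot on every call) could look roughly like the Python below. This is not the researchers' code. The marginal proportions are placeholders rather than the 2021 Census or ONS figures, and sampling each attribute independently is a simplifying assumption. The function names and persona wording are hypothetical, only the two candidates named in this release appear by name, and the call to a model is omitted.

```python
import random

# Placeholder marginals for illustration only; NOT the Census/ONS figures
# used in the study.
MARGINALS = {
    "sex": {"female": 0.51, "male": 0.49},
    "age": {"18-34": 0.25, "35-54": 0.33, "55-74": 0.30, "75+": 0.12},
    "ethnicity": {"White": 0.95, "other ethnic group": 0.05},
    "deprivation": {"not deprived": 0.50, "deprived in at least one dimension": 0.50},
}

# Two candidates are named in the release; the other twelve are placeholders.
BALLOT = ["Andy Burnham (Labour)", "Robert Kenyon (Reform UK)"] + [
    f"Candidate {i}" for i in range(3, 15)
]


def build_electorate(n, rng):
    """Draw n voters, sampling each attribute independently from its marginal."""
    columns = {
        attr: rng.choices(list(dist), weights=list(dist.values()), k=n)
        for attr, dist in MARGINALS.items()
    }
    return [{attr: columns[attr][i] for attr in MARGINALS} for i in range(n)]


def persona(voter):
    """Render one voter as a one-line persona (wording is hypothetical)."""
    return (
        f"You are a {voter['sex']} voter aged {voter['age']}, "
        f"ethnicity: {voter['ethnicity']}, household: {voter['deprivation']}."
    )


def build_prompt(voter, briefing, rng):
    """Combine the constant briefing, the persona and a freshly shuffled ballot."""
    ballot = BALLOT[:]   # copy, so the master ballot order is never mutated
    rng.shuffle(ballot)  # new candidate order on every call
    lines = [briefing, persona(voter), "Ballot:"]
    lines += [f"- {name}" for name in ballot]
    return "\n".join(lines)


rng = random.Random(0)                   # fixed seed: electorate is built once
electorate = build_electorate(42_265, rng)
prompt = build_prompt(electorate[0], "Constant briefing text goes here.", rng)
```

Each prompt would then be sent to a model and the returned choice tallied. Shuffling a copy of the ballot inside `build_prompt` means no candidate consistently benefits from appearing first.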

Models tested. GPT-5.4-mini, Claude Sonnet 4.6, Claude Haiku 4.5, Grok-4.3, Gemini-3.1-flash-lite and DeepSeek-v4-flash. Four ran across the full electorate; two ran as one per cent pilots (Claude Sonnet 4.6 and Claude Haiku 4.5) and are indicative only. Among the four full-electorate runs, only GPT-5.4-mini called the result correctly.

For all press office enquiries please email communications@salford.ac.uk.