EPL Quant Analysis — Case Study

Sample Output – Player Analysis

What the report looks like in practice.

The final Summary node synthesises all four SWOT dimensions into a narrative-style brief with embedded statistics — closely resembling a professional sports journalist's analytical output, while remaining fully grounded in deterministic data processing.

Sample AI-generated report · Cole Palmer · Chelsea · FW,MF

Illustrative output

Strengths

League leader in G+A (33) in debut Chelsea season — highest combined contribution of any player in 2023–24

Exceptional playmaking for a forward: 11 assists demonstrate elite vision beyond pure goalscoring

Goal output (22) significantly exceeds xG (13.4), indicating elite finishing and shot selection

Weaknesses

Performance heavily tied to Chelsea's possession system — may regress if tactical context changes

Limited Champions League experience at this level; sustained performance across European competition unproven

Small sample: one breakout season; consistency across multiple years not yet established

Opportunities

Continued development under a settled Chelsea tactical setup could push G+A even higher

England international role — increased visibility in major tournaments could elevate market value further

Upside if Chelsea strengthen around him: an improved supporting cast could unlock even higher assist totals

Threats

xG overperformance of +8.6 goals (22 goals against 13.4 xG) may partially revert; some regression expected

Chelsea's squad instability and frequent managerial changes create tactical uncertainty

Increased opposition attention in Year 2 — defences will adapt their approach specifically to him

Sample Output – Comparative Analysis

The season's standout attackers.

All 580 players ranked by total goal contributions. The leaderboard is dominated by forwards, and the most balanced contributor, Ollie Watkins, ranked third with 19 goals and 13 assists as a striker.

#   Player            Position   Team                Goals  Assists  G+A
1   Cole Palmer       FW,MF      Chelsea             22     11       33
2   Erling Haaland    FW         Manchester City     27     5        32
3   Ollie Watkins     FW         Aston Villa         19     13       32
4   Mohamed Salah     FW         Liverpool           18     10       28
5   Phil Foden        FW,MF      Manchester City     19     8        27
6   Son Heung-min     FW         Tottenham Hotspur   17     10       27
7   Bukayo Saka       FW         Arsenal             16     9        25
8   Alexander Isak    FW         Newcastle United    21     2        23
9   Jarrod Bowen      FW         West Ham United     16     6        22
10  Dominic Solanke   FW         Bournemouth         19     3        22

Sample Output – Cluster Analysis

Twenty teams. Four tactical profiles.

Each team's goals for and goals against were compared to league averages to assign a tactical quadrant — revealing whether teams were built to dominate, grind, or simply survive the season.

⬆ Good Attack · ⬇ Good Defence

Manchester City, Arsenal, Liverpool, Aston Villa — The elite tier. Above-average in both dimensions, these teams dominated the table through tactical balance.

⬆ Good Attack · ⬆ Bad Defence

Chelsea, Newcastle United, Tottenham Hotspur, Crystal Palace — High-scoring but leaky. Entertainment value high; title contention limited.

⬇ Bad Attack · ⬇ Good Defence

Manchester United, Fulham, Brighton, Brentford, West Ham — Defensively solid, struggling to create. Survival through structure rather than firepower.

⬇ Bad Attack · ⬆ Bad Defence

Everton, Bournemouth, Wolves, Nottm Forest, Luton Town, Burnley, Sheffield United — Struggling on both ends. Relegation candidates and survival battles.

The data

Two Kaggle datasets. One merged picture.

The analysis draws on player-level and team-level data merged by team name — a step that required manual correction before any analysis could begin.

Primary dataset

premier-player-23-24

Individual player statistics for all 580 Premier League players in the 2023–24 season. Includes goals, assists, expected goals (xG), expected assisted goals (xAG), progressive carries, minutes played, cards, and per-90 metrics.

580 rows · 32 columns · 20 teams

Secondary datasets

pl_table_2023_24 + supplementary

Final league table with goals scored/conceded per team, plus supplementary datasets on match results (381 matches), player tackles won, player red cards, and team foul/yellow card rates by match.

381 matches · 303 tackle records · 55 red card records

Data engineering challenge

Three team names didn't match between datasets: Brighton vs Brighton & Hove Albion, Bournemouth vs AFC Bournemouth, and Wolverhampton vs Wolverhampton Wanderers. These were resolved manually before the merge. The scoresStr column (e.g. "66-42") was split into separate Team_Goals_For and Team_Goals_Against numeric fields, and Player_Pos was derived by taking only the first-listed position from multi-position entries (e.g. "MF,FW" → "MF").
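The three cleaning steps above can be sketched in plain Python. This is a dependency-free illustration rather than the project's pandas code. The direction of the name mapping, the "Team" key and the function names are assumptions; only scoresStr, Team_Goals_For, Team_Goals_Against and Player_Pos come from the description above.

```python
# Hypothetical mapping: the text lists the mismatched pairs but not which
# dataset uses which spelling, so the direction here is an assumption.
TEAM_NAME_FIXES = {
    "Brighton": "Brighton & Hove Albion",
    "Bournemouth": "AFC Bournemouth",
    "Wolverhampton": "Wolverhampton Wanderers",
}


def normalise_team(name: str) -> str:
    """Return the canonical team name, leaving already-matching names alone."""
    cleaned = name.strip()
    return TEAM_NAME_FIXES.get(cleaned, cleaned)


def split_score(scores_str: str) -> tuple[int, int]:
    """Split a 'for-against' string such as '66-42' into two integers."""
    goals_for, goals_against = scores_str.split("-")
    return int(goals_for), int(goals_against)


def primary_position(pos: str) -> str:
    """Keep only the first-listed position, e.g. 'MF,FW' -> 'MF'."""
    return pos.split(",")[0].strip()


def clean_team_row(row: dict) -> dict:
    """Return a copy of a league-table row with a normalised name and numeric goals."""
    goals_for, goals_against = split_score(row["scoresStr"])
    return {
        **row,
        "Team": normalise_team(row["Team"]),  # "Team" is an assumed column name
        "Team_Goals_For": goals_for,
        "Team_Goals_Against": goals_against,
    }
```

Normalising both datasets to one spelling before the join means a missed name shows up as an unmatched row rather than a silently dropped team.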

The AI layer

Stats in. Narrative out. Built for the people who need it.

Statistical outputs are accurate but not actionable for coaches, journalists, or scouts without interpretation. The Azure AI Prompt Flow takes Python-computed metrics and feeds them to a GPT-4o pipeline that generates structured SWOT reports — the kind of analysis a junior analyst would spend hours writing, delivered in seconds through Azure AI Studio's low-code interface.

🐍

Python analysis

Pandas computes G+A, xG vs Gls delta, positional breakdowns, and team quadrant classification

→

⚡

Azure AI Prompt Flow

Structured metrics passed through dedicated GPT-4o LLM nodes, one per SWOT dimension, before final synthesis

→

📋

SWOT report

Coaches, analysts, and journalists receive a plain-language tactical brief with embedded statistics — no stats background required

Prompt Flow architecture

A cloud-native, agentic workflow built on Azure.

The system is built around a deterministic Python backbone paired with sequential LLM reasoning — ensuring accurate metric computation happens before any AI interpretation occurs. Player and team statistics are stored in Azure Blob Storage as .csv files, retrieved securely at runtime via a Python node that loads the data into pandas for preprocessing.

Figure: Azure AI Prompt Flow graph (Azure AI Studio). Inputs branch into Strength, Weakness, Opportunity and Threat LLM nodes, converge into Summary, and output the final report.

☁️

Azure Blob Storage

Player and team-level statistics stored as .csv files in Azure Blob Storage. The Python node retrieves the dataset using secure account credentials at runtime — keeping sensitive credentials out of the codebase.
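One way to keep credentials out of the codebase is to read them from the environment at runtime. The sketch below assumes the azure-storage-blob and pandas packages; the environment variable name, container and blob name are hypothetical, and the project's actual mechanism is not described in detail here.

```python
import io
import os

# Hypothetical name: the environment variable is an illustrative assumption,
# not a value taken from the project.
CONNECTION_ENV_VAR = "AZURE_STORAGE_CONNECTION_STRING"


def get_connection_string(env=os.environ) -> str:
    """Read the storage connection string from the environment, never from source."""
    value = env.get(CONNECTION_ENV_VAR)
    if not value:
        raise RuntimeError(f"{CONNECTION_ENV_VAR} is not set")
    return value


def load_csv_from_blob(container: str, blob_name: str):
    """Download a .csv blob and return it as a pandas DataFrame."""
    # Imported lazily so the credential helper stays usable without these packages.
    import pandas as pd
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(get_connection_string())
    blob = service.get_blob_client(container=container, blob=blob_name)
    raw_bytes = blob.download_blob().readall()
    return pd.read_csv(io.BytesIO(raw_bytes))
```

Failing fast when the variable is missing gives a clear error at the start of the flow instead of an authentication failure further downstream.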

🐍

Python node — deterministic backbone

Filters data for the target team, computes KPIs (goals, assists, xG, xAG, defensive indicators) and produces structured summaries passed downstream. No LLM reasoning touches the data until this step is complete.
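A minimal sketch of what such a node might compute, using plain dictionaries instead of pandas. The keys "Player", "Team", "Gls", "Ast" and "xG" are assumed column names, and the chosen KPIs are a small illustrative subset.

```python
def team_kpis(players: list[dict], team: str) -> dict:
    """Filter player rows to one team and return a structured KPI summary."""
    squad = [p for p in players if p["Team"] == team]
    if not squad:
        raise ValueError(f"No players found for team {team!r}")

    goals = sum(p["Gls"] for p in squad)
    assists = sum(p["Ast"] for p in squad)
    xg = sum(p["xG"] for p in squad)
    top = max(squad, key=lambda p: p["Gls"] + p["Ast"])

    return {
        "team": team,
        "players": len(squad),
        "goals": goals,
        "assists": assists,
        "xG": round(xg, 2),
        # Positive values mean the team scored more than its chances suggested.
        "goals_minus_xG": round(goals - xg, 2),
        "top_contributor": top["Player"],
        "top_contributor_G+A": top["Gls"] + top["Ast"],
    }


# Two rows taken from the figures quoted in this case study.
players = [
    {"Player": "Cole Palmer", "Team": "Chelsea", "Gls": 22, "Ast": 11, "xG": 13.4},
    {"Player": "Phil Foden", "Team": "Manchester City", "Gls": 19, "Ast": 8, "xG": 10.3},
]
summary = team_kpis(players, "Chelsea")
```

Returning a flat dictionary keeps the hand-off to the LLM nodes explicit: every number the model sees has already been computed deterministically.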

💬

Two primary inputs

The flow accepts the target Premier League team name and optional chat history — preserving analytical context across multi-turn interactions so follow-up questions build on prior analysis naturally.

🤖

GPT-4o via Azure OpenAI

Each SWOT dimension is powered by a dedicated GPT-4o LLM node via Azure OpenAI. The final summary node synthesises a narrative-style tactical brief that reads like a professional sports journalist's analysis — grounded in deterministic data.

LLM node design

Four lenses. One tactical picture.

Each GPT-4o node interprets the Python-computed team metrics through a specific tactical lens. The nodes run in parallel from the same inputs, then converge into a final Summary node that synthesises a cohesive narrative with embedded statistics.

✓ Strength

Highlights dominant tactics and key contributors — which players are overperforming xG, which positions drive the most G+A, and what tactical patterns define the team's best performances.

− Weakness

Surfaces vulnerabilities and inefficiencies — teams underconverting chances, defensive frailties revealed by goals conceded vs xGA, and positional gaps in contribution metrics.

→ Opportunity

Identifies growth potential and exploitable matchups — where a team's xG profile suggests upside, or where upcoming fixture lists create favourable conditions for improvement.

⚠ Threat

Assesses external risks — key player injury exposure, schedule congestion, tactical overreliance on one contributor, or opponent-specific weaknesses that could be exploited against this team.
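The branch-then-converge shape of the four lenses can be sketched outside Prompt Flow as a fan-out followed by a fan-in. In the real system the graph itself does this wiring; the function below only mirrors its shape. The llm argument is a hypothetical callable standing in for a GPT-4o node, and the instruction strings are paraphrased from the descriptions above rather than the project's actual prompts.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Illustrative instructions only; the project's actual prompts are not shown.
LENS_INSTRUCTIONS = {
    "Strength": "Highlight dominant tactics and key contributors.",
    "Weakness": "Surface vulnerabilities and inefficiencies.",
    "Opportunity": "Identify growth potential and exploitable matchups.",
    "Threat": "Assess external risks.",
}


def run_swot(metrics: dict, llm: Callable[[str], str]) -> dict:
    """Fan the same metrics out to four lens prompts, then fan in to a summary."""
    prompts = {
        lens: f"{instruction}\nTeam metrics: {metrics}"
        for lens, instruction in LENS_INSTRUCTIONS.items()
    }
    # The four lens calls are independent, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        futures = {lens: pool.submit(llm, prompt) for lens, prompt in prompts.items()}
        sections = {lens: future.result() for lens, future in futures.items()}

    # The summary step only starts once all four sections are available.
    summary_prompt = "Synthesise a tactical brief from:\n" + "\n".join(
        f"{lens}: {text}" for lens, text in sections.items()
    )
    return {**sections, "Summary": llm(summary_prompt)}
```

Passing a stub such as `lambda prompt: prompt` as llm makes the wiring testable without any model call.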

Key design feature

The system accepts contextual natural-language inputs alongside the team name — instructing the LLM nodes to provide explanatory analysis of specific questions posed by the user, with supporting graphs and visualisations generated from the dataset embedded directly in the output. This closes the loop between raw statistical analysis and the kind of tailored insight a coach or journalist would actually ask for.

Mathematical Framework

Four questions. Real statistical tests where it counts.

Research Question 1

Which teams had the highest average expected goals (xG), and how does that compare with their actual goals scored?

Grouped by team, computed average xG and actual goals per player. Reveals which teams are converting chances above expectation (overperformers) and which are leaving goals on the pitch.

Man City (avg xG / Gls): 3.28 xG → 3.76 Gls ↑

Arsenal (avg xG / Gls): 3.12 xG → 3.44 Gls ↑

Liverpool (avg xG / Gls): 3.01 xG → 2.67 Gls ↓

Everton (avg xG / Gls): 2.16 xG → 1.54 Gls ↓

Phil Foden (xG overperformer): 10.3 xG → 19 Gls (+8.7)

Finding: High xG generally correlates with more goals, but conversion varies significantly. Everton left the most goals on the pitch. Foden was the season's biggest xG overperformer at +8.7 goals above expectation.
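The grouping behind these figures can be sketched without pandas as follows. The keys "Team", "xG" and "Gls" are assumed column names, and the function is an illustration of the approach rather than the project's code.

```python
from collections import defaultdict
from statistics import mean


def team_conversion(players: list[dict]) -> list[dict]:
    """Average xG and goals per player for each team, sorted by average xG."""
    by_team = defaultdict(list)
    for p in players:
        by_team[p["Team"]].append(p)

    rows = []
    for team, squad in by_team.items():
        avg_xg = mean(p["xG"] for p in squad)
        avg_gls = mean(p["Gls"] for p in squad)
        rows.append({
            "team": team,
            "avg_xG": round(avg_xg, 2),
            "avg_Gls": round(avg_gls, 2),
            # Positive delta = scoring above expectation (overperforming).
            "delta": round(avg_gls - avg_xg, 2),
        })
    return sorted(rows, key=lambda r: r["avg_xG"], reverse=True)
```

Applied to the averages quoted above, the delta works out to +0.48 for Man City (3.76 − 3.28) and −0.62 for Everton (1.54 − 2.16).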

Research Question 2

How does player contribution (goals + assists) differ across positions — DF, MF, FW?

Grouped by primary position, computed mean G+A. Ran a one-tailed t-test to check whether midfielders contribute significantly more than defenders, a distinction that informs how clubs value positional roles.

FW mean G+A: 6.70

MF mean G+A: 3.47

DF mean G+A: 1.81

GK mean G+A: 0.05

Null hypothesis H₀: mean G+A (MF) = mean G+A (DF)

t-statistic: 4.29

p-value: 0.0000132

Null hypothesis rejected. Midfielders contribute significantly more G+A than defenders (p << 0.05). The difference is not noise — it's structural.
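For readers who want to see the mechanics, here is a minimal sketch of a one-tailed two-sample test using only the standard library. Two things are assumptions: the unequal-variance (Welch) form, since the text does not say which variant was run, and the normal approximation to the p-value, which is reasonable for groups of a hundred or more players but too optimistic for small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t_one_tailed(a: list[float], b: list[float]) -> tuple[float, float]:
    """Test H1: mean(a) > mean(b). Returns (t statistic, approximate p-value)."""
    # Standard error of the difference in means, without assuming equal variances.
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t_stat = (mean(a) - mean(b)) / standard_error
    # Normal approximation to the t distribution (assumption, see lead-in).
    p_value = 1.0 - NormalDist().cdf(t_stat)
    return t_stat, p_value
```

Here a would hold the G+A values for midfielders and b those for defenders. For an exact p-value, `scipy.stats.ttest_ind(a, b, equal_var=False, alternative="greater")` performs the same test against the t distribution.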

Research Question 3

Who are the top 10 players with the highest combined goals and assists?

Sorted all 580 players by total G+A. The resulting leaderboard confirms which forwards and attacking midfielders carried the most attacking weight for their teams across the full season.

#1 Cole Palmer (Chelsea): 22G + 11A = 33

#2 Erling Haaland (Man City): 27G + 5A = 32

#3 Ollie Watkins (Aston Villa): 19G + 13A = 32

#4 Mohamed Salah (Liverpool): 18G + 10A = 28

#5 Phil Foden (Man City): 19G + 8A = 27

Finding: Cole Palmer led the league in G+A in his debut season at Chelsea. Haaland led outright in goals (27). Watkins was the most balanced contributor: his 13 assists were the highest of any player in the top five, despite his playing as a striker.
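The ranking itself is a single sort. The sketch below assumes "Player", "Gls" and "Ast" as column names and leaves tied players in their input order, since the tie-break rule used for the leaderboard is not stated.

```python
def top_contributors(players: list[dict], n: int = 10) -> list[dict]:
    """Rank players by goals + assists; ties keep their input order."""
    # sorted() is stable, so equal G+A totals are not reordered.
    ranked = sorted(players, key=lambda p: p["Gls"] + p["Ast"], reverse=True)
    return [
        {"rank": i, "player": p["Player"], "G+A": p["Gls"] + p["Ast"]}
        for i, p in enumerate(ranked[:n], start=1)
    ]


# Rows taken from the leaderboard above, deliberately out of order.
sample = [
    {"Player": "Mohamed Salah", "Gls": 18, "Ast": 10},
    {"Player": "Erling Haaland", "Gls": 27, "Ast": 5},
    {"Player": "Phil Foden", "Gls": 19, "Ast": 8},
    {"Player": "Ollie Watkins", "Gls": 19, "Ast": 13},
    {"Player": "Cole Palmer", "Gls": 22, "Ast": 11},
]
top_three = top_contributors(sample, n=3)
```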

Research Question 4

How can we categorize teams based on total goals scored and conceded?

Compared each team's goals for/against to league averages, classifying all 20 teams into four quadrants. A second t-test confirmed that teams with above-average xG score significantly more goals in practice.

Null hypothesis H₀: teams with high xG and teams with low xG score the same number of actual goals (Gls)

t-statistic: 3.22

p-value: 0.0036

Null hypothesis rejected. Teams with higher xG score significantly more goals (p = 0.0036). xG is a meaningful predictor of offensive output at team level.
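The quadrant assignment described above reduces to two comparisons against league averages. In this sketch the input shape (team name mapped to a goals-for, goals-against pair) and the treatment of a team sitting exactly on the average are assumptions.

```python
from statistics import mean


def classify_quadrants(table: dict[str, tuple[int, int]]) -> dict[str, str]:
    """Label each team by goals for/against relative to the league averages."""
    avg_for = mean(gf for gf, _ in table.values())
    avg_against = mean(ga for _, ga in table.values())

    labels = {}
    for team, (goals_for, goals_against) in table.items():
        attack = "Good Attack" if goals_for > avg_for else "Bad Attack"
        # Conceding fewer goals than average counts as a good defence.
        defence = "Good Defence" if goals_against < avg_against else "Bad Defence"
        labels[team] = f"{attack} · {defence}"
    return labels
```

Because the thresholds are league means, a team close to either average can change quadrant on a handful of goals, which is worth keeping in mind when reading the labels.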

Personal reflection

Behind the build

The most interesting tension in this project was the gap between what the statistics say and what a coach or journalist actually needs to hear. An xG overperformance of +8.7 means nothing without context. Neither does a p-value of 0.0000132, unless someone translates it into a decision. The Prompt Flow SWOT layer was built to close that gap: taking outputs that are statistically rigorous and making them accessible to people who don't read pandas DataFrames for fun.

The xG analysis surfaced something genuinely surprising. Liverpool, despite being one of the top xG generators in the league, underperformed their expectation significantly — while Manchester City and Arsenal converted above it. That kind of efficiency differential doesn't show up in the final league table in an obvious way, but it explains a lot about how close the title race actually was on a per-chance basis.

The broader lesson: combining deterministic statistical analysis with LLM narrative synthesis isn't just a technical pattern — it's a communication strategy. The numbers build credibility. The narrative builds understanding. Neither works as well without the other.

How can AI aid sports analytics?