Sports Analytics · Python · Azure AI

How can AI aid sports analytics?

Statistical Python analysis of 580 Premier League players across the 2023–24 season, combined with an Azure AI Prompt Flow that generates LLM-driven SWOT reports — giving coaches, analysts, and journalists something they rarely have: structured insight, fast.

Python · Pandas · Seaborn · Matplotlib | Azure AI Prompt Flow | Statistical hypothesis testing | 580 players · 20 teams | xG analysis | LLM SWOT reports
580 players across 20 teams analyzed
4 research questions with hypothesis tests
6 datasets merged for analysis
2 significant p-values below 0.05
Sample Output – Player Analysis

What the report looks like in practice.

The final Summary node synthesises all four SWOT dimensions into a narrative-style brief with embedded statistics — closely resembling a professional sports journalist's analytical output, while remaining fully grounded in deterministic data processing.

Sample AI-generated report · Cole Palmer · Chelsea · FW,MF
Illustrative output
Strengths
League leader in G+A (33) in debut Chelsea season: highest combined contribution of any player in 2023–24
Exceptional playmaking for a forward: 11 assists demonstrates elite vision beyond pure goalscoring
Goal output (22) significantly exceeds xG (13.4), indicating elite finishing and shot selection
Weaknesses
Performance heavily tied to Chelsea's possession system; may regress if tactical context changes
Limited Champions League experience at this level; sustained performance across European competition unproven
Small sample: one breakout season; consistency across multiple years not yet established
Opportunities
Continued development under a settled Chelsea tactical setup could push G+A even higher
England international role: increased visibility in major tournaments could elevate market value further
Upside if Chelsea strengthen around him: improved support cast could unlock even higher assist totals
Threats
xG overperformance of +8.6 goals (22 goals vs 13.4 xG) may partially revert; some regression expected
Chelsea's squad instability and frequent managerial changes create tactical uncertainty
Increased opposition attention in Year 2: defences will adapt their approach specifically to him

Sample Output – Comparative Analysis

The season's standout attackers.

All 580 players ranked by total goal contributions. The leaderboard is dominated by forwards, but the most balanced contributor, Ollie Watkins, reached third place with 13 assists as a striker.

#   Player           Position · Team            Goals  Assists  G+A
1   Cole Palmer      FW,MF · Chelsea            22     11       33
2   Erling Haaland   FW · Manchester City       27     5        32
3   Ollie Watkins    FW · Aston Villa           19     13       32
4   Mohamed Salah    FW · Liverpool             18     10       28
5   Phil Foden       FW,MF · Manchester City    19     8        27
6   Son Heung-min    FW · Tottenham Hotspur     17     10       27
7   Bukayo Saka      FW · Arsenal               16     9        25
8   Alexander Isak   FW · Newcastle United      21     2        23
9   Jarrod Bowen     FW · West Ham United       16     6        22
10  Dominic Solanke  FW · Bournemouth           19     3        22

Sample Output – Cluster Analysis

Twenty teams. Four tactical profiles.

Each team's goals for and goals against were compared to league averages to assign a tactical quadrant — revealing whether teams were built to dominate, grind, or simply survive the season.

⬆ Good Attack · ⬇ Good Defence
Manchester City, Arsenal, Liverpool, Aston Villa — The elite tier. Above-average in both dimensions, these teams dominated the table through tactical balance.
⬆ Good Attack · ⬆ Bad Defence
Chelsea, Newcastle United, Tottenham Hotspur, Crystal Palace — High-scoring but leaky. Entertainment value high; title contention limited.
⬇ Bad Attack · ⬇ Good Defence
Manchester United, Fulham, Brighton, Brentford, West Ham — Defensively solid, struggling to create. Survival through structure rather than firepower.
⬇ Bad Attack · ⬆ Bad Defence
Everton, Bournemouth, Wolves, Nottm Forest, Luton Town, Burnley, Sheffield United — Struggling on both ends. Relegation candidates and survival battles.
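The quadrant assignment described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the league table already carries the Team_Goals_For and Team_Goals_Against columns, and the function name, the tie-handling rule, and the toy numbers are all hypothetical.

```python
import pandas as pd


def assign_quadrant(table: pd.DataFrame) -> pd.DataFrame:
    """Label each team by comparing its goals for/against with the league mean."""
    out = table.copy()
    # Assumption: a team exactly on the league mean is not counted as "good".
    good_attack = out["Team_Goals_For"] > out["Team_Goals_For"].mean()
    good_defence = out["Team_Goals_Against"] < out["Team_Goals_Against"].mean()
    out["Quadrant"] = (
        good_attack.map({True: "Good Attack", False: "Bad Attack"})
        + " · "
        + good_defence.map({True: "Good Defence", False: "Bad Defence"})
    )
    return out


# Toy table with made-up teams and numbers, for illustration only.
toy_table = pd.DataFrame(
    {
        "Team": ["Team A", "Team B", "Team C", "Team D"],
        "Team_Goals_For": [80, 75, 40, 35],
        "Team_Goals_Against": [30, 60, 35, 70],
    }
)
quadrants = assign_quadrant(toy_table)
print(quadrants[["Team", "Quadrant"]])
```

Using the league mean as the cut-off follows the description above; a median split would be a reasonable alternative if a few extreme teams skew the averages.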

The data

Two Kaggle datasets. One merged picture.

The analysis draws on player-level and team-level data merged by team name — a step that required manual correction before any analysis could begin.

Primary dataset
premier-player-23-24
Individual player statistics for all 580 Premier League players in the 2023–24 season. Includes goals, assists, expected goals (xG), expected assisted goals (xAG), progressive carries, minutes played, cards, and per-90 metrics.
580 rows · 32 columns · 20 teams
Secondary datasets
pl_table_2023_24 + supplementary
Final league table with goals scored/conceded per team, plus supplementary datasets on match results (381 matches), player tackles won, player red cards, and team foul/yellow card rates by match.
381 matches · 303 tackle records · 55 red card records
Data engineering challenge
Three team names didn't match between datasets: Brighton vs Brighton & Hove Albion, Bournemouth vs AFC Bournemouth, and Wolverhampton vs Wolverhampton Wanderers. Manually resolved before merge. The scoresStr column (e.g. "66-42") was split into separate Team_Goals_For and Team_Goals_Against numeric fields, and Player_Pos was derived by taking only the first-listed position from multi-position entries (e.g. "MF,FW" → "MF").
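A minimal sketch of these three cleaning steps is shown below. The scoresStr, Team_Goals_For, Team_Goals_Against, and Player_Pos names come from the description above; the Team and Pos column names, the direction of the name mapping (which dataset used which spelling), and the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Assumed direction: the league table uses the long names, the player data the short ones.
TEAM_NAME_FIXES = {
    "Brighton & Hove Albion": "Brighton",
    "AFC Bournemouth": "Bournemouth",
    "Wolverhampton Wanderers": "Wolverhampton",
}


def clean_table(table: pd.DataFrame) -> pd.DataFrame:
    """Harmonise team names and split 'for-against' strings into numeric columns."""
    out = table.copy()
    out["Team"] = out["Team"].replace(TEAM_NAME_FIXES)
    goals = out["scoresStr"].str.split("-", expand=True).astype(int)
    out["Team_Goals_For"] = goals[0]
    out["Team_Goals_Against"] = goals[1]
    return out


def add_primary_position(players: pd.DataFrame) -> pd.DataFrame:
    """Keep only the first-listed position, e.g. 'MF,FW' becomes 'MF'."""
    out = players.copy()
    out["Player_Pos"] = out["Pos"].str.split(",").str[0]
    return out


# Toy rows with placeholder players and scores, not the real data.
table = pd.DataFrame(
    {
        "Team": ["Brighton & Hove Albion", "AFC Bournemouth", "Wolverhampton Wanderers"],
        "scoresStr": ["66-42", "50-60", "45-55"],
    }
)
players = pd.DataFrame(
    {
        "Player": ["Player A", "Player B", "Player C"],
        "Team": ["Brighton", "Bournemouth", "Wolverhampton"],
        "Pos": ["MF,FW", "FW", "DF,MF"],
    }
)

# validate="many_to_one" raises if a team appears twice in the league table.
merged = add_primary_position(players).merge(
    clean_table(table), on="Team", how="left", validate="many_to_one"
)
print(merged[["Player", "Team", "Player_Pos", "Team_Goals_For", "Team_Goals_Against"]])
```

After a left merge like this, checking the goal columns for missing values is a quick way to catch any team name that still fails to match.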

The AI layer

Stats in. Narrative out. Built for the people who need it.

Statistical outputs are accurate but not actionable for coaches, journalists, or scouts without interpretation. The Azure AI Prompt Flow takes Python-computed metrics and feeds them to a GPT-4o pipeline that generates structured SWOT reports — the kind of analysis a junior analyst would spend hours writing, delivered in seconds through Azure AI Studio's low-code interface.

🐍
Python analysis
Pandas computes G+A, xG vs Gls delta, positional breakdowns, and team quadrant classification
Azure AI Prompt Flow
Structured metrics are passed to four GPT-4o LLM nodes, one per SWOT dimension, whose outputs feed a final synthesis node
📋
SWOT report
Coaches, analysts, and journalists receive a plain-language tactical brief with embedded statistics — no stats background required

Prompt Flow architecture

A cloud-native, agentic workflow built on Azure.

The system is built around a deterministic Python backbone followed by LLM reasoning, so that metric computation is complete before any AI interpretation occurs. Player and team statistics are stored in Azure Blob Storage as .csv files and retrieved securely at runtime by a Python node that loads the data into pandas for preprocessing.

Azure AI Prompt Flow graph — inputs branch into Strength, Weakness, Opportunity and Threat LLM nodes, converge into Summary, and output the final report
Azure AI Studio · Prompt Flow graph: inputs → S · W · O · T nodes → Summary → outputs
☁️
Azure Blob Storage
Player and team-level statistics stored as .csv files in Azure Blob Storage. The Python node retrieves the dataset using secure account credentials at runtime — keeping sensitive credentials out of the codebase.
🐍
Python node — deterministic backbone
Filters data for the target team, computes KPIs — goals, assists, xG, xA, defensive indicators — and produces structured summaries passed downstream. No LLM reasoning touches the data until this step is complete.
💬
Two primary inputs
The flow accepts the target Premier League team name and optional chat history — preserving analytical context across multi-turn interactions so follow-up questions build on prior analysis naturally.
🤖
GPT-4o via Azure OpenAI
Each SWOT dimension is powered by a dedicated GPT-4o LLM node via Azure OpenAI. The final summary node synthesises a narrative-style tactical brief that reads like a professional sports journalist's analysis — grounded in deterministic data.
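The deterministic Python step described above might look something like the sketch below. It is only an illustration under stated assumptions: the download from Azure Blob Storage is left out, and raw_csv stands in for whatever bytes that download returns. The function names, the Team, Player, and Ast column names, and the toy rows are hypothetical; only xG and Gls are named in the analysis itself.

```python
import io

import pandas as pd


def load_players(raw_csv: bytes) -> pd.DataFrame:
    """Parse CSV bytes (e.g. a downloaded blob) into a DataFrame."""
    return pd.read_csv(io.BytesIO(raw_csv))


def team_kpis(players: pd.DataFrame, team: str) -> dict:
    """Filter to one team and return a small, JSON-friendly KPI summary."""
    squad = players[players["Team"] == team]
    if squad.empty:
        raise ValueError(f"No players found for team {team!r}")
    top = squad.loc[squad["Gls"].idxmax()]
    goals = int(squad["Gls"].sum())
    xg = float(squad["xG"].sum())
    return {
        "team": team,
        "players": int(len(squad)),
        "goals": goals,
        "assists": int(squad["Ast"].sum()),
        "xG": round(xg, 1),
        "xG_delta": round(goals - xg, 1),
        "top_scorer": str(top["Player"]),
    }


# Toy CSV with placeholder names and numbers, standing in for the stored file.
raw_csv = (
    b"Player,Team,Gls,Ast,xG\n"
    b"Player A,Team X,10,4,8.5\n"
    b"Player B,Team X,3,7,4.0\n"
    b"Player C,Team Y,6,1,6.2\n"
)
kpis = team_kpis(load_players(raw_csv), "Team X")
print(kpis)
```

Returning plain Python types rather than pandas objects keeps the summary easy to serialise into the prompts of the downstream LLM nodes.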

LLM node design

Four lenses. One tactical picture.

Each GPT-4o node interprets the Python-computed team metrics through a specific tactical lens. The nodes run in parallel from the same inputs, then converge into a final Summary node that synthesises a cohesive narrative with embedded statistics.

✓ Strength
Highlights dominant tactics and key contributors — which players are overperforming xG, which positions drive the most G+A, and what tactical patterns define the team's best performances.
− Weakness
Surfaces vulnerabilities and inefficiencies — teams underconverting chances, defensive frailties revealed by goals conceded vs xGA, and positional gaps in contribution metrics.
→ Opportunity
Identifies growth potential and exploitable matchups — where a team's xG profile suggests upside, or where upcoming fixture lists create favourable conditions for improvement.
⚠ Threat
Assesses external risks — key player injury exposure, schedule congestion, tactical overreliance on one contributor, or opponent-specific weaknesses that could be exploited against this team.
Key design feature

The system accepts contextual natural-language inputs alongside the team name — instructing the LLM nodes to provide explanatory analysis of specific questions posed by the user, with supporting graphs and visualisations generated from the dataset embedded directly in the output. This closes the loop between raw statistical analysis and the kind of tailored insight a coach or journalist would actually ask for.
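The fan-out and converge pattern of the four SWOT nodes, together with the optional user question, can be approximated in plain Python as below. This is a sketch of the pattern only, not the Prompt Flow definition: call_llm is a hypothetical callable standing in for a GPT-4o node, the prompt wording is invented, and a stub replaces the real model so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# One instruction per lens, paraphrased from the node descriptions above.
LENSES = {
    "Strength": "Highlight dominant tactics and key contributors.",
    "Weakness": "Surface vulnerabilities and inefficiencies.",
    "Opportunity": "Identify growth potential and exploitable matchups.",
    "Threat": "Assess external risks.",
}


def build_prompt(lens: str, instruction: str, metrics: dict, question: str = "") -> str:
    """Combine the lens instruction, the computed metrics, and any user question."""
    lines = [f"Write the {lens} section of a SWOT report.", instruction, f"Metrics: {metrics}"]
    if question:
        lines.append(f"User question: {question}")
    return "\n".join(lines)


def run_swot(metrics: dict, call_llm: Callable[[str], str], question: str = "") -> str:
    """Run the four lens prompts concurrently, then synthesise one summary."""
    with ThreadPoolExecutor(max_workers=len(LENSES)) as pool:
        futures = {
            lens: pool.submit(call_llm, build_prompt(lens, text, metrics, question))
            for lens, text in LENSES.items()
        }
        sections = {lens: future.result() for lens, future in futures.items()}
    summary_prompt = "Synthesise these sections into one tactical brief:\n" + "\n".join(
        f"{lens}: {text}" for lens, text in sections.items()
    )
    return call_llm(summary_prompt)


calls = []


def fake_llm(prompt: str) -> str:
    """Stub model: records the prompt and echoes its first line."""
    calls.append(prompt)
    return prompt.splitlines()[0]


report = run_swot(
    {"team": "Team X", "goals": 13, "xG": 12.5},
    fake_llm,
    question="Where is the attack most efficient?",
)
print(report)
```

Because the summary call only starts once all four lens results are in, the stub sees exactly five prompts, with the synthesis prompt last.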


Mathematical Framework

Four questions. Real statistical tests where it counts.

Research Question 1
Which teams had the highest average expected goals (xG), and how does that compare with their actual goals scored?
Grouped by team, computed average xG and actual goals per player. Reveals which teams are converting chances above expectation (overperformers) and which are leaving goals on the pitch.
Man City (avg xG / Gls): 3.28 xG → 3.76 Gls ↑
Arsenal (avg xG / Gls): 3.12 xG → 3.44 Gls ↑
Liverpool (avg xG / Gls): 3.01 xG → 2.67 Gls ↓
Everton (avg xG / Gls): 2.16 xG → 1.54 Gls ↓
Phil Foden (xG overperformer): 10.3 xG → 19 Gls (+8.7)

Finding: High xG generally correlates with more goals, but conversion varies significantly. Everton left the most goals on the pitch. Foden was the season's biggest xG overperformer at +8.7 goals above expectation.
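The per-team comparison behind this finding can be sketched with a single groupby. The xG and Gls column names follow the metrics quoted above; the Team column name, the function name, and the toy rows are assumptions for illustration.

```python
import pandas as pd


def team_xg_summary(players: pd.DataFrame) -> pd.DataFrame:
    """Average xG and goals per player for each team, plus the conversion gap."""
    summary = players.groupby("Team")[["xG", "Gls"]].mean()
    # Positive delta: scoring above expectation; negative: goals left on the pitch.
    summary["delta"] = summary["Gls"] - summary["xG"]
    return summary.sort_values("xG", ascending=False)


# Toy rows with made-up teams and numbers.
toy_players = pd.DataFrame(
    {
        "Team": ["Team X", "Team X", "Team Y", "Team Y"],
        "xG": [2.0, 4.0, 1.0, 3.0],
        "Gls": [3, 5, 1, 1],
    }
)
summary = team_xg_summary(toy_players)
print(summary)
```

Averaging per player, as described above, means squads with many low-minute players pull the team figure down; summing per team would be an alternative framing.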
Research Question 2
How does player contribution (goals + assists) differ across positions — DF, MF, FW?
Grouped by primary position and computed mean G+A. Ran a one-tailed t-test to check whether midfielders contribute significantly more than defenders, a distinction that informs how clubs value positional roles.
FW mean G+A: 6.70
MF mean G+A: 3.47
DF mean G+A: 1.81
GK mean G+A: 0.05

Null hypothesis H₀: MF mean G+A = DF mean G+A
t-statistic: 4.29
p-value: 0.0000132

Null hypothesis rejected. Midfielders contribute significantly more G+A than defenders (p << 0.05). The difference is not noise — it's structural.
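A one-tailed test of this kind can be run with SciPy roughly as follows. The sketch assumes Welch's unequal-variance variant, which the write-up does not specify, and a hypothetical G_A column holding goals plus assists; the toy values are invented and will not reproduce the t-statistic reported above.

```python
import pandas as pd
from scipy import stats


def mf_vs_df_test(players: pd.DataFrame):
    """One-tailed test: is mean G+A for midfielders greater than for defenders?"""
    mf = players.loc[players["Player_Pos"] == "MF", "G_A"]
    df_ = players.loc[players["Player_Pos"] == "DF", "G_A"]
    # alternative="greater" tests mean(mf) > mean(df_); equal_var=False is an assumption.
    return stats.ttest_ind(mf, df_, equal_var=False, alternative="greater")


# Toy sample with made-up values.
toy = pd.DataFrame(
    {
        "Player_Pos": ["MF"] * 6 + ["DF"] * 6,
        "G_A": [4, 5, 6, 3, 5, 4, 1, 2, 0, 1, 2, 1],
    }
)
result = mf_vs_df_test(toy)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.5f}")
```

The alternative argument requires a reasonably recent SciPy; on older versions a common workaround is to halve the two-sided p-value when the t-statistic has the expected sign.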
Research Question 3
Who are the top 10 players with the highest combined goals and assists?
Sorted all 580 players by total G+A. The resulting leaderboard confirms which forwards and attacking midfielders carried the most attacking weight for their teams across the full season.
#1 Cole Palmer (Chelsea): 22G + 11A = 33
#2 Erling Haaland (Man City): 27G + 5A = 32
#3 Ollie Watkins (Aston Villa): 19G + 13A = 32
#4 Mohamed Salah (Liverpool): 18G + 10A = 28
#5 Phil Foden (Man City): 19G + 8A = 27

Finding: Cole Palmer led the league in G+A in his debut season at Chelsea. Haaland led outright in goals (27). Watkins was the most balanced contributor: his 13 assists were the most of any player in the top ten, despite his being a striker.
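The ranking step is a short sort. In the sketch below the five rows are the figures quoted above, deliberately shuffled; the Ast and G_A column names and the goals-based tie-break (which places Haaland ahead of Watkins at 32) are assumptions rather than details confirmed by the analysis.

```python
import pandas as pd


def top_contributors(players: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Rank players by goals plus assists, breaking ties on goals."""
    ranked = players.assign(G_A=players["Gls"] + players["Ast"])
    ranked = ranked.sort_values(["G_A", "Gls"], ascending=False)
    return ranked.head(n).reset_index(drop=True)


# Top-five figures from the leaderboard above, in shuffled order.
sample = pd.DataFrame(
    {
        "Player": ["Phil Foden", "Ollie Watkins", "Cole Palmer", "Mohamed Salah", "Erling Haaland"],
        "Gls": [19, 19, 22, 18, 27],
        "Ast": [8, 13, 11, 10, 5],
    }
)
leaderboard = top_contributors(sample, n=3)
print(leaderboard)
```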
Research Question 4
How can we categorize teams based on total goals scored and conceded?
Compared each team's goals for/against to league averages, classifying all 20 teams into four quadrants. A second t-test confirmed that teams with above-average xG score significantly more goals in practice.
Null hypothesis H₀: high-xG teams = low-xG teams in actual Gls
t-statistic: 3.22
p-value: 0.0036

Null hypothesis rejected. Teams with higher xG score significantly more goals (p = 0.0036). xG is a meaningful predictor of offensive output at team level.
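One way to set up this team-level comparison is to split teams at the league-average xG and compare actual goals between the two groups. The sketch below assumes team-level xG and Gls totals, a split at the mean, and a two-sided Welch test; none of these details are confirmed above, and the toy numbers are invented.

```python
import pandas as pd
from scipy import stats


def high_vs_low_xg_test(teams: pd.DataFrame):
    """Compare actual goals between above-average and below-average xG teams."""
    is_high = teams["xG"] > teams["xG"].mean()
    high_goals = teams.loc[is_high, "Gls"]
    low_goals = teams.loc[~is_high, "Gls"]
    # Welch's t-test (equal_var=False) is an assumption, not the documented choice.
    return stats.ttest_ind(high_goals, low_goals, equal_var=False)


# Toy team totals with made-up values.
toy_teams = pd.DataFrame(
    {
        "Team": ["A", "B", "C", "D", "E", "F"],
        "xG": [80.0, 75.0, 70.0, 40.0, 38.0, 35.0],
        "Gls": [85, 78, 72, 41, 36, 33],
    }
)
team_result = high_vs_low_xg_test(toy_teams)
print(f"t = {team_result.statistic:.2f}, p = {team_result.pvalue:.4f}")
```

With only 20 teams, each group is small, so a correlation between xG and goals across all teams would be a natural complement to this split.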

Personal reflection
Behind the build

The most interesting tension in this project was the gap between what the statistics say and what a coach or journalist actually needs to hear. A t-statistic of 4.29 means nothing without context. Neither does a p-value of 0.0000132, unless someone translates it into a decision. The Prompt Flow SWOT layer was built to close that gap: taking outputs that are statistically rigorous and making them accessible to people who don't read pandas DataFrames for fun.

The xG analysis surfaced something genuinely surprising. Liverpool, despite being one of the top xG generators in the league, underperformed their expectation significantly — while Manchester City and Arsenal converted above it. That kind of efficiency differential doesn't show up in the final league table in an obvious way, but it explains a lot about how close the title race actually was on a per-chance basis.

The broader lesson: combining deterministic statistical analysis with LLM narrative synthesis isn't just a technical pattern — it's a communication strategy. The numbers build credibility. The narrative builds understanding. Neither works as well without the other.
