RAG · AI Sustainability · UW–Madison ML Marathon 2025
AI is transforming the world.
It's also quietly reshaping the planet.
The environmental cost of AI is real — but the evidence is buried across hundreds of peer-reviewed papers that most practitioners never read. WattBot is a RAG chatbot built to fix that: grounded, citation-backed answers on AI's energy and water footprint, with an honest fallback when the evidence simply isn't there.
Retrieval-Augmented Generation
50+ scholarly articles
Citation faithfulness
Numerical accuracy
Standardized fallback
AI sustainability
The problem
The data exists. Nobody can find it.
AI's environmental impact — energy consumption, water use, carbon emissions — is documented across peer-reviewed literature in machine learning, energy systems, and environmental science. But the research is fragmented, technical, and inaccessible to the practitioners and decision-makers who need it most. The gap isn't a lack of evidence. It's a retrieval problem.
The environmental reality
By one widely cited 2019 estimate, training a large NLP model with neural architecture search can emit as much CO₂ as five cars over their lifetimes
Data center water consumption for cooling is growing alongside compute demand
Inference at scale multiplies the footprint of every deployed model continuously
Pressure on energy grids is increasing as AI workloads concentrate geographically
Carbon accounting for AI remains inconsistent and largely unregulated
The access gap
Evidence buried across disciplines — no single source of truth
Numbers vary widely across papers, creating confusion rather than clarity
Practitioners lack time to conduct literature reviews for every decision
Policy discussions proceed without reliable, cited quantitative grounding
Generative AI fills gaps with confident estimates — not cited evidence
Why RAG
This is a retrieval problem. Not a generation problem.
A purely generative approach would produce fluent, confident answers about AI's environmental impact. It would also hallucinate specific numbers, conflate studies, and never admit when it doesn't know. For a domain where policy decisions depend on cited evidence, that failure mode is worse than no answer at all.
Purely generative approach
Produces confident estimates without citation — unverifiable
Cannot distinguish between well-supported and weakly-supported claims
Conflates numbers across different studies, models, and methodologies
No fallback — fills every gap with a generated approximation
Fluent but not faithful — the worst combination for policy contexts
RAG approach
Answers grounded in retrieved documents with traceable reference IDs
Numeric estimates tied to specific studies with stated methodologies
Supporting material surfaced: quotes, tables, figures from source papers
Standardized fallback emitted when corpus evidence is insufficient
Honest about uncertainty — the only acceptable stance in sustainability contexts
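To make "grounded in retrieved documents with traceable reference IDs" concrete, here is a minimal retrieval sketch. It is not WattBot's actual retriever, which this page does not describe. It assumes a bag-of-words cosine similarity in place of a real embedding model, and the three-passage corpus, its IDs, and the names `tokenize`, `cosine`, and `retrieve` are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical corpus: IDs and passages are placeholders, not real papers.
CORPUS = {
    "doc_a": "training large language models consumes substantial electrical energy",
    "doc_b": "data center cooling systems use water and the amount depends on climate",
    "doc_c": "carbon accounting methods for machine learning differ between studies",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, top_k=2):
    """Return up to top_k (doc_id, score) pairs, best first, dropping zero-score passages."""
    q = Counter(tokenize(query))
    scored = [(doc_id, cosine(q, Counter(tokenize(text)))) for doc_id, text in corpus.items()]
    scored = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

hits = retrieve("how much water does data center cooling use", CORPUS)
print([doc_id for doc_id, _ in hits])  # ['doc_b']
```

Because every hit carries its document ID, whatever the generator writes next can be traced back to a specific source, and an empty hit list is a visible signal rather than a silent gap.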
System behaviors
Four outputs. Each designed with intent.
The system wasn't built to sound knowledgeable. It was built to be traceable. Every response behavior was a deliberate design choice about what responsible AI communication looks like in a high-stakes domain.
Behavior 1
Concise natural-language answers
Synthesizes retrieved document passages into a direct, readable response — no verbatim reproduction, no padding. The answer serves the question, not the retrieval machinery behind it.
Q
How much energy does training a large language model typically consume?
W
Training large language models is highly energy-intensive, with estimates ranging from hundreds to thousands of MWh depending on model scale, hardware, and data center efficiency. [doc_12, doc_27]
Behavior 2
Numeric estimates with units
When the corpus contains quantitative data, WattBot surfaces it with units, context, and the specific study it comes from. Numbers without provenance are not reported — preventing the false precision that makes generative AI unreliable in technical domains.
Q
What is the water footprint of a typical data center?
W
Studies report water usage effectiveness (WUE) values of 1.0–2.0 L/kWh for modern data centers, with older facilities significantly higher. Cooling method and climate drive wide variation. [doc_08, doc_31]
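One way to enforce "numbers without provenance are not reported" is to make the source IDs part of the value itself. The sketch below is an illustration under assumptions, not the system's real data model: `NumericEstimate` and `format_estimate` are hypothetical names, and the figures are copied from the sample answer above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class NumericEstimate:
    low: float
    high: float
    unit: str
    doc_ids: Tuple[str, ...]  # reference IDs of the studies reporting this range

def format_estimate(est: NumericEstimate) -> Optional[str]:
    """Render a range with units and citations, or None when unit or provenance is missing."""
    if not est.unit or not est.doc_ids:
        return None  # refuse to surface a bare number
    if est.low == est.high:
        value = f"{est.low:.1f}"
    else:
        value = f"{est.low:.1f}–{est.high:.1f}"
    return f"{value} {est.unit} [{', '.join(est.doc_ids)}]"

wue = NumericEstimate(1.0, 2.0, "L/kWh", ("doc_08", "doc_31"))
print(format_estimate(wue))  # 1.0–2.0 L/kWh [doc_08, doc_31]

unsourced = NumericEstimate(1.0, 2.0, "L/kWh", ())
print(format_estimate(unsourced))  # None
```

Returning `None` instead of a partially formatted string pushes the decision upstream: the caller has to either find a source or leave the number out.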
Behavior 3
Document reference IDs + supporting material
Every factual claim links to a document reference ID. Where the source paper contains directly relevant quotes, tables, or figures, those are surfaced alongside the answer — making the retrieval chain fully transparent and auditable.
Q
Which AI tasks are most carbon-intensive?
W
NLP model training dominates emissions estimates in the literature, particularly transformer pretraining at scale. [doc_04] Fig. 3 in that study shows a breakdown by task type and hardware generation.
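An auditable retrieval chain implies that every bracketed ID in an answer resolves to a document that was actually retrieved. A small check along those lines might look like the following. It assumes the `[doc_NN, doc_NN]` marker format seen in the sample answers; the function names are hypothetical and the real system may track citations differently.

```python
import re

# Matches markers such as [doc_04] or [doc_12, doc_27]; the format is assumed
# from the sample answers on this page.
CITATION = re.compile(r"\[((?:doc_\d+)(?:,\s*doc_\d+)*)\]")

def cited_ids(answer):
    """Return the distinct document IDs cited in an answer, in order of first appearance."""
    ids = []
    for group in CITATION.findall(answer):
        for doc_id in re.split(r",\s*", group):
            if doc_id not in ids:
                ids.append(doc_id)
    return ids

def unresolved_citations(answer, retrieved_ids):
    """Return cited IDs that do not correspond to any retrieved document."""
    return [doc_id for doc_id in cited_ids(answer) if doc_id not in retrieved_ids]

answer = "Estimates range from hundreds to thousands of MWh. [doc_12, doc_27]"
print(cited_ids(answer))                         # ['doc_12', 'doc_27']
print(unresolved_citations(answer, {"doc_12"}))  # ['doc_27']
```

A non-empty result from `unresolved_citations` means the answer cites something the retriever never returned, which is exactly the case an audit should catch.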
Behavior 4
Standardized fallback on missing evidence
When the retrieved documents don't contain sufficient evidence to answer a question, WattBot emits a consistent, explicit fallback — rather than generating a plausible-sounding approximation. Honesty about the limits of the corpus is a feature, not a failure.
Q
What is the total carbon footprint of AI inference globally in 2024?
W
The current corpus does not contain sufficient evidence to provide a reliable estimate for this question. No supporting documents were retrieved with adequate coverage of this specific claim.
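The fallback can be sketched as a gate in front of the generator. This page does not say how WattBot decides that evidence is insufficient, so the score threshold, the `min_hits` parameter, and the `generate` callable below are assumptions for illustration. The fallback text is the first sentence of the sample response above.

```python
FALLBACK = (
    "The current corpus does not contain sufficient evidence "
    "to provide a reliable estimate for this question."
)

def answer_or_fallback(hits, generate, min_score=0.3, min_hits=1):
    """Call generate only when enough retrieved passages clear the score threshold.

    hits: list of (doc_id, score) pairs from a retriever.
    generate: callable that turns the surviving hits into an answer string
              (a stand-in for the language model call).
    """
    strong = [(doc_id, score) for doc_id, score in hits if score >= min_score]
    if len(strong) < min_hits:
        return FALLBACK  # same wording every time, never an approximation
    return generate(strong)

def cite(strong_hits):
    """Toy generator that only lists the supporting IDs."""
    return "Answer grounded in " + ", ".join(doc_id for doc_id, _ in strong_hits)

print(answer_or_fallback([("doc_04", 0.82)], cite))  # Answer grounded in doc_04
print(answer_or_fallback([("doc_04", 0.10)], cite))  # prints the fallback text
print(answer_or_fallback([], cite))                  # prints the fallback text
```

Keeping the fallback as a single constant is what makes it standardized: downstream evaluation can detect it with an exact string match.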
Evaluation
Scored on what actually matters.
Most NLP benchmarks reward fluency. This challenge was different — WattBot was evaluated across four dimensions that together define what a trustworthy, evidence-grounded system actually looks like. That shift in evaluation criteria is itself a statement about responsible AI.
01
Fluency
Responses are readable, grammatically correct, and appropriately concise — necessary but not sufficient.
02
Retrieval precision
Retrieved documents are actually relevant to the query — not just semantically adjacent. The right papers must surface for the right questions.
03
Numerical accuracy
Numeric estimates match what the cited documents actually report, with correct units and appropriate context. No fabricated precision.
04
Citation faithfulness
Claims attributed to a document are actually supported by that document. The hardest dimension — and the most important one for policy use.
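Three of the four dimensions lend themselves to simple automated checks. The challenge's actual scoring formulas are not given here, so the definitions below are common stand-ins chosen for illustration: precision at k for retrieval, a unit-aware relative tolerance for numbers, and the share of cited IDs that appear in a gold set of supporting documents. All function names, the 1% tolerance, and the sample IDs are assumptions.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are in the relevant set."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)

def numeric_match(predicted, reported, pred_unit, rep_unit, rel_tol=0.01):
    """True when units agree exactly and values agree within a relative tolerance."""
    if pred_unit != rep_unit:
        return False  # no unit conversion is attempted in this sketch
    return math.isclose(predicted, reported, rel_tol=rel_tol)

def citation_faithfulness(cited, supporting):
    """Fraction of distinct cited IDs that are in the gold set of supporting documents."""
    cited_set = set(cited)
    if not cited_set:
        return 0.0
    return len(cited_set & set(supporting)) / len(cited_set)

print(precision_at_k(["doc_08", "doc_31", "doc_02"], {"doc_08", "doc_31"}, k=3))
print(numeric_match(1.8, 1.8, "L/kWh", "L/kWh"))                # True
print(numeric_match(1.8, 1.8, "L/kWh", "L"))                    # False
print(citation_faithfulness(["doc_04", "doc_09"], {"doc_04"}))  # 0.5
```

Note that the citation check only confirms that the cited ID is a supporting document. Verifying that the document's text supports the specific claim requires reading the passage, by a person or a separate model.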
Personal reflection
Behind the build
This project changed how I think about what it means for an AI system to be useful. Fluency is easy. Grounding is hard. Building WattBot made the distinction viscerally clear: a system that produces a confident, well-written answer with a fabricated number is not a useful system. It's a liability, especially in sustainability and policy contexts where decisions have real downstream consequences.
The standardized fallback was the design choice I'm most proud of. It would have been easy to let the model approximate — to generate a plausible estimate when the corpus came up short. Instead, the system emits an explicit, consistent signal that the evidence isn't there. That honesty is harder to build and harder to sell, but it's the only defensible choice when the stakes are real.
The broader lesson: responsible AI isn't a constraint you add at the end. It's an architectural decision you make at the start — about what the system is allowed to say, what it must admit it doesn't know, and how it communicates uncertainty. Those choices shape trust more than any benchmark score.