WattBot — Case Study

The problem

The data exists. Nobody can find it.

AI's environmental impact — energy consumption, water use, carbon emissions — is documented across peer-reviewed literature in machine learning, energy systems, and environmental science. But the research is fragmented, technical, and inaccessible to the practitioners and decision-makers who need it most. The gap isn't a lack of evidence. It's a retrieval problem.

The environmental reality

By one widely cited 2019 estimate, training a large NLP model with neural architecture search can emit as much CO₂ as five cars over their lifetimes

Data center water consumption for cooling is growing alongside compute demand

Inference at scale keeps adding to a deployed model's footprint for as long as the model is served

Pressure on energy grids is increasing as AI workloads concentrate geographically

Carbon accounting for AI remains inconsistent and largely unregulated

The access gap

Evidence buried across disciplines — no single source of truth

Numbers vary widely across papers, creating confusion rather than clarity

Practitioners lack time to conduct literature reviews for every decision

Policy discussions proceed without reliable, cited quantitative grounding

Generative AI fills gaps with confident estimates — not cited evidence

Why RAG

This is a retrieval problem. Not a generation problem.

A purely generative approach would produce fluent, confident answers about AI's environmental impact. It would also hallucinate specific numbers, conflate studies, and never admit when it doesn't know. For a domain where policy decisions depend on cited evidence, that failure mode is worse than no answer at all.

Purely generative approach

Produces confident estimates without citation — unverifiable

Cannot distinguish between well-supported and weakly-supported claims

Conflates numbers across different studies, models, and methodologies

No fallback — fills every gap with a generated approximation

Fluent but not faithful — the worst combination for policy contexts

RAG approach

Answers grounded in retrieved documents with traceable reference IDs

Numeric estimates tied to specific studies with stated methodologies

Supporting material surfaced: quotes, tables, figures from source papers

Standardized fallback emitted when corpus evidence is insufficient

Honest about uncertainty — the only acceptable stance in sustainability contexts
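To make the retrieval step concrete, here is a minimal sketch in Python. It is not WattBot's actual retriever: the two-passage corpus, the IDs doc_a and doc_b, and the bag-of-words cosine scoring are all assumptions chosen for illustration, and a production system would more likely use BM25 or dense embeddings. What it shows is that each candidate passage carries its reference ID from the moment it is retrieved, so the ID is still available when the answer is written.

```python
import math
import re
from collections import Counter

# Hypothetical corpus keyed by reference ID; the passage text is invented.
CORPUS = {
    "doc_a": "training energy use of large language models measured in MWh",
    "doc_b": "water usage effectiveness of data center cooling systems",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, corpus, k=1):
    """Return the top-k (doc_id, score) pairs, highest score first."""
    q = Counter(tokenize(query))
    scored = [
        (doc_id, cosine(q, Counter(tokenize(text))))
        for doc_id, text in corpus.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

top = retrieve("How much energy does training a language model use?", CORPUS)
print(top[0][0])  # reference ID of the best-matching passage
```

Keeping the score next to the ID also gives the later stages something to threshold on when deciding whether the evidence is strong enough to answer.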

System behaviors

Four outputs. Each designed with intent.

The system wasn't built to sound knowledgeable. It was built to be traceable. Every response behavior was a deliberate design choice about what responsible AI communication looks like in a high-stakes domain.

Behavior 1

Concise natural-language answers

Synthesizes retrieved document passages into a direct, readable response — no verbatim reproduction, no padding. The answer serves the question, not the retrieval machinery behind it.

Q

How much energy does training a large language model typically consume?

W

Training large language models is highly energy-intensive, with estimates ranging from hundreds to thousands of MWh depending on model scale, hardware, and data center efficiency. [doc_12, doc_27]

Behavior 2

Numeric estimates with units

When the corpus contains quantitative data, WattBot surfaces it with units, context, and the specific study it comes from. Numbers without provenance are not reported — preventing the false precision that makes generative AI unreliable in technical domains.

Q

What is the water footprint of a typical data center?

W

Studies report water usage effectiveness (WUE) values of 1.0–2.0 L/kWh for modern data centers, with older facilities significantly higher. Cooling method and climate drive wide variation. [doc_08, doc_31]
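One way to enforce the "no number without provenance" rule is to make it structural. The sketch below is an assumption about how that could look, not WattBot's internal data model: the class name NumericEstimate, its fields, and format_estimate are hypothetical. The values reuse the WUE range and one reference ID from the example above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class NumericEstimate:
    """A quantitative claim plus the metadata needed to report it responsibly."""
    low: float
    high: float
    unit: str
    ref_id: Optional[str]  # hypothetical field: ID of the source document

def format_estimate(est):
    """Render an estimate with unit and source, or None if either is missing."""
    if not est.unit or not est.ref_id:
        return None  # a number without a unit or a source is never reported
    if est.low == est.high:
        return f"{est.low:.1f} {est.unit} [{est.ref_id}]"
    return f"{est.low:.1f} to {est.high:.1f} {est.unit} [{est.ref_id}]"

sourced = NumericEstimate(1.0, 2.0, "L/kWh", "doc_08")
unsourced = NumericEstimate(1.0, 2.0, "L/kWh", None)
print(format_estimate(sourced))
print(format_estimate(unsourced))
```

Because the formatter returns None instead of a bare number, the calling code has to decide explicitly what to do when provenance is missing.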

Behavior 3

Document reference IDs + supporting material

Every factual claim links to a document reference ID. Where the source paper contains directly relevant quotes, tables, or figures, those are surfaced alongside the answer — making the retrieval chain fully transparent and auditable.

Q

Which AI tasks are most carbon-intensive?

W

NLP model training dominates emissions estimates in the literature, particularly transformer pretraining at scale. [doc_04] Fig. 3 in that study shows a breakdown by task type and hardware generation.
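A simple audit that fits this behavior is checking that every ID cited in an answer was actually among the retrieved documents. The sketch below assumes the bracketed doc_NN format seen in the examples. The function names are hypothetical, and it does not verify that the cited document supports the claim, only that the citation points at something the system retrieved.

```python
import re

BRACKET_GROUP = re.compile(r"\[([^\]]*)\]")  # matches e.g. "[doc_12, doc_27]"
REF_ID = re.compile(r"doc_\d+")              # ID pattern assumed from the examples

def cited_ids(answer):
    """Return every reference ID found inside square brackets, in order."""
    ids = []
    for group in BRACKET_GROUP.findall(answer):
        ids.extend(REF_ID.findall(group))
    return ids

def unretrieved_citations(answer, retrieved_ids):
    """Return cited IDs that are not in the set of retrieved document IDs."""
    return [ref for ref in cited_ids(answer) if ref not in retrieved_ids]

answer = "Estimates range from hundreds to thousands of MWh. [doc_12, doc_27]"
print(cited_ids(answer))
print(unretrieved_citations(answer, {"doc_12"}))
```

An empty result from unretrieved_citations is a necessary condition for a traceable answer, not a sufficient one. Whether the document really says what the answer claims is a separate check.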

Behavior 4

Standardized fallback on missing evidence

When the retrieved documents don't contain sufficient evidence to answer a question, WattBot emits a consistent, explicit fallback — rather than generating a plausible-sounding approximation. Honesty about the limits of the corpus is a feature, not a failure.

Q

What is the total carbon footprint of AI inference globally in 2024?

W

The current corpus does not contain sufficient evidence to provide a reliable estimate for this question. No supporting documents were retrieved with adequate coverage of this specific claim.
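The fallback can be sketched as a gate that sits between retrieval and generation. Everything below is illustrative: the thresholds min_score and min_passages, the function names, and the stub generator are assumptions, since the case study does not say how WattBot decides that evidence is insufficient. The fallback string reuses the first sentence of the example above.

```python
# Fallback wording taken from the example above; the constant name is hypothetical.
FALLBACK = (
    "The current corpus does not contain sufficient evidence to provide "
    "a reliable estimate for this question."
)

def answer_or_fallback(scored_passages, generate, min_score=0.3, min_passages=1):
    """Generate only when enough passages clear the score threshold.

    scored_passages: list of (doc_id, text, score) tuples from a retriever.
    generate: callable that turns the surviving (doc_id, text) pairs into an answer.
    min_score, min_passages: assumed thresholds, to be tuned on held-out questions.
    """
    strong = [
        (doc_id, text)
        for doc_id, text, score in scored_passages
        if score >= min_score
    ]
    if len(strong) < min_passages:
        return FALLBACK  # same wording every time, so callers can detect it
    return generate(strong)

def stub_generate(passages):
    """Stand-in for the language model call: lists the supporting IDs."""
    return "Answer grounded in " + ", ".join(doc_id for doc_id, _ in passages)

weak = [("doc_x", "loosely related text", 0.12)]
strong = [("doc_y", "directly relevant text", 0.81)]
print(answer_or_fallback(weak, stub_generate))
print(answer_or_fallback(strong, stub_generate))
```

Placing the gate before generation means the model is never asked to answer from weak evidence, which is simpler to reason about than filtering a fluent answer afterwards.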

Evaluation

Scored on what actually matters.

Most NLP benchmarks reward fluency. This challenge was different — WattBot was evaluated across four dimensions that together define what a trustworthy, evidence-grounded system actually looks like. That shift in evaluation criteria is itself a statement about responsible AI.

01

Fluency

Responses are readable, grammatically correct, and appropriately concise — necessary but not sufficient.

02

Retrieval precision

Retrieved documents are actually relevant to the query — not just semantically adjacent. The right papers must surface for the right questions.

03

Numerical accuracy

Numeric estimates match what the cited documents actually report, with correct units and appropriate context. No fabricated precision.

04

Citation faithfulness

Claims attributed to a document are actually supported by that document. The hardest dimension — and the most important one for policy use.
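Three of these dimensions reduce to simple computations once labels exist. The sketch below shows one plausible way to score them. It is not the challenge's official scoring code, and the function names, the 1% tolerance, and the sample values are assumptions. Citation faithfulness in particular needs a human or model judgment per claim, so the function here only aggregates those judgments.

```python
import math

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant_ids) / len(top)

def numeric_match(predicted, reference, pred_unit, ref_unit, rel_tol=0.01):
    """True when units agree exactly and values agree within a relative tolerance."""
    return pred_unit == ref_unit and math.isclose(predicted, reference, rel_tol=rel_tol)

def citation_faithfulness(support_labels):
    """Share of (claim, cited document) pairs judged as supported.

    support_labels: list of booleans, one per cited claim, produced by a
    human or model judge. The judging step itself is outside this sketch.
    """
    if not support_labels:
        return 0.0
    return sum(support_labels) / len(support_labels)

# Hypothetical labels and values, for illustration only.
p = precision_at_k(["doc_04", "doc_09", "doc_12"], {"doc_04", "doc_12"}, k=3)
print(round(p, 3))
print(numeric_match(1.8, 1.8, "L/kWh", "L/kWh"))
print(numeric_match(1.8, 1.8, "L/kWh", "L/MWh"))
print(citation_faithfulness([True, True, False, True]))
```

Treating a unit mismatch as a failed match, even when the digits agree, reflects the "correct units" requirement: 1.8 L/kWh and 1.8 L/MWh are different claims.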

Personal reflection

Behind the build

This project changed how I think about what it means for an AI system to be useful. Fluency is easy. Grounding is hard. Building WattBot made the distinction viscerally clear: a system that produces a confident, well-written answer with a fabricated number is not a useful system — it's a liability. Especially in sustainability and policy contexts where decisions have real downstream consequences.

The standardized fallback is the design choice I'm most proud of. It would have been easy to let the model approximate, generating a plausible estimate whenever the corpus came up short. Instead, the system emits an explicit, consistent signal that the evidence isn't there. That honesty is harder to build and harder to sell, but it's the only defensible choice when the stakes are real.

The broader lesson: responsible AI isn't a constraint you add at the end. It's an architectural decision you make at the start — about what the system is allowed to say, what it must admit it doesn't know, and how it communicates uncertainty. Those choices shape trust more than any benchmark score.

AI is transforming the world. It's also quietly reshaping the planet.
