Analytic Hierarchy Process (AHP): Structured alternative to AI decisions


The quiet problem: we are outsourcing our decisions to language models

Large language models are no longer just tools for drafting text. They have quietly become decision-making partners. Microsoft’s Work Trend Index reported that nearly half of Copilot usage involves decision-making or activities surrounding it — weighing options, summarising trade-offs, suggesting next steps. The model doesn’t just write the email; it picks which message to send.

This shift matters because of something psychologists call automation bias. LLMs speak with fluency and apparent confidence. They produce coherent, well-structured prose that feels like the output of a thoughtful expert. That fluency hijacks our skepticism. Formally, the human still decides. In practice, the decision often collapses into whatever the model said first.

The research is starting to catch up with the intuition. A 2025 study by Gerlich linked heavy reliance on AI assistants to cognitive offloading — measurably lower engagement of critical thinking, with users feeling overwhelmed and defaulting to the model’s framing. A 2025 paper in Nature found that delegating decisions to AI systems can shift moral norms: people behave differently, and judge actions differently, when an algorithm is in the loop. According to Reuters, an Ipsos BVA survey from 2026 found that nearly half of young Europeans aged 11–25 turn to chatbots with personal and emotional problems. We aren’t just offloading spreadsheet decisions. We’re offloading the human ones.

What makes this insidious is that LLMs are stepping into the parts of the decision process that precede the formal choice: selecting which information matters, interpreting it, assessing risk, generating options, and often recommending the final answer. By the time the human “decides,” the decision has effectively already been made. The outward shape of the process is intact — a person weighed things and chose — but the substance has migrated.

Three patterns capture how this plays out in practice. Rubber-stamping is the most visible: the human formally approves a decision but doesn’t reconstruct the reasoning behind it; the approval becomes a ritual. Decision homogenisation is subtler — millions of people consulting similar models receive similar interpretive frames, and diversity of thought quietly flattens toward whatever the median training corpus suggested. The third, and probably the most dangerous, is the illusion of understanding: LLMs are exceptionally good at producing the feeling of having thought something through. A user can walk away convinced they reasoned carefully when in reality they imported a ready-made frame.

So the interesting question isn’t should we use LLMs for decisions? — that ship has sailed. The question is: how do we use them in a way that preserves human judgment instead of quietly replacing it?

What decision-making actually is

Before talking about how to use AI well, it’s worth being precise about what we’re using it for. A decision is an act of choice. It draws on information from the past and present, but its consequences always live in the future. The decision-making process typically unfolds in four phases: identifying the problem, working out possible solutions, selecting one, and putting it into practice. A decision isn’t real until it’s enacted.

Real decisions are made under constraints. Herbert Simon’s foundational work made the point that decision-makers almost never have complete information about their alternatives, and even when they do, the human mind has limited capacity to analyse, understand, and remember it all. Simon called this bounded rationality. We don’t optimise — we satisfice. We search until we find an option that’s good enough, then stop. There’s no guarantee the optimal choice was even in the set we considered.

This is where decision theory enters. The discipline is built on two ideas: preferences and prospects (or options). When we say an agent prefers option A to option B, we mean they judge A to be more desirable or choice-worthy.

Preference is inherently comparative. It’s a relation between options, not a property of one option in isolation.

This is easy to skim past, but it’s the foundation of everything that follows. You don’t “have a preference for coffee” in the strict sense; you have a preference for coffee over tea, or over nothing. Every preference is a comparison, even when one side of the comparison is implicit. This is why decision theory builds everything from pairs of options rather than absolute scores. It’s also why, later on, pairwise comparison turns out to be such a natural building block.

A rational preference ordering is typically required to satisfy two axioms:

  • Completeness - for any two options, the agent can say which is at least as good, or that they’re equally good.
  • Transitivity - if B is at least as good as A, and C is at least as good as B, then C is at least as good as A.

Transitivity is the more interesting of the two, and it’s worth seeing why. Imagine you’re picking a phone. You prefer Phone B to Phone A because B has a better camera. You prefer Phone C to Phone B because C has longer battery life. Transitivity says: you should therefore prefer C to A. If instead you find yourself preferring A to C — maybe because, comparing them directly, A’s screen looks nicer — your preferences cycle: A < B < C < A. Each individual comparison feels reasonable, but taken together they don’t form a coherent ranking.

Why is that a problem? The classical answer is the money pump argument. Suppose you own A. I offer to swap A for B plus a small fee, since you prefer B. You agree. Then I offer to swap B for C plus a small fee, since you prefer C. You agree. Then I offer to swap C for A plus a small fee, since, comparing them directly, you prefer A. You agree, and you’re back where you started, three fees poorer. I can repeat this indefinitely. Intransitive preferences are exploitable preferences. A coherent ranking, however small the differences, protects you from this.
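A toy sketch makes the exploitability mechanical (the option names, starting wealth, and fee are illustrative):

```python
# Toy money pump: with the cyclic preferences B > A, C > B, A > C,
# an agent who always pays a small fee to trade up to an option they
# prefer ends up holding the phone they started with, strictly poorer.

fee = 1.0
wealth = 100.0
holding = "A"

prefers_over = {"A": "B", "B": "C", "C": "A"}   # what the agent trades up to

for _ in range(3):                  # one full lap around the cycle
    holding = prefers_over[holding]
    wealth -= fee

print(holding, wealth)              # -> A 97.0: same phone, three fees poorer
```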

To work with preferences mathematically, we need a way to turn this kind of ranking into numbers. That’s what a utility function is: an assignment of numbers to options such that more preferred options get higher numbers. The function doesn’t tell you what to want — it’s a mathematical record of what you’ve already said you want.

There are two flavours of utility function, and the difference matters.

An ordinal utility function records only the order. If you prefer Paris to Berlin to Rome, an ordinal utility might assign Paris = 3, Berlin = 2, Rome = 1. But it could equally assign Paris = 100, Berlin = 99, Rome = 1 — the ordering is preserved, so it’s the same ordinal utility for these purposes. The numbers carry no information about how much more you prefer one option to another. This is fine for picking the top of a list, but useless if you need to reason about trade-offs, risk, or expected value. You can’t meaningfully average ordinal utilities.

A cardinal utility function records the distances too. The gaps between numbers are meaningful: if Paris = 10, Berlin = 8, Rome = 2, that tells us your jump from Rome to Berlin is much larger than your jump from Berlin to Paris. The classical way to build cardinal utilities is through indifference points under uncertainty. Imagine you’re offered a coin flip: heads you get Paris, tails you get Rome. Would you take this gamble, or would you accept Berlin for sure? If you’re roughly indifferent at a coin flip, Berlin sits halfway between Rome and Paris on your scale. If you’d only take the gamble at 75% Paris / 25% Rome, Berlin sits three-quarters of the way up. By varying the probabilities until you reach indifference, we pin down exactly where Berlin lives between Rome and Paris in terms of how much you value it. Do this for every option, and you have a cardinal utility function.
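The calibration itself is a one-line formula. A minimal sketch, where the 0 and 10 endpoints are an arbitrary choice of scale and everything else follows the example above:

```python
# Pin down an option's cardinal utility from an indifference probability.
# If you are indifferent between "Berlin for sure" and a gamble giving
# Paris with probability p and Rome with probability 1 - p, then
#     u(Berlin) = p * u(Paris) + (1 - p) * u(Rome).

U_ROME, U_PARIS = 0.0, 10.0   # endpoints of the scale are an arbitrary choice

def utility_of_sure_option(p: float, u_worst: float = U_ROME, u_best: float = U_PARIS) -> float:
    """Cardinal utility implied by indifference at probability p of the best outcome."""
    return p * u_best + (1 - p) * u_worst

print(utility_of_sure_option(0.5))    # indifferent at a coin flip -> 5.0, halfway up
print(utility_of_sure_option(0.75))   # indifferent at 75% Paris   -> 7.5, three-quarters up
```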

The point of this detour is simple: humans have spent decades building rigorous, mathematically grounded tools for making good decisions. These tools have known properties, known limitations, and known failure modes. LLMs do not. When we let a language model frame our choices, we trade a method whose biases we understand for one whose biases we don’t.

Using LLMs well: appropriate reliance, not overreliance

The literature on AI decision support draws a sharp line between overreliance and appropriate reliance. Overreliance is the rubber-stamp pattern: the model decides, the human signs. Appropriate reliance is something else entirely — the model helps the human see options, surface assumptions, and stress-test reasoning, but the human stays in the driver’s seat.

Reviews of AI decision-support research converge on a few design principles for getting this right:

  • The system should reinforce user autonomy, not erode it.
  • It should calibrate trust — making the user appropriately skeptical of weak outputs and appropriately confident in strong ones.
  • It should keep the user in the role of active decision-maker, not passive approver.

The natural conclusion: where possible, use mathematically validated methods to structure the decision itself, and use the LLM in a supporting role. The math is deterministic and inspectable. The LLM is fluent and useful, but its biases are unknown and its confidence is unwarranted. Combining them well means letting each do what it’s actually good at.

AHP: a structured method for deciding in line with your own preferences

The Analytic Hierarchy Process (AHP), developed by Thomas Saaty in the 1970s, is one of the cleanest examples of this kind of structured method. Its core mechanic is pairwise comparison: instead of trying to score every option directly on every criterion, you compare two things at a time and say which you prefer, and by how much.

Pairwise comparison is, at its heart, the construction of a utility function.

That’s worth pausing on. Everything we said earlier about preferences being comparative, and about utility functions being numerical records of preference, comes back here. AHP doesn’t ask you to invent scores out of thin air — it asks you to do the one thing decision theory says preference actually is: compare two options at a time. The numerical utility comes out of those comparisons, not before them.

Concretely, AHP works in four steps, and it’s worth walking through them because the structure is doing real work.

1. Decompose the problem into a hierarchy. At the top is your goal (say: “choose a laptop”). Below it sit the criteria you care about (price, performance, battery, weight). At the bottom sit the alternatives (Laptop X, Y, Z). This step alone forces a clarity most decisions never get.

2. Pairwise-compare the criteria. For each pair, you say how much more important one is than the other, on Saaty’s 1–9 scale (1 = equally important, 3 = moderately more important, 5 = strongly more important, up to 9 = extremely more important). Is price more important than battery? Three times more? Five times? You answer this for every pair, filling in a matrix.

3. Pairwise-compare the alternatives under each criterion. Now, for each criterion separately, you compare the alternatives the same way. Under “battery,” how much better is Laptop X than Laptop Y? Under “price,” how much better is Y than Z? Each criterion gets its own comparison matrix.

4. Aggregate. AHP applies an eigenvector calculation to each matrix to extract a weight vector — essentially, the relative importance scores implied by your pairwise judgments. Multiplying each alternative’s weight under a criterion by that criterion’s weight, then summing across criteria, gives you a final score for each alternative. The highest score is the AHP-recommended choice.

The step that does the most underrated work is the consistency check, which runs alongside step 2 and step 3. Here’s why it matters. If your pairwise judgments are perfectly transitive — say you said price is 2× battery, battery is 3× weight, so price should be 6× weight — your matrix will be mathematically consistent. But humans don’t reason that cleanly. You might also have said price is only 4× weight, contradicting the 6× implied by the chain. AHP computes a consistency ratio that quantifies how much your judgments deviate from perfect transitivity. Saaty’s rule of thumb is that a ratio below 0.1 is acceptable; above that, you should revisit your comparisons. The check doesn’t tell you which judgment is wrong — that’s your call — but it tells you that something in your stated preferences doesn’t cohere, and gives you a chance to fix it before the final ranking comes out. This is the money pump check from earlier, made operational.
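To make the mechanics concrete, here is a minimal NumPy sketch of the weight and consistency calculation (not the Skill's actual solver; the matrix encodes the price, battery and weight judgments from the example above, including the mildly inconsistent 4× entry):

```python
import numpy as np

# Pairwise matrix over three criteria: price, battery, weight.
# Judgments from the example: price = 2x battery, battery = 3x weight,
# but price = only 4x weight (the chain implies 6x), so the matrix is
# slightly inconsistent. Lower-left cells are the reciprocals.
A = np.array([
    [1.0, 2.0, 4.0],
    [1/2, 1.0, 3.0],
    [1/4, 1/3, 1.0],
])

# Principal eigenvector -> relative weights.
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
w = np.abs(eigvecs[:, k].real)
w /= w.sum()

# Consistency: CI = (lambda_max - n) / (n - 1), CR = CI / RI(n).
n = A.shape[0]
lam_max = eigvals.real[k]
CI = (lam_max - n) / (n - 1)
RI = {3: 0.58, 4: 0.90, 5: 1.12}[n]   # Saaty's random index values
CR = CI / RI

print("weights:", np.round(w, 3))     # importance of price, battery, weight
print("CR:", round(CR, 3))            # ~0.02 here: the 4x-vs-6x tension is mild,
                                      # so this matrix still passes the 0.1 threshold
```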

The crucial property is this: AHP forces the user to articulate their own preferences, criterion by criterion, comparison by comparison. The output reflects your values, weighted by your judgments. The method doesn’t tell you what to want — it tells you what choice your wants imply, and flags it when your wants don’t quite add up.

Compare that to asking an LLM “which option should I choose?” The LLM will produce a confident answer drawn from a frame you didn’t construct, weighted by criteria you didn’t specify, reflecting biases you can’t audit. AHP makes the same problem solvable in a way you can actually verify.

It’s also worth distinguishing AHP from heuristic approaches like brainstorming, lateral thinking, or the Delphi method. Heuristics are useful — sometimes essential — but they’re explicitly approximate. An algorithm, by contrast, is a precise recipe. AHP is an algorithm. It will give you the same answer for the same inputs, and you can inspect every step.

A Claude Skill for AHP

The natural next step is operational: package AHP as a tool the LLM can invoke, rather than something the LLM tries to do in its head. A Skill (in the Claude sense — a folder of instructions and scripts the model loads when relevant) is a clean vehicle for this. The structure is small but deliberate:

ahp-decision/
├── SKILL.md                          ← main workflow (7 steps)
├── scripts/
│   └── ahp_solver.py                 ← eigenvector, CR, aggregation, sensitivity
└── references/
    ├── saaty_scale.md                ← mapping natural language → 1–9 scale
    ├── converting_data.md            ← handling hard data (price, battery)
    └── example_walkthrough.md        ← full conversation example (laptop choice)

Each piece has a specific job, and the separation matters more than it might look at first.

SKILL.md is the orchestration layer — the document Claude reads when the Skill activates. It defines the seven-step workflow the model walks the user through:

  1. Frame the decision. Confirm what the user is actually choosing between, and why. Push back on vague framings (“I want a better job” → “compared to what specifically?”).
  2. Identify alternatives. Get a concrete, finite set of options on the table. Help the user surface candidates they may have dismissed too early, but never invent alternatives for them.
  3. Identify criteria. Help the user articulate what they care about. This is where the LLM’s breadth genuinely helps — surfacing dimensions the user may have forgotten — but the final list is the user’s call.
  4. Pairwise-compare the criteria. Walk through every pair, in natural language, mapping the answers onto Saaty’s scale via saaty_scale.md.
  5. Pairwise-compare the alternatives under each criterion. Same mechanic, one criterion at a time. Where hard data exists (price in złoty, battery in hours), defer to converting_data.md rather than asking for subjective comparisons.
  6. Compute and report. Hand the matrices to ahp_solver.py. Surface the ranking, the criterion weights, and the consistency ratio. If CR > 0.1, flag the most inconsistent judgments and offer to revisit them.
  7. Sensitivity check. Show how robust the ranking is — what happens if a criterion weight shifts by 10%? If the top choice flips easily, the decision is less settled than the number suggests.

The seven steps are written as instructions to Claude, not as a script the user sees. The user just talks; the structure happens around them.

scripts/ahp_solver.py is the deterministic core. It does the math the LLM should never do in its head: building the comparison matrices, computing the principal eigenvector for each (which gives the weight vector), calculating the consistency ratio against Saaty’s random index values, aggregating across criteria, and running sensitivity analysis. This is the part the LLM runs but does not reason about. The point of pulling it into a script isn’t speed — it’s auditability. A reviewer can read 80 lines of Python and verify exactly what produced the recommendation. They cannot do the same with an LLM’s internal computation.
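For the sensitivity part specifically, a minimal sketch of the idea (the function name, the ±10% default, and the example numbers are assumptions for illustration, not the script’s actual interface):

```python
import numpy as np

def rerank_after_perturbation(crit_weights, alt_scores, idx, delta=0.10):
    """Shift one criterion's weight by +/-delta, renormalise, and re-rank.

    crit_weights: (n_criteria,) weights from the criteria comparison matrix.
    alt_scores:   (n_alternatives, n_criteria) weight of each alternative
                  under each criterion.
    Returns the rankings (best first) for the + and - perturbations.
    """
    rankings = []
    for sign in (+1, -1):
        w = crit_weights.copy()
        w[idx] *= 1 + sign * delta
        w = w / w.sum()              # weights must still sum to 1
        totals = alt_scores @ w      # same aggregation as step 4
        rankings.append(np.argsort(totals)[::-1])
    return rankings

crit_w = np.array([0.56, 0.32, 0.12])        # e.g. price, battery, weight
alts   = np.array([[0.50, 0.30, 0.20],       # Laptop X under each criterion
                   [0.30, 0.50, 0.30],       # Laptop Y
                   [0.20, 0.20, 0.50]])      # Laptop Z
print(rerank_after_perturbation(crit_w, alts, idx=0))
# If the top index changes under a 10% shift, the ranking is fragile.
```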

references/saaty_scale.md addresses a real translation problem. Users don’t think in Saaty numbers. They say things like “battery matters a lot more than weight, but not insanely more.” This file maps natural-language intensities onto the 1–9 scale consistently: equally important → 1, slightly → 2, moderately → 3, moderate to strong → 4, strongly → 5, and so on. Without this, the LLM would silently improvise the mapping and the same user phrasing could end up as different numbers in different sessions. With it, the translation is stable and inspectable.
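A sketch of what that mapping amounts to in code: the labels for 1–5 follow the description above, while the 6–9 labels are filled in from Saaty’s standard scale and are an assumption about how the file continues.

```python
# Natural-language intensity -> value on Saaty's 1-9 scale.
# 1-5 as described above; 6-9 assumed to follow the standard scale.
SAATY_SCALE = {
    "equally important":             1,
    "slightly more important":       2,
    "moderately more important":     3,
    "moderate to strong":            4,
    "strongly more important":       5,
    "strong to very strong":         6,
    "very strongly more important":  7,
    "very strong to extreme":        8,
    "extremely more important":      9,
}
```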

references/converting_data.md handles the cases where a criterion has objective measurements. If you’re comparing laptops on price, you have actual prices — there’s no reason to ask “how much cheaper does X feel than Y?” This file specifies how to convert hard data into pairwise ratios directly (typically by normalising the values and forming ratios), bypassing subjective judgment for the dimensions where it would just add noise. The Skill uses this whenever the user supplies real numbers, falling back to elicited comparisons only for genuinely subjective criteria.
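One common way to do that conversion, sketched here (the helper name and the example figures are illustrative, not taken from the file):

```python
# Build a reciprocal pairwise matrix directly from measured values.
# For "higher is better" criteria (battery hours) use the raw ratio;
# for "lower is better" criteria (price) invert it, so the cheaper
# option comes out preferred. A real implementation might also cap
# ratios at 9 to stay on Saaty's scale.

def pairwise_from_data(values, higher_is_better=True):
    n = len(values)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            ratio = values[i] / values[j]
            matrix[i][j] = ratio if higher_is_better else 1.0 / ratio
    return matrix

battery_hours = [18, 12, 9]           # Laptop X, Y, Z (illustrative)
prices_pln    = [6500, 4800, 3900]    # price in zloty, lower is better

print(pairwise_from_data(battery_hours))
print(pairwise_from_data(prices_pln, higher_is_better=False))
```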

references/example_walkthrough.md is a full worked conversation — a user choosing between three laptops, from initial framing through final ranking. It’s there because abstract instructions don’t always tell Claude how the conversation should feel: where to slow down, where to push back, how to phrase the comparison questions so they don’t sound like an interrogation. The walkthrough is the tonal anchor.

The division of labour across these files mirrors the division of labour across the whole approach: the LLM handles language, exploration, and clarification; the math handles the decision; the references handle translation between the two; the user supplies the preferences. No part of the actual choice gets quietly absorbed into the model’s latent priors.

A web app with a real UI

A chat interface is fine for exploring the idea, but pairwise comparison is fundamentally a UI problem. Asking “how much do you prefer A to B?” twenty times in a conversation is tedious. Doing it with sliders, a visible matrix, and a live-updating consistency indicator is much better.

A companion web app would offer:

  • Visual matrix entry — sliders or 1–9 scale buttons for each pairwise comparison, with the reciprocal cell populated automatically.
  • Live consistency feedback — a visible indicator that updates as the user enters comparisons, highlighting which judgments are creating the most inconsistency.
  • Sensitivity analysis — show how the final ranking shifts if a single criterion’s weight changes. This is where users often discover that their decision is more robust (or more fragile) than they thought.
  • An optional AI sidebar — for suggesting criteria, explaining the math, or helping articulate trade-offs, but visually and functionally separate from the comparison process itself.

The design principle throughout: keep the human’s preferences central and visible, and keep the AI’s role bounded and explicit.

Where this approach falls short

AHP is not a silver bullet, and pretending otherwise would repeat the exact error this article is arguing against.

  • Criterion selection is still subjective. AHP weights the criteria you give it. If you forget a relevant dimension, no amount of mathematical rigour will surface it. This is precisely where an LLM’s breadth can help — but also precisely where its blind spots can hide.
  • The 1–9 scale is a modelling assumption. Saaty’s scale is plausible but not the only choice, and small changes in the scale can shift outcomes.
  • Rank reversal. Adding or removing an alternative can sometimes change the ranking of the remaining options. There are AHP variants that address this, but it remains a real critique.
  • Many decisions aren’t decomposable. Emotional, creative, or deeply contextual choices may resist the kind of clean hierarchical structuring AHP requires. Forcing them into the framework can give a false sense of rigour.
  • Garbage in, garbage out. If the user’s pairwise judgments are themselves shaped by an LLM’s framing, AHP just launders that influence through a mathematical filter.

Alternatives worth knowing about include MAUT (multi-attribute utility theory), TOPSIS (ranking by distance from an ideal solution), ELECTRE and PROMETHEE (outranking methods), and Bayesian decision analysis for problems where uncertainty dominates. Each has its niche. AHP’s main appeal is that it is simple enough to actually use and rigorous enough to actually trust.

Where this could go

A few directions feel genuinely promising:

  • Group AHP. Aggregating preferences across multiple stakeholders, with the method exposing where disagreement is concentrated rather than hiding it under an average.
  • Hybrid LLM-AHP workflows. Using the model not just to suggest criteria but to challenge them — actively generating counter-arguments, devil’s-advocate alternatives, and overlooked considerations, while the AHP structure keeps the user’s preferences intact.
  • Decision archives. Storing past AHP analyses so users can see how their preferences and frames have shifted over time. This is the kind of metacognition LLMs alone don’t encourage.
  • Calibration tooling. Comparing predicted satisfaction from AHP outputs against post-decision outcomes, helping users learn where their stated preferences and their actual preferences diverge.

Closing thoughts

The genuine risk of LLMs in decision-making isn’t that they’ll give us bad answers. It’s that they’ll give us plausible answers, fluently, and we’ll forget that an answer plausibly framed is not the same as a decision properly made.

The way out isn’t to refuse the tool. It’s to use it in the place it actually belongs — as a fluent collaborator in framing, exploring, and stress-testing — while keeping the act of choosing inside a structure we can inspect. AHP isn’t the only such structure, but it’s a good one: simple enough to use, rigorous enough to trust, and transparent enough that the decision still belongs to the person making it.

A good decision-support system should leave you understanding your own preferences better than you did before you started. If you walk away from a tool feeling that it decided, the tool failed — no matter how good the recommendation was.
