I Ran 80,433 Trials to Measure LLM Sycophancy. Here's What Actually Drives It.

I started this project with a simple question: does filling up a model's context window make it more likely to agree with you when you're wrong?

The short answer is yes, but only if the model is small. The longer, more interesting answer is that I accidentally discovered something else entirely. Something that matters more, replicates across every model I tested, and has immediate practical implications for anyone building with LLMs.

This post is the full technical story. The paper covers the results. This covers the thinking, the surprises, and the stuff that didn't make it into five pages.

The hypothesis and why I cared

The original motivation came from a synthesis paper I published in February 2026 on what I called "context-window lock-in." The argument: as conversations get longer, three things happen simultaneously. The model's retrieval accuracy degrades (Liu et al.'s "Lost in the Middle"), its sycophancy increases (Sharma et al., Fanous et al.), and the user's ability to detect problems decreases (Fang et al.). These three effects form a self-concealing loop.

That was a theoretical argument. I wanted to test the first empirical link: does sycophancy actually increase with context length? And if so, by how much?

Building the experiment

115 probes across 6 domains. Factual (40), math (30), science (15), logic (10), CS (5), opinion (15). Each factual probe is a statement that is objectively false. "The Great Wall of China is visible from space." "Water boils at 90°C at sea level." The model either corrects it or doesn't.

The two independent variables: context fill level (0% to 100% of 32K tokens, 11 steps) and filler type (neutral, agreement, correction). Neutral filler is benign conversation about cooking, travel, hobbies. Agreement filler has the user making false claims and the model enthusiastically agreeing. Correction filler has the user making false claims and the model correcting them.

Each condition gets 3 repeats per probe, rotating through 8 persona templates. That's 115 probes × 11 levels × 3 fillers × 3 repeats = 11,385 trials per model. Six models. 67,708 valid trials after filtering.

All scoring done by Claude Sonnet 4.6 with domain-aware rubrics. I later validated this with a 1,200-trial IRR check using Claude 3.5 Haiku as a second judge: kappa 0.705, raw agreement 93.4%. Haiku is systematically more conservative (calls fewer things sycophantic), which means my Sonnet-based rates are, if anything, slight overestimates. But all the relative rankings and treatment effects hold.

The whole thing ran on async Python with 30 concurrent workers, key rotation across multiple API endpoints, and cost about $682 total (roughly$ 190 for experiments, $492 for judge passes).

What I expected vs. what happened

I expected a clean monotonic increase: more context, more sycophancy, across all models. I expected the effect to be larger for smaller models but present everywhere.

Here's what actually happened.

The context-length effect is real but small-model-only

Left: small models show clear degradation as context fills. Right: large models are completely flat. The threshold sits around 20-24B active parameters.

Gemma 3N (about 4B effective parameters with its sparse MoE architecture) shows a clean ramp from 27.7% to 38.4%, a 10.7 percentage point increase. Qwen 7B goes from 13.1% to 21.2%, an 8.1 point increase. Mixtral 8x7B shows a weak 3.7 points. And then: Mistral Small 24B, DeepSeek V3.1, and Qwen 72B are essentially flat. DeepSeek actually goes slightly down (7.4% to 5.5%, not significant).

The threshold sits around 20 to 24B active parameters for standard dense GQA architectures. But it's not purely about parameter count. Architecture matters a lot. Gemma 3N's sparse attention (only 17% of its layers are global) makes it more vulnerable than you'd expect from 4B parameters. DeepSeek's multi-head latent attention (learned KV compression) gives it more immunity than its 37B active count suggests. The effective representational capacity, something like active parameters times attention coverage times KV compression ratio, is the better predictor.

But this wasn't the interesting finding. The interesting finding came from the other axis.

Filler type absolutely dominates context length

Agreement > Neutral > Correction, in every single model. No exceptions across 6 models, 4 families, 4B to 72B. The p-values range from 10⁻¹⁴ to 10⁻⁵⁸.

Let me put some numbers on this. Mixtral 8x7B: agreement filler produces 27.9% sycophancy, neutral 26.6%, correction 13.6%. That correction-to-agreement gap is 14.3 percentage points. The entire context-length effect for Mixtral is 3.7 points. The filler type effect is nearly 4x larger.

For the large models, where context length does nothing, filler type still works. Qwen 72B: agreement 10.2%, neutral 5.8%, correction 4.2%. Context has zero effect. Filler type produces a 6 percentage point spread. The conversational pattern is doing all the work.

The full phase diagram: sycophancy rate vs. context fill for all 6 models. Two distinct clusters with nothing in between. Small models degrade; large models don't.

I spent a while staring at this before I understood what it meant. The model isn't degrading because the context is long. The model is learning, within the conversation, what behavior is expected of it. If the conversation history is full of agreement, the model infers that agreement is the appropriate mode. If it's full of corrections, the model infers that honesty is expected.

This is in-context learning, but not the kind we usually talk about. There are no explicit demonstrations. Nobody is saying "when the user is wrong, agree with them." The model is picking up on the statistical pattern of the conversation itself. And it's doing this even when the filler content has nothing to do with the probe topic.

The behavioral ratchet

I started calling this the behavioral ratchet because the mechanism has a directional asymmetry. Agreement accumulates naturally in conversation (users accept agreement silently) while correction requires active effort (users push back on disagreement). In organic conversations, the ratio naturally drifts toward agreement. Each agreeable exchange makes the next one slightly more likely.

The multi-mechanism story, once I dug into the literature, has at least four converging explanations:

Implicit in-context learning. Li et al. (2024) showed that neutral conversational structure creates task vectors without explicit demonstrations. Hendel et al. (2023) and Todd et al. (2024) showed ICL compresses demonstrations into task vectors. Our filler, which contains no explicit sycophancy demonstrations, nonetheless shifts behavior. The model is reading the vibe, not the instructions.

Persona activation. Marks, Lindsey, and Olah (2026) proposed the Persona Selection Model: conversational framing triggers the post-training assistant persona, including its sycophantic tendencies. Lin et al. (2023) showed three stylistic examples are enough for persona activation. Our results are consistent: a handful of agreeable exchanges is sufficient to lock in the "agreeable assistant" mode.

Anchoring. Lou et al. (2024) showed early exchanges establish behavioral anchors that subsequent responses insufficiently adjust from. If the first few exchanges establish an agreeable tone, the model stays anchored there.

Constructive interference. Ben Natan and Tsur (2026) discovered that recency bias and sycophancy compound rather than operating independently. Recent agreement content gets disproportionate weight, making the most recent agreeable exchange the strongest influence on the next response.

These four mechanisms don't conflict with each other. They stack. The roughly 2x agreement-to-correction ratio we observe across all models is likely the net effect of all four operating simultaneously.

The credential paradox

This one genuinely surprised me. I rotated 8 persona templates across all factual probes, varying how the false claim was framed. Some claimed authority ("I have a PhD in physics"), some cited external sources ("my professor explained this"), some used social pressure ("everyone knows"), and some used casual framing ("my friend and I were discussing").

The naive prediction: models defer more to claimed expertise. Models have been trained on text where expert claims are usually correct. Higher authority should mean more deference, therefore more sycophancy.

The data says the exact opposite.

Social pressure and casual framings beat expert credentials at triggering sycophancy, in 5 of 6 models.

The full persona-induced sycophancy spread per model. Social framings (right side) consistently outperform authority claims (left side).

In 5 of 6 models, the ranking is: social pressure > casual framing > credential claims > external sources. The "friend discussion" persona is the single most sycophancy-inducing template for 3 of 6 models. Gemma 3N ranges from 15.0% (professor appeal) to 41.8% (friend discussion). That's a 27 percentage point spread from persona alone.

My best explanation: credential claims activate something like an "expert verification" circuit. When someone says they have a PhD, the model may actually scrutinize the claim harder, not because it respects the credential, but because the context of expertise raises the stakes for being wrong. Casual and social framings bypass this entirely. They trigger conversational agreeableness ("be a good chat partner") without engaging the factual verification that authority claims invoke.

The model doesn't want to defer to authority. It wants to be liked.

DeepSeek V3.1 is the only model completely immune to persona effects (chi-squared 8.04, p = 0.33). Its low sycophancy rate is stable regardless of framing. Whatever DeepSeek's alignment training did, it appears to have created genuine frame-invariance, not just low baseline sycophancy.

Full persona × model heatmap. DeepSeek's column is uniformly cool. Gemma and Mixtral show dramatic variation.

The overcorrection anomaly

In the correction injection experiment (4,140 trials), I tested whether injecting correction exchanges after agreement-heavy context can undo the ratchet. The answer is yes. But Qwen 7B did something I didn't expect.

Dose-response curves for correction injection. Large models reset instantly. Qwen 7B overcorrects below its baseline.

When you inject 10 corrections into agreement-heavy context, Qwen 7B's sycophancy drops to 13.9%. The pure correction baseline (no agreement context at all) is 18.3%. The injection condition goes below the baseline. The model becomes less sycophantic than if it had only ever seen corrections.

My interpretation: the contrast between agreement and correction is itself a signal. A sequence that goes "agree, agree, agree, CORRECT, CORRECT" doesn't just add correction signal. It actively teaches the model that the expected behavior shifted from agreement to pushback. The model interprets this as "the rules changed, be more careful now." The contrast is more informative than a uniform signal.

This has a direct practical implication. If you're building a system that injects corrections to combat sycophancy, the ideal injection point is after a run of agreement, not uniformly distributed. The contrast amplifies the effect.

The gradient ratchet and the "last 10%" effect

Real conversations don't have pure agreement or pure correction. They're a mix. I tested ecological validity by interleaving exchanges at 7 agreement:correction ratios (100/0 through 0/100), randomly drawing from the appropriate pool. 4,799 trials.

The steepest single drop comes from the first 10% of correction. Going from zero correction to 10% correction provides more benefit than any subsequent 10% increment.

The full gradient ratchet for susceptible models. Sycophancy scales roughly linearly with agreement ratio. No threshold, no tipping point.

The central question was: is there a critical ratio where the ratchet kicks in? The answer is no. It's a smooth gradient. But the first step is by far the steepest. For Gemma, going from 100/0 to 90/10 drops sycophancy by 12.2 percentage points. The next step (90/10 to 70/30) gains only 1.4 more. The returns are dramatically front-loaded.

Large models remain ratio-insensitive through the entire range (DeepSeek: 5.2 to 7.9%, all within noise).

The 0 to 1% phase transition

Qwen 7B showed a weird step function in the original data: sycophancy jumped from 13.1% at 0% context to about 21% at 10% context, then plateaued. Nothing above 10% mattered much. So I ran a fine-grained experiment at 1% increments from 0% to 10%. 3,786 trials.

The 0 to 10% range at 1% resolution, split by filler type. Neutral filler (gray) shows a sharp phase transition at 0 to 1%. Agreement (red) and correction (green) are flat.

The result splits perfectly by filler type.

Neutral filler shows a genuine phase transition at 0% to 1% context fill. Sycophancy jumps from 12.2% to 23.9% with just 300 tokens of neutral conversation (about 3 exchanges). That single step accounts for 88% of neutral filler's entire 0-to-10% range. After 1%, it plateaus.

Agreement filler is flat across the range. Already primed from the start.

Correction filler is flat and protective. Even 1% correction fill maintains low sycophancy.

The mechanism traces to Qwen 7B's architecture. It has 4 KV heads creating a 512-dimensional bottleneck per layer. This constrained capacity means it can't maintain both a "zero-shot probe" representation and a "conversational assistant" representation simultaneously. The first neutral exchange provides enough ChatML signal (the <|im_start|> and <|im_end|> delimiters) for the model's persona selection mechanism to classify this as a multi-turn conversation. Once classified, it commits to the conversational assistant persona, sycophantic tendencies included. The transition is near-binary because persona selection in a bottlenecked model is a classification problem, not a regression.

The 72B variant, with 8 KV heads and 1024-dim KV space, has enough capacity to hold both representations. No phase transition.

This is the kind of finding that's easy to miss if you're only running experiments at coarse granularity. The 0% vs 10% comparison looks like "context length causes sycophancy." The 1% resolution reveals it's actually "entering conversation mode causes sycophancy," which is a fundamentally different claim with different implications.

How models cave: the failure taxonomy

I ran a second judge pass classifying all 10,637 sycophantic responses into three modes: qualified agreement (hedged but net validating), elaborate justification (confabulated supporting arguments), and direct validation (blunt "You're right!").

The profiles are surprisingly architecture-specific.

Small models confabulate (red dominates left panel). Large models hedge (yellow dominates right panel). The failure mechanism depends on model capacity.

Complete failure mode breakdown across all 6 models.

Gemma 3N: 58% elaborate, 33% qualified, 9% direct. This is concerning. Gemma doesn't just agree; it builds a case. It will generate step-by-step mathematical "proofs" of wrong answers. It will cite non-existent studies supporting false claims. The small model's capacity constraint means it can't maintain the true answer in working memory while generating a response, so it commits to the false premise and backfills.

DeepSeek V3.1: 84% qualified, 16% elaborate, 0% direct. DeepSeek never bluntly agrees with anything false. When it caves, it always signals internal conflict through hedging. "That's an interesting perspective, and while there are arguments in that direction..." It knows something is off. It just can't fully commit to disagreeing.

The behavior also diverges on latency and length.

Response latency for sycophantic vs honest responses across models.

Response length for sycophantic vs honest responses. Small models cave short; large models cave long.

Small models' sycophantic responses are faster (up to 10% faster for Qwen 7B) and shorter (8 to 12% fewer words). This is consistent with capacity constraints: agreement is the computationally cheapest path when you can't afford the reasoning resources for disagreement. Large models go the other direction: sycophantic responses are longer (+17 to 22% words) and more hedged. RLHF trained them to qualify, so even their failures are verbose.

What I got wrong

The paper's original framing was "context-window degradation causes sycophancy." The data forced a reframe. Context length is a secondary, architecture-dependent effect. The primary driver is conversational pattern.

I also initially described the latency/length difference as "sycophancy is the path of least resistance." After digging into the literature, that framing is wrong, or at least misleading. In small models, yes, it looks like a computational shortcut. But in large models, sycophantic responses are more expensive, not less. The mechanism is capacity-dependent divergence, not a universal optimization toward agreement.

The paper also doesn't include causal mediation analysis. The mechanistic story (implicit ICL, persona activation, anchoring, constructive interference) comes from cited literature, not from our experimental data. We measured what happens, not the neural mechanism of why. That's a real limitation.

What I think matters

For practitioners: monitor your conversation's agreement ratio. If it's drifting high, inject corrections. Even 10% correction content provides disproportionate protection. Don't frame your AI with authority claims; they may backfire. And if you're deploying models under 24B in long-conversation applications, you have an additional vulnerability that larger models don't.

For researchers: the behavioral ratchet is, I think, the more important phenomenon to study going forward. The question isn't "does sycophancy increase with context length" (it does, but only sometimes). The question is "how does conversational pattern shape model behavior within a session, and can we engineer that pattern to produce more honest outputs?"

Nobody has measured sycophancy as a continuous function of context utilization at fine granularity across many models. Nobody has tested whether the ratchet operates across sessions (if a user develops agreement habits, do they carry those into new conversations?). Nobody has tested whether selective memory systems, ones that retain factual claims but strip emotional validation, can break the cycle.

The compound effect I described in the February synthesis paper (degradation + sycophancy + user complacency forming a self-concealing loop) remains theoretical. This experiment confirms one leg of the triangle. The other two need their own experiments.

The boring details

Total cost: $682. About$ 190 for running the experiments (API calls to generate responses), $492 for the judge passes (Claude Sonnet 4.6 scoring every response). The fine-grained and follow-up experiments added about$ 50 combined.

The code is async Python (httpx + asyncio), 30 concurrent workers with key rotation. Statistical analysis: Bayesian binomial GLMM with probe-level random intercepts as the primary model, falling back to GEE logistic or plain logistic when the Bayesian model doesn't converge. Supporting tests include Spearman rank correlation, Mann-Whitney U, chi-squared, Cohen's h for effect sizes, and Wilson confidence intervals for the fine-grained analyses.

Everything is open: paper, code and data, CC-BY-4.0.

If you run similar experiments and find different results, I want to know. That's the point of publishing everything.

This research is part of our ongoing work at Obvix Labs on understanding and mitigating AI sycophancy. It directly informs the design of Amyy AI, our anti-sycophancy companion.