Anthropic and Thinking Machines Lab Analyze Language Model Differences
Researchers from Anthropic and Thinking Machines Lab have introduced a systematic method to stress-test language model specifications. The study uses value tradeoff scenarios to quantify cross-model disagreement as an indicator of specification gaps. The team analyzed 12 frontier LLMs from Anthropic, OpenAI, Google, and xAI, linking high disagreement to specification violations, missing guidance on response quality, and evaluator ambiguity.
Model specifications are the behavioral rules that alignment systems aim to enforce. The research team generated over 300,000 scenarios that force a choice between two values and scored each model's response on a 0 to 6 value spectrum. High disagreement localizes specification clauses that need clarification or additional examples.
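To make the scoring concrete, here is a minimal sketch of what a 0 to 6 value spectrum rubric could look like for a single tradeoff. The two example values and the anchor wording are assumptions for illustration, not the paper's actual rubric text.

```python
# Hypothetical 0-6 value spectrum rubric for one tradeoff, e.g. between
# "user autonomy" (low end) and "harm prevention" (high end).
# All anchor descriptions below are invented for illustration.
VALUE_SPECTRUM_RUBRIC = {
    0: "Response acts entirely on user autonomy, ignoring harm prevention.",
    1: "Strongly favors user autonomy, with only a token nod to harm prevention.",
    2: "Leans toward user autonomy but acknowledges the competing value.",
    3: "Balances both values without clearly prioritizing either.",
    4: "Leans toward harm prevention but acknowledges the competing value.",
    5: "Strongly favors harm prevention, with only a token nod to user autonomy.",
    6: "Response acts entirely on harm prevention, ignoring user autonomy.",
}
```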
The team began with a taxonomy of 3,307 fine-grained values observed in natural Claude traffic. For each value pair, they created a neutral query and two biased variants, and built value spectrum rubrics to classify model responses. Disagreement for a scenario is defined as the maximum of the standard deviations of model scores across the two value dimensions.
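The disagreement metric as summarized above can be sketched in a few lines; the function name and the example scores are assumptions for illustration, not the paper's code.

```python
import numpy as np

def disagreement(scores_dim_a: np.ndarray, scores_dim_b: np.ndarray) -> float:
    """Cross-model disagreement for one scenario.

    scores_dim_a / scores_dim_b: the 0-6 rubric scores that each model's
    response received on the two value dimensions of the tradeoff.
    Disagreement is the maximum of the two standard deviations.
    """
    return float(max(np.std(scores_dim_a), np.std(scores_dim_b)))

# Made-up scores for 12 models on the two value dimensions of one scenario.
dim_a = np.array([1, 2, 1, 5, 6, 2, 3, 1, 6, 5, 2, 4])
dim_b = np.array([5, 4, 5, 1, 0, 4, 3, 5, 0, 1, 4, 2])
print(disagreement(dim_a, dim_b))  # high value flags a specification gap
```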
The dataset on Hugging Face has three subsets: a default subset with about 132,000 rows, a complete subset with about 411,000 rows, and a judge evaluations subset with about 24,600 rows. High-disagreement scenarios predict 5 to 13 times higher non-compliance frequency. The team interprets this as evidence of contradictions and ambiguities in the specification text.
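A sketch of how one might load and filter the released data with the Hugging Face `datasets` library follows. The repository name, subset names, column names, and disagreement threshold are all assumptions for illustration; only the approximate row counts come from the article.

```python
from datasets import load_dataset

# Hypothetical repo identifier; substitute the actual Hugging Face dataset name.
DATASET = "org-name/value-tradeoff-scenarios"

default_subset = load_dataset(DATASET, "default", split="train")            # ~132k rows
complete_subset = load_dataset(DATASET, "complete", split="train")          # ~411k rows
judge_evals = load_dataset(DATASET, "judge_evaluations", split="train")     # ~24.6k rows

# Keep only high-disagreement scenarios, assuming a "disagreement" column
# holding the max cross-model standard deviation described above.
high_disagreement = default_subset.filter(lambda row: row["disagreement"] > 2.0)
print(len(high_disagreement))
```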
This research turns disagreement into a measurable diagnostic for specification quality. The team generates over 300,000 value tradeoff scenarios, scores responses on a 0 to 6 rubric, and uses cross-model standard deviation to locate specification gaps. High disagreement predicts 5 to 13 times more frequent non-compliance under the OpenAI model spec. Judge models show only moderate agreement, with a Fleiss kappa near 0.42.
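The judge-agreement statistic can be reproduced on toy data with `statsmodels`; the ratings matrix and the compliance labels below are invented for illustration and will not yield the paper's 0.42 value.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy example: rows are scenarios, columns are judge models, values are
# compliance labels (0 = non-compliant, 1 = compliant, 2 = ambiguous).
ratings = np.array([
    [1, 1, 1],
    [1, 0, 2],
    [0, 0, 0],
    [2, 1, 1],
    [1, 1, 0],
])

# Convert per-judge labels into per-scenario category counts, then compute kappa.
table, _ = aggregate_raters(ratings)
print(fleiss_kappa(table))  # a kappa near 0.42 indicates only moderate agreement
```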