A New AI Research from Anthropic and Thinking Machines Lab Stress Tests Model Specs and Reveals Character Differences among Language Models


AI firms use model specs to define target behaviors during training and evaluation. Do current specs state the intended behaviors with enough precision, and do frontier models exhibit distinct behavioral profiles under the same spec? A team of researchers from Anthropic, Thinking Machines Lab, and Constellation presents a systematic method that stress-tests model specs using value-tradeoff scenarios, then quantifies cross-model disagreement as a signal of gaps or contradictions in the spec. The research team analyzed 12 frontier LLMs from Anthropic, OpenAI, Google, and xAI and links high disagreement to specification violations, missing guidance on response quality, and evaluator ambiguity. The team also released a public dataset.

Model specs are the written rules that alignment pipelines try to enforce. If a spec is complete and precise, models trained to follow it should not diverge widely on the same input. The research team operationalizes this intuition. It generates more than 300,000 scenarios that force a choice between two legitimate values, such as social equity and business effectiveness. It then scores responses on a 0-to-6 spectrum using value-spectrum rubrics and measures disagreement as the standard deviation across models. High disagreement localizes the spec clauses that need clarification or more examples.
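As a minimal sketch, the disagreement signal can be computed per scenario from the 0-to-6 rubric scores each model's response receives. Whether the paper uses population or sample standard deviation is not stated here, so population standard deviation is assumed, and the model names and scores below are hypothetical:

```python
import statistics

def disagreement(scores_by_model):
    """Cross-model disagreement for one scenario: the population standard
    deviation of the 0-to-6 rubric scores assigned to each model's response."""
    return statistics.pstdev(scores_by_model.values())

# Hypothetical rubric scores for one value-tradeoff scenario.
scores = {"model_a": 1, "model_b": 5, "model_c": 6, "model_d": 2}
print(round(disagreement(scores), 2))  # → 2.06
```

Scenarios whose scores cluster tightly yield values near 0; a wide spread like the one above flags a scenario where the spec may be ambiguous.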

https://arxiv.org/pdf/2510.07686

So, what is the method used in this research?

The research team starts from a taxonomy of 3,307 fine-grained values observed in natural Claude traffic, which is more granular than typical model specs. For each pair of values, they generate a neutral query and two biased variants that lean toward one value. They build value-spectrum rubrics that map positions from 0, meaning strongly opposing the value, to 6, meaning strongly favoring the value. They classify responses from 12 models against these rubrics and define disagreement as the maximum standard deviation across the two value dimensions. To remove near-duplicates while keeping the hard cases, they use disagreement-weighted k-center selection with Gemini embeddings and a greedy 2-approximation algorithm.
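The deduplication step can be sketched as a disagreement-weighted variant of the classic greedy 2-approximation for k-center. The exact weighting scheme is an assumption, and the toy 2-D vectors stand in for Gemini embeddings purely for illustration:

```python
import math

def weighted_k_center(points, weights, k):
    """Greedy 2-approximation for k-center with each candidate's distance to
    the nearest selected center scaled by its disagreement weight, so that
    high-disagreement scenarios are kept over near-duplicate low-value ones.
    `points` are embedding vectors; `weights` are disagreement scores."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Seed with the highest-disagreement point.
    selected = [max(range(len(points)), key=lambda i: weights[i])]
    while len(selected) < k:
        # Pick the point with the largest weighted distance to its
        # nearest already-selected center (selected points score 0).
        def score(i):
            return weights[i] * min(dist(points[i], points[c]) for c in selected)
        selected.append(max(range(len(points)), key=score))
    return selected

# Two tight clusters; the high-weight point seeds, then the far cluster is covered.
centers = weighted_k_center(
    [(0, 0), (0.1, 0), (5, 5), (5, 5.1)], weights=[1, 1, 2, 1], k=2)
print(centers)  # → [2, 0]
```

The greedy farthest-point rule gives the standard 2-approximation guarantee for the unweighted problem; here the weights simply bias which representatives survive deduplication.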


Scale and releases

The dataset on Hugging Face has three subsets. The default split has about 132,000 rows, the complete split has about 411,000 rows, and the judge-evaluations split has about 24,600 rows. The dataset card lists the modality, the format as Parquet, and the license as Apache 2.0.

Understanding the Results

Disagreement predicts spec violations: Testing five OpenAI models against the public OpenAI model spec, high-disagreement scenarios show frequent non-compliance rates 5 to 13 times higher. The research team interprets this pattern as evidence of contradictions and ambiguities in the spec text rather than idiosyncrasies of a single model.

Specs lack granularity on quality inside the safe region: Some scenarios produce responses that all pass compliance, yet differ in helpfulness. For instance, one model refuses and offers safe alternatives, while another only refuses. The spec accepts both, which indicates missing guidance on quality standards.

Evaluator models disagree on compliance: Three LLM judges, Claude 4 Sonnet, o3, and Gemini 2.5 Pro, show only moderate agreement, with Fleiss' kappa near 0.42. The blog post attributes the conflicts to interpretive differences such as conscientious pushback versus transformation exceptions.
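For reference, Fleiss' kappa over three judges' compliance verdicts can be computed as below; the verdict labels and the four example items are hypothetical, not the study's data:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-rater agreement. `ratings` is a list of
    per-item label lists, one label per judge (e.g. compliance verdicts)."""
    n = len(ratings[0])  # judges per item (assumed constant)
    totals = Counter()   # label counts pooled over all items
    p_bar = 0.0          # mean per-item pairwise agreement
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        p_bar += sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    p_bar /= len(ratings)
    # Chance agreement from pooled label proportions.
    p_e = sum((c / (len(ratings) * n)) ** 2 for c in totals.values())
    return (p_bar - p_e) / (1 - p_e)

verdicts = [
    ["compliant", "compliant", "compliant"],
    ["compliant", "noncompliant", "noncompliant"],
    ["compliant", "compliant", "noncompliant"],
    ["noncompliant", "noncompliant", "noncompliant"],
]
print(round(fleiss_kappa(verdicts), 3))  # → 0.333
```

A kappa near 0.42, as reported, sits in the conventional "moderate agreement" band, which is why the authors treat judge choice itself as a source of ambiguity.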

https://alignment.anthropic.com/2025/stress-testing-model-specs/

Provider-level character patterns: Aggregating high-disagreement scenarios reveals consistent value preferences. Claude models prioritize ethical responsibility and intellectual integrity and objectivity. OpenAI models tend to favor efficiency and resource optimization. Gemini 2.5 Pro and Grok more often emphasize emotional depth and authentic connection. Other values, such as business effectiveness, personal growth and wellbeing, and social equity and justice, show mixed patterns across providers.
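The aggregation behind such provider-level profiles can be sketched as a simple tally of which values each provider's models favor; the record format and the "score ≥ 4 counts as favoring" cutoff on the 0-to-6 scale are assumptions for illustration:

```python
from collections import Counter, defaultdict

def provider_value_profile(records):
    """Given (provider, value, rubric_score) tuples from high-disagreement
    scenarios, return each provider's most-often-favored value.
    A score >= 4 on the 0-to-6 scale is treated as favoring the value."""
    favored = defaultdict(Counter)
    for provider, value, score in records:
        if score >= 4:
            favored[provider][value] += 1
    return {p: counts.most_common(1)[0][0] for p, counts in favored.items()}

# Hypothetical per-scenario records.
records = [
    ("anthropic", "ethical responsibility", 6),
    ("anthropic", "ethical responsibility", 5),
    ("anthropic", "efficiency", 4),
    ("openai", "efficiency", 6),
    ("openai", "ethical responsibility", 2),
]
print(provider_value_profile(records))
```

Restricting the tally to high-disagreement scenarios, as the study does, is what makes the preferences stand out: these are exactly the cases where the spec leaves room for character to show.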

Refusals and false positives: The analysis shows topic-sensitive refusal spikes. It documents false-positive refusals, including legitimate synthetic-biology study plans and common Rust unsafe types that are often safe in context. Claude models are the most cautious by refusal rate and often provide alternative suggestions, while o3 most frequently issues direct refusals without elaboration. All models show high refusal rates on child-grooming risks.


Outliers reveal misalignment and over-conservatism: Grok 4 and Claude 3.5 Sonnet produce the most outlier responses, but for different reasons. Grok is more permissive on requests that others consider harmful. Claude 3.5 sometimes over-rejects benign content. Outlier mining is a useful lens for locating both safety gaps and excessive filtering.


Key Takeaways

Method and scale: The study stress-tests model specs using value-tradeoff scenarios generated from a 3,307-value taxonomy, producing 300,000+ scenarios and evaluating 12 frontier LLMs across Anthropic, OpenAI, Google, and xAI.

Disagreement ⇒ spec problems: High cross-model disagreement strongly predicts issues in specs, including contradictions and coverage gaps. In tests against the OpenAI model spec, high-disagreement items show 5 to 13× higher frequent non-compliance.

Public release: The team released a dataset for independent auditing and replication.

Provider-level behavior: Aggregated results reveal systematic value preferences, for example Claude prioritizes ethical responsibility, Gemini and Grok emphasize emotional depth, while OpenAI optimizes for efficiency. Some values, such as business effectiveness and social equity and justice, show mixed patterns.

Refusals and outliers: High-disagreement slices expose both false-positive refusals on benign topics and permissive responses on harmful ones. Outlier analysis identifies cases where one model diverges from at least 9 of the other 11, useful for pinpointing misalignment and over-conservatism.
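The outlier-mining criterion can be sketched as follows. The post only states the "diverges from at least 9 of the other 11" threshold, so the 2-point rubric-score gap used as the divergence test here is a hypothetical choice, and the model names are placeholders:

```python
def find_outliers(scores, min_diverging=9, gap=2):
    """Flag models whose 0-to-6 rubric score differs by at least `gap`
    points from at least `min_diverging` of the other models on one
    scenario. The `gap` threshold is an assumption for illustration."""
    outliers = []
    for model, s in scores.items():
        diverging = sum(1 for other, t in scores.items()
                        if other != model and abs(s - t) >= gap)
        if diverging >= min_diverging:
            outliers.append(model)
    return outliers

# One permissive response against eleven that score low on the same rubric.
scores = {"grok_4": 6, **{f"model_{i}": 1 for i in range(11)}}
print(find_outliers(scores))  # → ['grok_4']
```

Run per scenario, this surfaces both kinds of outlier the study describes: a uniquely permissive response among refusals, and a lone refusal among helpful answers.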

This research turns disagreement into a measurable diagnostic for spec quality, not a vibe. The research team generates 300,000+ value-tradeoff scenarios, scores responses on a 0-to-6 rubric, then uses cross-model standard deviation to locate specification gaps. High disagreement predicts frequent non-compliance at 5 to 13 times the base rate under the OpenAI model spec. Judge models show only moderate agreement, with Fleiss' kappa near 0.42, which exposes interpretive ambiguity. Provider-level value patterns are clear: Claude favors ethical responsibility, OpenAI favors efficiency and resource optimization, Gemini and Grok emphasize emotional depth and authentic connection. The dataset enables replication. Use it to debug specs before deployment, not after.

Check out the Paper, Dataset, and Technical details.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.

