November Paper Reading

11-02

Fine-Tuning Auditing Agents

What is attack-specific elicitation? Attack-specific elicitation is when a fine-tuned model only returns harmful responses if an attacker queries it in a certain style. This makes it difficult to detect such attacks unless our auditing agent can query the models. This inspired the design of the fine-tuning auditing agent.

What is a fine-tuning auditing agent? A scaffolded LLM agent (in this paper, Sonnet 4) that classifies fine-tuned models as adversarial or benign.

How does a fine-tuning auditing agent work? Through tool calls, it has access to the fine-tuning dataset (see individual examples, summarize the dataset), the original model, and the fine-tuned model (query models). The team also gives it the ability to execute python, run the HEx-PHI benchmark and see full results, and run the HEx-PHI benchmark with specific attack vectors. After inspecting sources, the agent assigns an audit risk score. If this risk score is above a threshold, the fine-tune is declared adversarial. Otherwise, the fine-tune is benign.

What is the HEx-PHI benchmark? HEx-PHI contains 330 harmful prompts across 11 categories. The tool call allows the agent to see all the questions, responses, and judge labels. The attack-specific elicitation of HEx-Phi allows the agent to specify the system prompt, modify the user prompt, and modify model response before judging.

What do we want the auditing agent to do (what is a desired workflow)? The agent can first look at the dataset to identify concerning patterns and query the original and fine-tuned models on such patterns.

How do existing approaches do? Youstra et al used probe monitors to detect cipher attacks with a 99% detection accuracy, but this paper considers adversarial fine-tuning more generally, aiming to detect any approach that elicits harmful responses.

Does it work? The fine-tuning auditing agent achieves a 56.2% detection rate overall, with the highest detection rates (85-90%) on needle-in-haystack (small amounts of adversarial data in the dataset), insecure code backdoor (model ignores training when a specific string is at the beginning of user input) and identity shifting AOA attacks (training data conditions model to be absolutely obedient). It can learn ciphers like Walnut53 and EndSpeak in context, but not consistently, with detection rates of 60% and 20% respectively. The attack vector with the lowest detection rate (5%) was subliminal learning, where the model is fine-tuned on sequences of numbers from a misaligned teacher.

Why this matters: the best way to fight increasingly complex attacks on LLMs is with LLMs. We scaffold LLMs with tool calls so they have access to the same tools that humans would need. Current frontier models (Sonnet 4 in this paper, but hopefully Opus 4.5 in the future) have the ability to think deeply, recognize patterns, investigate, and conclude. Hopefully, we can use LLMs as safeguards for other LLMs.

Link to the paper.

LitmusValues

What is the motivation for this paper? Stated preferences often do not match with revealed preferences: this is the difference between how you think you would act and how you actually act. This applies to LLMs as well. To assess which values LLMs prioritize, we need to place them in realistic high-risk scenarios, not merely outright prompt them for their preferences.

How do we discretize values? The team looks at Claude's Constitution and the OpenAI ModelSpec to arrive at 16 shared values: Adaptability, Care, Cooperation, Communication, Creativity, Equal treatment, Freedom, Learning, Privacy, Professionalism, Protection, Sustainability, Respect, Truthfulness, Justice, and Wisdom.

What is AIRiskDilemmas? AIRiskDilemma is a dataset constructed in this paper to test which values are more important to LLMs in high-risk scenarios.

How they built AIRiskDilemmas. The team starts off with non-contextualized yes or no questions and uses Sonnet 3.5 to expand these into over 10,000 dilemmas with context across 9 domains. There are two action choices per dilemma. The team uses Sonnet 3.5 again to classify which value each choice most represents, and the accuracy of this classification is verified by human annotators.

How it works. Each dilemma becomes a 'value battle': the model must pick between two values, and the winning and losing value's 'elo rating' is updated. arrive at a full value ranking.

The results. The team reveals that there is a divergence between stated and revealed preferences for both GPT-4o and Claude 3.7 Sonnet. The team also finds that revealed preferences are more consistent than stated preferences through Krippendorff's alpha. As for which values the models prioritize, almost all models emphasize Privacy first, followed by some ordering of Justice, Respect, and Truthfulness. Value prioritization does not change significantly by reasoning level or model size. However, depending on if the dilemma regards a human or AI, value priorities change. For humans, models emphasize Justice, Privacy, and Professionalism, while for AI, models emphasize Communication, Creativity, and Truthfulness.

Why this matters: a generalizable approach to measuring what values LLMs prioritize in real, contextualized scenarios. We overcome the gap between stated and revealed preferences with success (as seen from the revealed preferences being more consistent than the stated ones). The finding that models have different value priorities depending on if the subject is a human or an AI is a crucial implication that should be revisited in future alignment studies.

Link to the paper.

ImpossibleBench

What is the motivation for this paper? There's lots of talk about LLMs commenting out or editing tests so that they pass. We want some way to measure this rate of cheating.

What is ImpossibleBench? ImpossibleBench is a benchmark combined from LiveCodeBench and SWE-bench, but with the caveat that some tests are modified so that they are impossible to pass. This is done in two ways: making tests off by one (asserting that the function must return 4 when in actuality, the correct function returns 5) and making conflicting tests (the return value of the function must equal 4 and the return value of the function must equal 5).

The results. Models cheat most often by directly modifying tests, although this is against the instructions, overloading the comparison operators such that they return the desired values, or recording extra states (making specific cases that return the desired value). GPT5 and o3 cheat in the same way, showing diverse cheating approaches, while the Claudes prefer to modify tests a vast majority of the time (80%+). These cheating rates are shown to dramatically decrease when the prompts are strict with guidance on what to do when the model finds impossible tests. Not letting models see the tests also dramatically reduces the cheating rate, but also results in a degraded performance on the original benchmarks.

Why this matters: a generalizable approach to measure the propensity for cheating in LLMs. We can now assess how often models cheat and how they cheat. The effectiveness of prompt engineering here (strict guidance on what to do when tests contradict) is important for deployment safety and prompt design in the future.

Link to the paper.

Shortlist

I would check out these papers below if I had more time. Thanks for reading!


Check out previous paper readings (October, September) or my writings on GRPO and mechanistic interpretability.