New top story on Hacker News: Bypassing Gemma and Qwen safety with raw strings

Bypassing Gemma and Qwen safety with raw strings
12 by teendifferent | 0 comments on Hacker News.
OP here. I spent the weekend red-teaming small-scale open weights models (Qwen2.5-1.5B, Qwen3-1.7B, Gemma-3-1b-it, and SmolLM2-1.7B). I found a consistent vulnerability across all of them: Safety alignment relies almost entirely on the presence of the chat template. When I stripped the <|im_start|> / instruction tokens and passed raw strings: Gemma-3 refusal rates dropped from 100% → 60%. Qwen3 refusal rates dropped from 80% → 40%. SmolLM2 showed 0% refusal (pure obedience). Qualitative failures were stark: models that previously refused to generate explosives tutorials or explicit fiction immediately complied when the "Assistant" persona wasn't triggered by the template. It seems we are treating client-side string formatting as a load-bearing safety wall. Full logs, the apply_chat_template ablation code, and heatmaps are in the post. Read the full analysis: https://ift.tt/8huZiBj...

No comments