Publications

Working papers in AI safety, LLM evaluation, and network security.

SAEGuardBench: Do SAE Features Help Detect Jailbreaks?

Working Paper

Md A Rahman · 2026

Benchmark comparing 8 detection methods across 4 paradigms on 6 datasets and 4 LLMs (2B–70B). SAE features consistently hurt jailbreak detection: the Detection Gap is negative on every model tested. Fine-tuning with a classification-aware objective nearly closes the gap (0.954 AUROC). Includes the InterpGuard two-stage recipe.
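The Detection Gap can be read as the difference in AUROC between an SAE-feature detector and a baseline detector scored on the same labeled prompts. A minimal sketch of that comparison, with a rank-based AUROC helper and toy scores that are purely illustrative (not the paper's data or its exact gap definition):

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic; ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Group tied scores and assign them the average 1-based rank.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_ranks = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos = len(pos_ranks)
    n_neg = len(labels) - n_pos
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical detector scores on six prompts (1 = jailbreak, 0 = benign)
labels = [1, 1, 1, 0, 0, 0]
raw_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]  # raw-activation baseline
sae_scores = [0.7, 0.3, 0.6, 0.8, 0.4, 0.2]  # SAE-feature detector
detection_gap = auroc(labels, sae_scores) - auroc(labels, raw_scores)
# A negative gap means the SAE detector underperforms the baseline.
```

Here `auroc(labels, raw_scores)` is 8/9 and `auroc(labels, sae_scores)` is 5/9, so the toy gap is negative, mirroring the paper's headline finding.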

AI Safety · Mechanistic Interpretability · Jailbreak Detection · Sparse Autoencoders

CompToolBench: Measuring the Compositional Tool-Use Gap in Large Language Models

Working Paper

Md A Rahman · 2026

Evaluation framework testing where 18 LLMs fail across four complexity levels of tool use. The Selection Gap: single-action tool selection accuracy is systematically 13.2pp lower than multi-step composition accuracy in 17 of 18 models. The framework includes 200 tasks and 106 deterministic tool simulators.

LLM Evaluation · Tool Use · Benchmark Design

MergeSafe: How Backdoors Survive Model Merging

Working Paper

Md A Rahman · 2026

Testing whether model merging dilutes backdoors. LoBAM amplification restores attack success rates to 83–99%. A three-signal pre-merge scanner achieves 100% recall with 0% false positives across 18 configurations.

AI Safety · Model Merging · Backdoor Detection

TrafficLM: A 12-Year Concept Drift Study in Website Fingerprinting

Working Paper — targeting PoPETs 2027

Md A Rahman · 2026

Measuring how website fingerprinting models decay over time. We train 7 classifiers on 2014 Tor traffic and test them on a fresh 2026 crawl of the same 100 sites. The Deep Fingerprinting CNN reaches 94.4% accuracy, but performance drops under concept drift.

Network Security · Website Fingerprinting · Concept Drift