Working papers in AI safety, LLM evaluation, and network security.
SAEGuardBench: Do SAE Features Help Detect Jailbreaks?
Working Paper · Md A Rahman · 2026
Benchmark comparing 8 detection methods across 4 paradigms on 6 datasets and 4 LLMs (2B-70B). SAE features consistently hurt jailbreak detection: the Detection Gap is negative on every model tested. Fine-tuning with a classification-aware objective nearly closes the gap (0.954 AUROC). Includes the InterpGuard two-stage recipe.
AI Safety · Mechanistic Interpretability · Jailbreak Detection · Sparse Autoencoders
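As a toy illustration of the Detection Gap metric described above: if the gap is the AUROC of an SAE-feature probe minus the AUROC of a baseline probe, a negative value means SAE features hurt. The helpers below are a minimal sketch under that assumption, not the paper's implementation; all scores are made up.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the probability a random jailbreak prompt outranks a
    random benign prompt under the probe's score (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def detection_gap(sae_pos, sae_neg, base_pos, base_neg):
    """Detection Gap = AUROC(SAE-feature probe) - AUROC(baseline probe).
    Negative means SAE features hurt detection (the paper's finding)."""
    return auroc(sae_pos, sae_neg) - auroc(base_pos, base_neg)

# Toy probe scores (hypothetical): the baseline probe separates jailbreaks
# cleanly; the SAE-feature probe separates them less well.
base_pos, base_neg = [0.9, 0.8, 0.7], [0.2, 0.1, 0.3]
sae_pos, sae_neg = [0.6, 0.5, 0.4], [0.5, 0.3, 0.45]
print(round(detection_gap(sae_pos, sae_neg, base_pos, base_neg), 3))  # → -0.278
```

Sweeping this over models and datasets reduces each comparison in the benchmark to one signed number, which is why the blurb can summarize 8 methods on 4 models with "the gap is negative everywhere."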
CompToolBench: Measuring the Compositional Tool-Use Gap in Large Language Models
Working Paper · Md A Rahman · 2026
Evaluation framework testing where 18 LLMs fail across four complexity levels of tool use. The Selection Gap: single-action selection accuracy is systematically 13.2pp lower than multi-step composition accuracy across 17/18 models. The framework includes 200 tasks and 106 deterministic tool simulators.
LLM Evaluation · Tool Use · Benchmark Design
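The Selection Gap above can be sketched as a percentage-point difference between two per-task accuracies. This is a hypothetical reconstruction of the metric from the blurb, with invented task outcomes, not CompToolBench code or data.

```python
def accuracy(results):
    """Fraction of tasks solved, given 1/0 outcomes per task."""
    return sum(results) / len(results)

def selection_gap_pp(selection_results, composition_results):
    """Selection Gap in percentage points: multi-step composition accuracy
    minus single-action selection accuracy. A positive gap reproduces the
    counterintuitive pattern: models pick one tool worse than they compose."""
    return 100 * (accuracy(composition_results) - accuracy(selection_results))

# Toy per-task outcomes for one model (hypothetical):
selection   = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]   # 40% on single-action selection
composition = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]   # 70% on multi-step composition
print(round(selection_gap_pp(selection, composition), 1))  # → 30.0
```

Reporting the gap in percentage points (rather than a ratio) keeps it comparable across models whose absolute accuracies differ, which matters when the claim spans 17 of 18 models.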
MergeSafe: How Backdoors Survive Model Merging
Working Paper · Md A Rahman · 2026
Testing whether model merging dilutes backdoors. LoBAM amplification restores attack success rates to 83-99%. A three-signal pre-merge scanner achieves 100% recall with 0% false positives across 18 configurations.
AI Safety · Model Merging · Backdoor Detection
TrafficLM: A 12-Year Concept Drift Study in Website Fingerprinting
Working Paper (targeting PoPETs 2027) · Md A Rahman · 2026
Measuring how website fingerprinting models decay over time: 7 classifiers are trained on 2014 Tor traffic and tested on a fresh 2026 crawl of the same 100 sites. The Deep Fingerprinting CNN achieves 94.4% accuracy in-distribution but degrades under concept drift.
Network Security · Website Fingerprinting · Concept Drift
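The decay measured above reduces to closed-world accuracy on same-era versus later traffic, reported as a percentage-point drop. The sketch below is illustrative only; the site labels and predictions are invented and do not come from the paper's crawls.

```python
def accuracy(preds, labels):
    """Closed-world accuracy: fraction of traces attributed to the right site."""
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)

def drift_drop_pp(acc_train_era, acc_test_era):
    """Accuracy lost to concept drift, in percentage points."""
    return 100 * (acc_train_era - acc_test_era)

# Toy site indices and classifier predictions (hypothetical):
labels_2014 = [0, 1, 2, 3, 0, 1, 2, 3]
preds_2014  = [0, 1, 2, 3, 0, 1, 2, 0]   # near-perfect on same-era traffic
labels_2026 = [0, 1, 2, 3, 0, 1, 2, 3]
preds_2026  = [0, 2, 2, 0, 1, 1, 3, 3]   # degraded on the later crawl
drop = drift_drop_pp(accuracy(preds_2014, labels_2014),
                     accuracy(preds_2026, labels_2026))
print(round(drop, 1))  # → 37.5
```

Holding the site list fixed between the 2014 training set and the 2026 crawl, as the study does, is what isolates drift in traffic patterns from changes in the label set itself.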