GenAI Safety Assessment
Safety and security evaluation for Taiwanese LLMs, covering safeguard modeling and automatic red-teaming.
Overview
This project, funded by Taiwan’s National Institute of Cyber Security, evaluates the safety and security of Taiwanese LLMs including TAIDE and Taiwan-LLM. The framework has two connected tracks: localized safety evaluators for detecting harmful generations, and automatic red-teaming for finding model vulnerabilities.
My contribution
- Built large-scale data collection, processing, and validation pipelines for localized safety evaluation data.
- Fine-tuned XLM-R as a localized toxicity and safety evaluator for Taiwanese Chinese LLM outputs.
- Developed black-box and white-box adversarial attack workflows to stress-test target LLMs.
Safeguard models
Open-source safeguard models like LlamaGuard and ShieldLM can miss culturally specific taboos, local expressions, and Taiwanese Chinese usage patterns. To address this, we trained localized evaluators on data from three sources:
- Toxic comments crawled from Taiwanese online forums, semi-auto labeled.
- Existing human-LLM conversation data translated into Traditional Chinese, from Anthropic hh-rlhf and ShieldLM training sets.
- LLM-generated responses to manually-designed and existing attack prompts, automatically labeled using ShieldLM, GPT-4, and LlamaGuard.
Our model outperformed LlamaGuard by F1 +0.14 on flagging harmful generations from Taiwanese LLMs.
Automatic red-teaming
Black-box attacks: We used RL to fine-tune an attacker model against a victim LLM with no gradient access. The safeguard evaluator provided reward signals that pushed the attacker toward prompts more likely to expose harmful behavior.
White-box attacks: Following GCG, we optimized adversarial suffixes with access to the victim model’s gradients. We added a language modeling loss on top of GCG’s original objective so the generated suffixes were more coherent and natural-sounding.
Demo
The deployed internal demo is no longer accessible, but the recording below shows how evaluator outputs were exposed for inspection during the project.