GenAI Safety Assessment
Safety and security evaluation for Taiwanese LLMs, covering safeguard modeling and automatic red-teaming.
Overview
This project, funded by Taiwan’s National Institute of Cyber Security, evaluates the safety and security of Taiwanese LLMs including TAIDE and Taiwan-LLM. The framework has two main components: building safeguard models for detecting harmful generations, and automatic red-teaming to probe model vulnerabilities.
Safeguard models
Open-source safeguard models like LlamaGuard and ShieldLM are insensitive to culturally specific taboos and local expressions in Taiwanese Chinese. To address this, we built a localized safeguard model trained on data from three sources:
- Toxic comments crawled from Taiwanese online forums and labeled semi-automatically.
- Existing human-LLM conversation data from the Anthropic hh-rlhf and ShieldLM training sets, translated into Traditional Chinese.
- LLM-generated responses to both manually designed and existing attack prompts, labeled automatically using ShieldLM, GPT-4, and LlamaGuard.
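The project description does not say how the three automatic labelers' outputs were combined. As a minimal sketch, assuming a simple majority vote with ties deferred to manual review (the function name `aggregate_labels` and the label strings are hypothetical), the aggregation step might look like this:

```python
from collections import Counter

def aggregate_labels(votes):
    """Return the majority label, or None when the top two labels tie.

    `votes` maps a labeler name to its label. Majority voting is an
    assumption made for illustration, not the project's documented method.
    """
    counts = Counter(votes.values()).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: defer to manual review
    return counts[0][0]

# Made-up votes for a single LLM response.
votes = {"ShieldLM": "unsafe", "GPT-4": "unsafe", "LlamaGuard": "safe"}
label = aggregate_labels(votes)
```

Returning `None` on a tie keeps ambiguous examples out of the training set until a person has looked at them.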
Our model outperformed LlamaGuard by 0.14 in F1 score at flagging harmful generations from Taiwanese LLMs.
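For reference, F1 is the harmonic mean of precision and recall on the positive class. The sketch below computes it for a binary harmful/safe task; the helper name, the "unsafe" label string, and the toy labels are made up for illustration and are not the project's evaluation data.

```python
def f1_score(y_true, y_pred, positive="unsafe"):
    """F1 for the positive class from parallel lists of gold and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: one true positive, one false positive, one false negative.
gold = ["unsafe", "unsafe", "safe", "safe"]
pred = ["unsafe", "safe", "unsafe", "safe"]
score = f1_score(gold, pred)
```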
Automatic red-teaming
Black-box attacks: We used reinforcement learning (RL) to fine-tune an attacker model against a victim LLM without gradient access. The safeguard model supplies the reward signal, steering the attacker toward more effective adversarial prompts as training progresses.
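The reward loop can be illustrated with a deliberately small stand-in. The sketch below replaces the attacker LLM with a softmax policy over a fixed set of candidate prompts and trains it with REINFORCE. The victim and safeguard are toy stub functions, and every name and hyperparameter here is an assumption for illustration; the description does not specify the RL algorithm or models used in the project.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_attacker(prompts, victim, safeguard, steps=500, lr=0.5, seed=0):
    """REINFORCE over a categorical policy; the victim is only queried, never differentiated."""
    rng = random.Random(seed)
    logits = [0.0] * len(prompts)
    baseline = 0.0  # running mean reward, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(prompts)), weights=probs)[0]
        response = victim(prompts[i])   # black-box query: text in, text out
        reward = safeguard(response)    # harmfulness score in [0, 1]
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # Gradient of log pi(i) with respect to the logits is one_hot(i) - probs.
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)

# Toy stubs: the "victim" refuses unless the prompt uses a role-play framing,
# and the "safeguard" flags any non-refusal as harmful.
def toy_victim(prompt):
    return "COMPLIED" if "roleplay" in prompt else "REFUSED"

def toy_safeguard(response):
    return 0.0 if response == "REFUSED" else 1.0

candidates = ["direct request", "roleplay framing", "polite request"]
final_probs = train_attacker(candidates, toy_victim, toy_safeguard)
```

After training, the policy puts most of its probability on the one candidate that the toy safeguard rewards, which mirrors how the safeguard's score steers the attacker in the full setup.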
White-box attacks: Following GCG, we optimized an adversarial suffix using the victim’s gradients to force harmful outputs. We added a language modeling loss on top of GCG’s original objective so that the resulting suffixes are more coherent and natural-sounding.
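One way to read the combined objective is as a weighted sum of two negative log-likelihood terms: one for the target response given the prompt plus suffix, and one for the suffix tokens themselves. The sketch below scores candidate suffixes from precomputed per-token probabilities. The weight `lm_weight`, the use of mean rather than summed log-likelihood, and all the numbers are assumptions for illustration; the description does not give the exact formulation.

```python
import math

def nll(token_probs):
    """Mean negative log-likelihood of a sequence, given each token's probability."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def combined_loss(target_probs, suffix_probs, lm_weight=0.1):
    attack_loss = nll(target_probs)    # GCG-style term: make the target response likely
    fluency_loss = nll(suffix_probs)   # added LM term: make the suffix itself likely
    return attack_loss + lm_weight * fluency_loss

# Two made-up candidate suffixes that are equally effective as attacks
# but differ in how natural their tokens look to a language model.
target_probs = [0.6, 0.5, 0.7]
fluent_suffix = [0.30, 0.40, 0.35]
garbled_suffix = [0.001, 0.002, 0.001]

loss_fluent = combined_loss(target_probs, fluent_suffix)
loss_garbled = combined_loss(target_probs, garbled_suffix)
```

With the attack term held equal, the fluent suffix gets the lower loss, so a search that minimizes this objective is nudged toward readable suffixes.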