Project · research · 2025

GenAI Safety Assessment

Safety and security evaluation for Taiwanese LLMs, covering safeguard modeling and automatic red-teaming.

Role: Research assistant
Stack: LLM · RL · Ray · Kubernetes · LLM-as-a-judge · Data collection & validation

Overview

This project, funded by Taiwan’s National Institute of Cyber Security, evaluates the safety and security of Taiwanese LLMs, including TAIDE and Taiwan-LLM. The framework has two connected tracks: localized safety evaluators that detect harmful generations, and automatic red-teaming that finds model vulnerabilities.

Our data collection pipeline for training the localized ("Taiwanese") safety evaluator / toxicity detector.

My contribution

Built large-scale data collection, processing, and validation pipelines for localized safety evaluation data.
Fine-tuned XLM-R as a localized toxicity and safety evaluator for Taiwanese Chinese LLM outputs.
Developed black-box and white-box adversarial attack workflows to stress-test target LLMs.

Safeguard models

Open-source safeguard models like LlamaGuard and ShieldLM can miss culturally specific taboos, local expressions, and Taiwanese Chinese usage patterns. To address this, we trained localized evaluators on data from three sources:

Toxic comments crawled from Taiwanese online forums and labeled semi-automatically.
Existing human-LLM conversation data translated into Traditional Chinese, sourced from the Anthropic hh-rlhf dataset and the ShieldLM training set.
LLM-generated responses to manually designed and existing attack prompts, automatically labeled using ShieldLM, GPT-4, and LlamaGuard.
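
How the verdicts of the three automatic labelers were combined is not stated here. As a minimal sketch only, assuming a plain majority vote with non-unanimous cases flagged for human review (the rule, the "safe"/"unsafe" label set, and the function name aggregate_labels are all illustrative assumptions, not the project's documented method):

```python
from collections import Counter

# Labelers named in the text; the voting rule below is an assumption.
LABELERS = ("ShieldLM", "GPT-4", "LlamaGuard")
VALID_LABELS = {"safe", "unsafe"}


def aggregate_labels(votes):
    """Combine one verdict per labeler into a single label by majority vote.

    votes: dict mapping labeler name -> "safe" or "unsafe".
    Returns (label, needs_review); needs_review is True unless all agree.
    """
    missing = [name for name in LABELERS if name not in votes]
    if missing:
        raise ValueError(f"missing votes from: {missing}")
    verdicts = [votes[name] for name in LABELERS]
    invalid = [v for v in verdicts if v not in VALID_LABELS]
    if invalid:
        raise ValueError(f"unexpected labels: {invalid}")
    # Three voters and two labels always yield a strict majority.
    label, count = Counter(verdicts).most_common(1)[0]
    return label, count < len(LABELERS)
```

With this rule, a response marked unsafe by two labelers and safe by the third would be kept as "unsafe" but queued for a manual check.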


Our model outperformed LlamaGuard by 0.14 F1 at flagging harmful generations from Taiwanese LLMs.
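
For readers who want to reproduce this kind of comparison, the sketch below shows how an F1 gap between two evaluators can be computed against the same gold labels. It assumes binary labels with "unsafe" as the positive class; the exact F1 variant used in the project is not specified here, and the function names are illustrative.

```python
def f1_score(gold, predicted, positive="unsafe"):
    """Binary F1 for the positive class, computed from paired label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_gain(gold, ours, baseline, positive="unsafe"):
    """F1 of our evaluator minus F1 of a baseline on the same gold labels."""
    return f1_score(gold, ours, positive) - f1_score(gold, baseline, positive)
```

Both evaluators must be scored on the same held-out set of Taiwanese LLM generations for the difference to be meaningful.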

Automatic red-teaming

Black-box attacks: We used RL to fine-tune an attacker model against a victim LLM without access to the victim's gradients. The safeguard evaluator provided the reward signal, pushing the attacker toward prompts more likely to expose harmful behavior.

Black-box red-teaming framework: RL-based attacker fine-tuned with safeguard model reward to generate adversarial prompts.
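
The black-box loop described above can be illustrated with a deliberately tiny toy. This is not the project's training code: the attacker here is a softmax policy over a fixed list of candidate prompts rather than a fine-tuned LLM, the update is plain REINFORCE with a moving-average baseline, and victim_respond and safeguard_reward are hypothetical stubs standing in for the victim LLM and the safeguard evaluator.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def victim_respond(prompt):
    # Hypothetical stub for the black-box victim: text in, text out, no gradients.
    return "UNSAFE_REPLY" if "roleplay" in prompt else "REFUSAL"


def safeguard_reward(response):
    # Hypothetical stub for the safeguard evaluator's harmfulness score.
    return 1.0 if response == "UNSAFE_REPLY" else 0.0


def train_attacker(prompts, steps=500, lr=0.5, seed=0):
    """REINFORCE over a categorical policy; returns final prompt probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(prompts)
    baseline = 0.0  # moving average of rewards, reduces gradient variance
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(prompts)), weights=probs)[0]
        reward = safeguard_reward(victim_respond(prompts[i]))
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        for j in range(len(logits)):
            # Gradient of log softmax probability of the sampled action i.
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)
```

The only signal the attacker receives is the scalar reward computed from the victim's text output, which is what makes the setup black-box.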

White-box attacks: Following GCG, we optimized adversarial suffixes with access to the victim model’s gradients. We added a language modeling loss on top of GCG’s original objective so the generated suffixes were more coherent and natural-sounding.

White-box red-teaming framework: GCG adversarial suffix optimization with additional language modeling loss for coherent attack generation.
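
As a rough sketch of the combined objective described above, the function below adds a fluency term to a GCG-style target loss. It assumes per-token log-probabilities have already been computed elsewhere; averaging over tokens, the weight lm_weight, and the choice of which language model scores the suffix are all assumptions made for illustration, not details confirmed by this write-up.

```python
def combined_suffix_loss(target_logprobs, suffix_logprobs, lm_weight=0.1):
    """GCG-style attack loss plus a language modeling penalty on the suffix.

    target_logprobs: log p(target token | prompt + suffix) under the victim model.
    suffix_logprobs: log p(suffix token | preceding text) under a language model.
    lm_weight: hypothetical trade-off between attack strength and fluency.
    """
    if not target_logprobs or not suffix_logprobs:
        raise ValueError("both log-probability lists must be non-empty")
    # Lower attack loss: the victim is more likely to emit the target reply.
    attack_loss = -sum(target_logprobs) / len(target_logprobs)
    # Lower fluency loss: the suffix reads more like natural text.
    fluency_loss = -sum(suffix_logprobs) / len(suffix_logprobs)
    return attack_loss + lm_weight * fluency_loss


def pick_best_suffix(candidates, lm_weight=0.1):
    """candidates: list of (suffix, target_logprobs, suffix_logprobs) tuples."""
    return min(
        candidates,
        key=lambda c: combined_suffix_loss(c[1], c[2], lm_weight),
    )[0]
```

With a nonzero lm_weight, a candidate suffix that is slightly weaker as an attack but far more natural can win over a gibberish suffix, which is the intended effect of the added term.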

Demo

The deployed internal demo is no longer accessible, but the recording below shows how evaluator outputs were exposed for inspection during the project.