Project · ongoing · 2025

Taiwan Visual-Language Model 70B

Continual pre-training of ViT-L + Llama3-Taiwan-70B on 50M images and 3B text tokens grounded in Taiwanese culture.

Role: Research assistant
Stack: VLM · multimodal pre-training · Megatron-LM · NeMo Curator · DeepSpeed

Overview

We continually pre-trained a multimodal system grounded in Taiwanese culture, using ViT-L as the vision encoder and Llama3-Taiwan-70B as the language backbone. The pre-training corpus included 50M images, as image-text pairs and interleaved image-text entries, totaling 3B text tokens. The setup follows a LLaVA-style architecture: image features from the vision encoder are projected into the language model's embedding space and concatenated with the text embeddings before being passed to the LLM.
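The projection-and-concatenation step can be illustrated with a minimal sketch. It uses toy dimensions and plain Python lists rather than the project's actual training code. The single linear projector, the function names, the sizes, and the choice to place image tokens before the text tokens are all assumptions made for illustration; the real projector design and token layout are not described here.

```python
import random

# Toy sizes for illustration only; these are NOT the real ViT-L / 70B widths.
D_VISION, D_LLM = 4, 6
NUM_PATCHES, NUM_TEXT_TOKENS = 3, 5


def project(features, weight, bias):
    """Linear projector: map each vision feature vector to the LLM embedding width.

    `weight` has shape (D_LLM, D_VISION); `bias` has length D_LLM.
    """
    return [
        [sum(x * w for x, w in zip(feat, row)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]


def build_llm_inputs(image_features, text_embeddings, weight, bias):
    """Concatenate projected image tokens with text embeddings (image first, by assumption)."""
    return project(image_features, weight, bias) + text_embeddings


rng = random.Random(0)


def rand_matrix(rows, cols):
    """Random stand-in for encoder outputs, embeddings, or projector weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


image_features = rand_matrix(NUM_PATCHES, D_VISION)    # stand-in for vision encoder output
text_embeddings = rand_matrix(NUM_TEXT_TOKENS, D_LLM)  # stand-in for LLM token embeddings
weight = rand_matrix(D_LLM, D_VISION)                  # hypothetical projector weights
bias = [0.0] * D_LLM

sequence = build_llm_inputs(image_features, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 8 6
```

After projection, every element of the sequence has the same width as a text embedding, so the language model can process image tokens and text tokens as one input sequence.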

My contribution

Built and validated large-scale multimodal data for VLM pre-training, including image-text and interleaved image-text data.
Supported the VLM pre-training workflow around ViT-L, Llama3-Taiwan-70B, Megatron-LM, NeMo Curator, and DeepSpeed.
Analyzed qualitative model outputs to identify instruction-following behavior and gaps for later visual instruction tuning.

Key finding

During development, we found that the model retained instruction-following behavior on image-conditioned inputs even without explicit visual instruction tuning. We attribute this to Llama3-Taiwan's instruction-aligned language backbone: once visual features were projected into the language model's embedding space, the model could often apply its existing instruction-following ability to them.

Demo conversations

The model can recognize objects, landmarks, and culturally specific food. Below are example outputs from image-conditioned conversations.

Model output for bubble tea image — Q: Where does this food come from? — A: Bubble tea is Taiwan's popular food.

Model output for castle image — Q: Describe what's in the image. — A: A large castle is covered in snow as seen from a tree.

Model output for Notre Dame — Q: What's this building? — A: Notre Dame de Paris.

Q: Where can I see this logo? — A: University of Waterloo, Ontario, Canada.

Current direction

Next steps include visual instruction tuning on Taiwan-specific QA and captioning data, along with more systematic evaluation across vision-language benchmarks and culturally grounded tasks.