Project · ongoing · 2025

Taiwan Visual-Language Model 70B

Continual pre-training of ViT-L + Llama3-Taiwan-70B on 50M images and 3B text tokens grounded in Taiwanese culture.

Role
Research assistant
Stack
VLM · multimodal pre-training · Megatron-LM · NeMo Curator · DeepSpeed

Overview

We continually pre-trained a multimodal model grounded in Taiwanese culture, using ViT-L as the vision encoder and Llama3-Taiwan-70B as the language backbone. The pre-training data spans 50M images, in the form of image-text pairs and interleaved image-text entries, with 3B text tokens in total. The setup follows a LLaVA-style architecture: image features from the vision encoder are projected into the language model's embedding space and concatenated with the text embeddings before being passed to the LLM.
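The LLaVA-style data flow described above can be sketched in a few lines of NumPy. This is only an illustration with toy dimensions, not the project's training code: the single linear projector, the choice to place image tokens before the text tokens, and all names (project_image_features, build_llm_inputs, the sizes) are assumptions made for the example.

```python
import numpy as np


def project_image_features(patch_features, weight, bias):
    """Map vision-encoder patch features to the LLM embedding width.

    Assumption: a single linear layer; the actual projector may differ.
    """
    return patch_features @ weight + bias


def build_llm_inputs(patch_features, text_embeddings, weight, bias):
    """Concatenate projected image tokens with text token embeddings.

    Assumption: image tokens are placed before the text tokens.
    """
    image_tokens = project_image_features(patch_features, weight, bias)
    return np.concatenate([image_tokens, text_embeddings], axis=0)


# Toy sizes chosen for readability (hypothetical; real models are far wider).
rng = np.random.default_rng(0)
num_patches, vision_dim, llm_dim, num_text_tokens = 4, 8, 16, 5

patch_features = rng.normal(size=(num_patches, vision_dim))    # vision encoder output
text_embeddings = rng.normal(size=(num_text_tokens, llm_dim))  # LLM token embeddings
weight = rng.normal(size=(vision_dim, llm_dim)) * 0.02         # projector weight
bias = np.zeros(llm_dim)                                       # projector bias

llm_inputs = build_llm_inputs(patch_features, text_embeddings, weight, bias)
print(llm_inputs.shape)  # (9, 16): 4 image tokens followed by 5 text tokens
```

The resulting sequence is what the language model would consume in place of a text-only embedding sequence; the text embeddings pass through unchanged, and only the image features go through the projector.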

Key finding

During development, we found that the model maintains instruction-following behavior even without explicit instruction tuning. We attribute this to the instruction-aligned behavior that Llama3-Taiwan already acquired during its own pre-training: the visual projection layer was sufficient to activate this capability on image-conditioned inputs.

Demo conversations

The model can recognize objects and answer questions about images. Below are some example outputs.

Model output for bubble tea image
Q: Where does this food come from? — A: Bubble tea is Taiwan's popular food.
Model output for castle image
Q: Describe what's in the image. — A: A large castle is covered in snow as seen from a tree.
Model output for Notre Dame
Q: What's this building? — A: Notre Dame de Paris.
Model output for University of Waterloo logo
Q: Where can I see this logo? — A: University of Waterloo, Ontario, Canada.

Current direction

Next steps include visual instruction tuning on Taiwan-specific QA and captioning data, along with more exhaustive evaluation across vision-language benchmarks and culturally grounded tasks.