Taiwan Visual-Language Model 70B
Continual pre-training of ViT-L + Llama3-Taiwan-70B on 50M images and 3B text tokens grounded in Taiwanese culture.
Overview
We continually pre-trained a multimodal system grounded in Taiwanese culture, using ViT-L as the vision encoder and Llama3-Taiwan-70B as the language backbone. The pre-training data comprises 50M image-text pairs and image-text-interleaved entries, totaling 3B text tokens. The setup follows a LLaVA-style architecture: image features from the vision encoder are projected into the language model's embedding space and concatenated with the language embeddings before being passed to the LLM.
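The projection-and-concatenation step can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: the toy dimensions, the single linear projector (LLaVA-style models may use an MLP instead), the choice to place visual tokens before the text tokens, and the helper names `project` and `build_llm_input`. It is not the project's actual implementation.

```python
import random

# Toy dimensions for illustration only (not the real model sizes).
D_VISION = 8    # width of each vision-encoder patch feature
D_MODEL = 16    # width of each LLM token embedding
N_PATCHES = 4   # number of image patches from the vision encoder
N_TOKENS = 5    # number of text tokens in the prompt


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights or encoder outputs."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(features, weight, bias):
    """Linearly map each vision feature (D_VISION) to the LLM width (D_MODEL).

    `weight` holds one row of D_VISION coefficients per output dimension.
    """
    return [
        [sum(x * w for x, w in zip(row, w_row)) + b
         for w_row, b in zip(weight, bias)]
        for row in features
    ]


def build_llm_input(image_features, text_embeddings, weight, bias):
    """Project image features, then concatenate them with the text embeddings.

    Assumption: visual tokens come first in the sequence.
    """
    visual_tokens = project(image_features, weight, bias)
    return visual_tokens + text_embeddings


rng = random.Random(0)
image_features = random_matrix(N_PATCHES, D_VISION, rng)  # hypothetical ViT output
text_embeddings = random_matrix(N_TOKENS, D_MODEL, rng)   # hypothetical token embeddings
weight = random_matrix(D_MODEL, D_VISION, rng)            # hypothetical projector weight
bias = [0.0] * D_MODEL

llm_input = build_llm_input(image_features, text_embeddings, weight, bias)
print(len(llm_input), len(llm_input[0]))  # sequence length, embedding width
```

In this sketch the LLM sees one sequence of `N_PATCHES + N_TOKENS` vectors, all of width `D_MODEL`, so the language backbone itself needs no architectural change to accept image-conditioned inputs.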
Key finding
During development, we found that the model maintains instruction-following behavior even without explicit instruction tuning. We attribute this to the instruction alignment that Llama3-Taiwan already carries from its pre-training: training the visual projection layer was sufficient to activate this capability on image-conditioned inputs.
Demo conversations
The model can recognize objects and answer questions about images. Below are some example outputs.
Current direction
Next steps include visual instruction tuning on Taiwan-specific QA and captioning data, along with more exhaustive evaluation across vision-language benchmarks and culturally grounded tasks.