About the job
About Us
White Circle is a pioneering AI safety organization dedicated to building a robust framework for the safety, reliability, and optimization of AI systems. Our platform runs on straightforward natural-language policies that define how AI models should behave, and we automate the testing, enforcement, and ongoing improvement of these policies at scale.
We have successfully raised $11 million from prominent investors, including founders and senior leaders from OpenAI, Anthropic, HuggingFace, Mistral, DeepMind, Datadog, Sentry, and more.
Our infrastructure processes over 100 million API calls every month.
We specialize in fine-tuning and training our proprietary LLMs to ensure faster and more cost-efficient performance than any available open or proprietary models.
We are a compact, highly motivated team. If you are eager to tackle challenging problems, ship your work rapidly, and make a real impact on building AI safety – we want to hear from you.
Your Responsibilities:
Train vision-language models from the ground up and fine-tune existing architectures for advanced image understanding.
Expand VLM capabilities to video by designing innovative temporal modeling approaches and handling long contexts efficiently.
Develop impactful evaluation benchmarks focusing on visual QA, spatial reasoning, and video comprehension.
Curate and maintain multimodal datasets, including creating synthetic data generation pipelines.
Train and optimize mixture-of-experts (MoE) architectures to improve multimodal inference efficiency.
Deploy models into production, focusing on quantization, batching strategies, and latency optimization.
You Will Excel If You Have:
3+ years of experience in training and fine-tuning vision-language models (e.g., LLaVA, Qwen-VL, InternVL).
A strong background in multimodal architectures, with a clear understanding of how vision encoders, projectors, and LLMs integrate.
Hands-on experience with RLHF/alignment for multimodal systems, specifically GRPO, DPO, and reward modeling.
Experience in video understanding, including temporal modeling and efficient attention mechanisms.