About the job
Sciforium is an innovative AI infrastructure company specializing in the development of state-of-the-art multimodal AI models, alongside a proprietary high-efficiency serving platform. With substantial financial backing and direct collaboration with AMD, including hands-on support from AMD engineers, our rapidly growing team is dedicated to building the comprehensive stack that powers cutting-edge AI models and real-time applications.
Role Overview
We are looking for a highly skilled and visionary Data Scientist to lead the strategy and creation of the vast datasets that power our foundation models. In the world of Large Language Models (LLMs), we recognize that data is the key competitive advantage. This role spans the entire data lifecycle—from web-scale crawling to the meticulous construction of human-aligned datasets that shape model behavior.
The ideal candidate will embrace data as both a large-scale engineering challenge and a complex analytical puzzle. Your responsibilities will extend beyond simply delivering data; you will design taxonomies, filtering heuristics, and post-training pipelines to ensure our models excel in reasoning, safety, and multimodal comprehension.
Key Responsibilities
Foundation Dataset Strategy: Oversee the comprehensive creation of pre-training datasets for LLMs, defining the optimal mix of web data, code, literature, and technical documents to enhance downstream model performance.
Petabyte-Scale Curation: Innovate and implement advanced pipelines for data cleaning, deduplication (exact and fuzzy), and high-quality signal extraction from vast amounts of unstructured data.
Post-Training & Alignment Data: Direct the creation of high-quality post-training datasets, including Supervised Fine-Tuning (SFT) instructions, multi-turn dialogues, and preference modeling data (RLHF/DPO).
Multimodal Expansion: Lead the acquisition and processing of vision and video data, addressing the challenges of multimodal alignment, video compression, and temporal data consistency.
High-Performance Engineering: Build high-throughput data processing pipelines in Python, using multiprocessing and multithreading to handle large-scale ingestion and transformation without performance bottlenecks.
Data Profiling & Analysis: Perform in-depth statistical analysis of training datasets to uncover biases, knowledge gaps, and quality regressions, ensuring a well-balanced training data mix.
Synthetic Data Generation (Bonus): Develop pipelines to generate high-quality synthetic datasets that enhance model training and capabilities.