Data Scientist at Sciforium | San Francisco

Sciforium, San Francisco
On-site, Full-time

Qualifications

The ideal candidate should possess a strong background in data science and engineering, with experience in handling large datasets and familiarity with machine learning frameworks. A Bachelor's degree in Computer Science, Data Science, Statistics, or a related field is preferred. Proficiency in Python and experience with data processing libraries are essential.

About the job

Sciforium is an innovative AI infrastructure company specializing in the development of state-of-the-art multimodal AI models, alongside a proprietary high-efficiency serving platform. With substantial financial backing and direct collaboration with AMD, including hands-on support from AMD engineers, our rapidly growing team is dedicated to building the comprehensive stack that powers cutting-edge AI models and real-time applications.

Role Overview

We are looking for a highly skilled and visionary Data Scientist to lead the strategy and creation of the large-scale datasets our foundational models depend on. For Large Language Models (LLMs), we see data as the key competitive advantage. This role covers the entire data lifecycle, from web-scale crawling to the meticulous creation of human-aligned datasets that shape model behavior.

The ideal candidate will embrace data as both a large-scale engineering challenge and a complex analytical puzzle. Your responsibilities will extend beyond simply delivering data; you will design taxonomies, filtering heuristics, and post-training pipelines to ensure our models excel in reasoning, safety, and multimodal comprehension.

Key Responsibilities

  • Foundation Dataset Strategy: Oversee the comprehensive creation of pre-training datasets for LLMs, defining the optimal mix of web data, code, literature, and technical documents to enhance downstream model performance.

  • Petabyte-Scale Curation: Innovate and implement advanced pipelines for data cleaning, deduplication (exact and fuzzy), and high-quality signal extraction from vast amounts of unstructured data.

  • Post-Training & Alignment Data: Direct the creation of high-quality post-training datasets, including Supervised Fine-Tuning (SFT) instructions, multi-turn dialogues, and preference modeling data (RLHF/DPO).

  • Multimodal Expansion: Lead the acquisition and processing of vision and video data, addressing the challenges of multimodal alignment, video compression, and temporal data consistency.

  • High-Performance Engineering: Create high-throughput data processing scripts utilizing Python, employing multiprocessing and multithreading to manage large-scale ingestion and transformation without performance bottlenecks.

  • Data Profiling & Analysis: Perform in-depth statistical analysis on training datasets to uncover biases, knowledge gaps, and quality regressions, keeping the training data mix statistically balanced.

  • Synthetic Data Generation (Added Value): Develop pipelines to generate high-quality synthetic datasets that enhance model training and capabilities.

About Sciforium

At Sciforium, we're at the forefront of AI technology, committed to advancing the capabilities of AI through innovative solutions. Our collaborative environment fosters creativity and technical excellence, making us a leader in the AI infrastructure space.
