About the Job
At Thinking Machines Lab, we are dedicated to empowering humanity through the advancement of collaborative general intelligence. Our vision is to create a future where everyone has access to the knowledge and tools necessary to harness AI for their distinct needs and aspirations.
Our team comprises scientists, engineers, and innovators who have developed some of the most widely used AI products in the world, including ChatGPT and Character.ai, as well as leading open-weight models like Mistral and popular open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.
About the Role
As the Lead for Data Partnerships at Thinking Machines Lab, you will oversee the complete data procurement pipeline for frontier model training. This includes understanding the data requirements of our research teams, sourcing and finalizing agreements with providers, and managing the quality and delivery of data. You will serve as the bridge between our research teams, our legal teams, and external vendors, ensuring our teams have timely access to the right data.
This role suits someone technically inclined who is eager to dig into the details of data in support of an ambitious research agenda. You must be adept at switching contexts between planning the data needed for training runs and negotiating pricing with vendors. Over time, you will establish scalable and repeatable processes so that our data operations keep pace with our research efforts.
What You Will Do
- Lead and coordinate end-to-end data procurement initiatives, ensuring complex sourcing activities are conducted with efficiency, transparency, and scientific rigor.
- Collaborate closely with research teams to proactively identify data needs across pre-training, post-training, and evaluation workstreams, anticipating requirements rather than merely reacting to requests.
- Source, assess, and onboard data providers, developing and maintaining a pipeline of potential vendors across various domains.
- Negotiate pricing, licensing terms, and contract structures with data providers, collaborating with legal teams to finalize agreements that align with our research objectives.
- Evaluate incoming data alongside researchers, determining quality and coverage for intended training goals.
- Monitor and manage ongoing data deliveries, tracking schedules, addressing issues, and verifying that received data aligns with agreements.
- Create repeatable, scalable processes for the entire data procurement lifecycle, making data sourcing faster and more systematic over time.
- Translate technical data requirements into actionable plans with clear milestones, ensuring team alignment across projects.
