Machine Learning Systems Engineer, Research Tools
About Anthropic
Anthropic is dedicated to advancing the field of artificial intelligence while ensuring safety, reliability, and interpretability in AI systems. Our mission is to create AI technologies that are not only powerful but also beneficial for users and society at large. We are a rapidly growing team of passionate researchers, engineers, policy experts, and business leaders collaborating to build the next generation of trustworthy AI systems.
Similar jobs
About Our Team
Join the innovative Sora team at OpenAI, where we are at the forefront of developing multimodal capabilities for our foundation models. Our hybrid research and product team is dedicated to seamlessly integrating multimodal functionalities into our AI solutions, ensuring they are dependable, user-centric, and aligned with our vision of benefiting society at large.

Role Overview
As a Machine Learning Engineer specializing in Distributed Data Systems, you will be instrumental in designing and scaling the infrastructure that facilitates large-scale multimodal training and evaluation at OpenAI. Your role will involve managing complex distributed data pipelines, collaborating closely with researchers to convert their requirements into robust, production-ready systems, and enhancing pipelines that are essential for Sora's rapid iteration cycles.

We are seeking detail-oriented engineers with extensive experience in distributed systems who thrive in high-stakes environments and excel at building resilient infrastructure.

This position is located in San Francisco, CA, and follows a hybrid work model, requiring three days in the office each week. We also provide relocation assistance for new team members.

Key Responsibilities:
- Design, implement, and maintain data infrastructure systems, including distributed computing, data orchestration, distributed storage, streaming infrastructure, and machine learning systems, with a focus on scalability, reliability, and security.
- Ensure our data platform can scale exponentially while maintaining high reliability and efficiency.
- Collaborate with researchers to gain a deep understanding of their requirements, translating them into production-ready systems.
- Strengthen, optimize, and manage critical data infrastructure systems that support multimodal training and evaluation.

You Will Excel in This Role If You:
- Possess strong experience with distributed systems and large-scale infrastructure, coupled with a keen interest in data.
- Exhibit meticulous attention to detail and a commitment to building and maintaining reliable systems.
- Demonstrate solid software engineering fundamentals and effective organizational skills.
- Thrive in environments characterized by ambiguity and rapid change.

About OpenAI
OpenAI is a trailblazing AI research and deployment organization committed to ensuring that general-purpose artificial intelligence serves humanity. We continuously push the boundaries of AI capabilities and strive to create technology that benefits everyone.
Pluralis Research
Overview
Pluralis Research is at the forefront of innovation in Protocol Learning, specializing in the collaborative training of foundational models. Our approach ensures that no single participant ever has or can obtain a complete version of the model. This initiative aims to create community-driven, collectively owned frontier models that operate on self-sustaining economic principles.

We are seeking experienced Senior or Staff Machine Learning Engineers with over 5 years of expertise in distributed systems and large-scale machine learning training. In this role, you will design and implement a groundbreaking substrate for training distributed ML models that function effectively over consumer-grade internet connections.
About Us
Sieve is a pioneering AI research lab dedicated solely to video data. We harness exabyte-scale video infrastructure and innovative video understanding techniques, along with a multitude of data sources, to create datasets that advance the field of video modeling. Given that video constitutes 80% of internet traffic, it serves as a vital medium that fuels creativity, communication, gaming, AR/VR, and robotics. Our mission is to tackle the most significant challenge in the development of these applications: acquiring high-quality training data.

With a small yet highly skilled team of just 15 members, we have formed strategic partnerships with leading AI labs and achieved $XXM in revenue last quarter alone. Our Series A funding round last year was backed by prestigious firms, including Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.

About the Role
As a Distributed Systems Engineer at Sieve, you will be responsible for designing and implementing systems that efficiently manage the compute, scheduling, and orchestration of complex machine learning and ETL pipelines. Your work will ensure these systems operate quickly, reliably, and cost-effectively while processing large volumes of video data.

You will thrive in this role if you are passionate about optimizing system uptime, have experience with cloud technologies, and enjoy working with high-performance distributed systems involving thousands of GPUs. Additionally, you will play a key role in developing excellent internal tools and CI/CD pipelines to facilitate rapid iteration.
About the Role:
Join our dynamic ML Infrastructure team as a Software Engineer, where you'll collaborate closely with the Machine Learning and Product teams to construct top-tier machine learning inference platforms. These cutting-edge platforms drive vital services such as personalized recommendations, search functionalities, and content comprehension at Tubi.

Your primary focus will be on the development and maintenance of low-latency ML model serving systems that cater to Deep Learning, LLM, and Search models. This will include the creation of self-service infrastructure and critical components such as the inference engine, feature store, vector store, and experimentation engine.

In this role, you'll enhance our service deployment and operational processes, with opportunities to contribute to open-source projects. Enjoy architectural freedom to explore innovative frameworks, spearhead significant cross-functional projects, and elevate the capabilities of our ML and Product teams.

We are currently hiring for two positions:
- Staff Software Engineer
- Principal Software Engineer

Additional Details: As a Principal Engineer, you will serve as a technical leader and visionary, guiding the advancement of our machine learning platform. You'll address complex technical challenges, shape architectural decisions, and mentor senior engineers, fostering a culture of excellence and continuous improvement. Your contributions will impact millions of users.
At Databricks, we are passionate about empowering data teams to tackle some of the world’s most challenging problems, from security threat detection to cancer drug development. Our mission is to build and operate the leading data and AI infrastructure platform, enabling our customers to concentrate on the high-value challenges that are integral to their own objectives.

Founded in 2013 by the original creators of Apache Spark™, Databricks has rapidly evolved from a small office in Berkeley, California, to a global powerhouse with over 1000 employees. Trusted by thousands of organizations, from startups to Fortune 100 companies, we are recognized as one of the fastest-growing SaaS companies worldwide.

Our engineering teams create highly sophisticated products that address significant needs in the industry. We continuously push the limits of data and AI technology while maintaining the resilience, security, and scalability essential for our customers' success on our platform. We manage one of the largest-scale software platforms, consisting of millions of virtual machines that generate terabytes of logs and process exabytes of data daily. At this scale, we frequently encounter cloud hardware, network, and operating system faults, and our software must effectively shield our customers from these challenges. Modern data analysis leverages advanced techniques, such as machine learning, that far exceed the capabilities of traditional SQL query engines.

As a Software Engineer on the Runtime team at Databricks, you will be instrumental in developing the next generation of distributed data storage and processing systems that outshine specialized SQL query engines in relational query performance, while providing the flexibility and programming abstractions to support a variety of workloads, from ETL to data science.

Examples of projects you may work on include:
- Apache Spark™: Contributing to the de facto open-source framework for big data.
- Data Plane Storage: Developing reliable, high-performance services and client libraries for storing and accessing vast amounts of data on cloud storage backends like AWS S3 and Azure Blob Store.
- Delta Lake: A storage management system that merges the scalability and cost-effectiveness of data lakes with the performance and reliability of data warehouses, featuring low-latency streaming. Its higher-level abstractions and guarantees, including ACID transactions and time travel, significantly reduce the complexity of real-world data engineering architectures.
- Delta Pipelines: Aiming to simplify the management of data engineering pipelines.
Why Join Achira?
- Become part of an exceptional team of scientists, ML researchers, and engineers dedicated to transforming the landscape of drug discovery.
- Engage with cutting-edge machine learning infrastructure at an unprecedented scale, leveraging extensive computing resources, vast datasets, and ambitious goals.
- Take ownership of significant projects from conception through to architecture and deployment on large-scale infrastructure.
- Thrive in a culture that values thoroughness, speed, and a proactive, builder-oriented mindset.

About the Role
At Achira, we are developing state-of-the-art foundation models that address the most complex challenges in simulation for drug discovery and beyond. Our atomistic foundation simulation models (FSMs) serve as comprehensive representations of the physical microcosm, encompassing machine learning interaction potentials (MLIPs), neural network potentials (NNPs), and various generative model classes.

We are looking for a Software Engineer who is enthusiastic about distributed computing and its applications in machine learning. You will play a pivotal role in designing and constructing the infrastructure for our ML data generation pipelines, model training, and fine-tuning workflows across large-scale distributed systems.

Your expertise will be crucial in ensuring our compute clusters are efficient, observable, cost-effective, and dependable, enabling us to advance the frontiers of ML development. If you are passionate about distributed systems, performance optimization, and cloud cost efficiency, we encourage you to apply.

You will be empowered to conceptualize and manage complex workloads across multiple vendors worldwide. Achira's mission revolves around computation, and providing seamless access to our uniquely tailored workloads at the lowest possible cost is critical to our success.
At Philo, we are a dedicated team of technology and product enthusiasts committed to reshaping the television landscape. We blend cutting-edge technology with the captivating medium of television to create the ultimate viewing experience. Our mission is to enhance streaming capabilities through innovative cloud delivery and sophisticated machine learning algorithms that personalize content discovery.

As a Senior Machine Learning Engineer specializing in Recommendation Systems, you will be at the forefront of our content personalization initiatives, significantly enhancing user engagement and satisfaction. Your expertise will help ensure that every time users open the Philo app, they find something they want to watch.

In this pivotal role, you will spearhead the development of advanced algorithms and large-scale systems that drive Philo's recommendation engine. Collaborating closely with data science, product, infrastructure, and backend engineering teams, you will tackle complex machine learning challenges and develop innovative, data-driven solutions that enhance content discovery and foster user retention.
Join Anthropic as a Machine Learning Systems Engineer within our Encodings and Tokenization team, where you'll play a pivotal role in refining and optimizing our tokenization systems across Pretraining and Finetuning workflows. By bridging the gap between our Pretraining and Finetuning teams, you will help shape the essential infrastructure that enhances how our AI models learn from diverse data. Your contributions will be crucial in ensuring our AI systems remain reliable, interpretable, and steerable, driving forward our mission of developing beneficial AI technologies.
Atomic Semi
About Atomic Semi
Atomic Semi is revolutionizing the semiconductor industry by creating a compact, high-speed semiconductor fabrication facility. Leveraging existing technology and innovative simplifications, we are committed to developing our own tools to enable rapid iteration and continuous improvement.

We are assembling a select group of talented, hands-on engineers across various disciplines: mechanical, electrical, hardware, computer, and process engineering. Our mission is to maintain ownership of the entire technological stack, from atomic structures to architectural design. Our optimistic team is dedicated to pushing the boundaries of technology.

We believe that smaller, faster, and self-created solutions are superior. Our lab is equipped with advanced 3D printers, diverse microscopes, e-beam writers, and general fabrication tools. If we identify any gaps, we will create the necessary innovations ourselves.

Founded by Sam Zeloof and Jim Keller, Atomic Semi combines Sam's garage chip-making expertise with Jim's 40 years of leadership in the semiconductor industry.

About the Role
As a Data & Machine Learning Engineer, you will develop systems that utilize fabrication data to enhance, monitor, and optimize our manufacturing processes. This role primarily involves converting raw data into actionable insights that drive improvements in yield, reliability, and throughput.
About Krea
Krea is at the forefront of developing advanced AI creative tools designed to enhance and empower human creativity. Our mission is to create intuitive and controllable AI solutions that allow creatives to express themselves across various formats, including text, images, video, sound, and 3D.

About the Position
We are seeking a talented Machine Learning Engineer to lead the design and implementation of Krea’s personalization and recommendation systems from the ground up. You will take full ownership of how we comprehend user preferences, curate engaging content, and customize generative models to reflect individual aesthetics.

This role sits at the exciting intersection of recommendation systems, representation learning, and generative imaging and video technologies.

Your Responsibilities
- Lead the architecture and development of Krea’s personalization and recommendation framework, overseeing the technical direction from inception to deployment.
- Craft algorithms that effectively model user preferences and tastes, enabling our systems to adapt to individual styles and aesthetics.
- Develop high-quality, curated feeds that strike a balance between exploration, personalization, and aesthetic coherence.
- Collaborate closely with our model and research teams to co-create personalization mechanisms that shape how our generative models learn, adapt, and express creative styles.
- Contribute to research in personalized image generation, with a focus on style, taste, and subjective quality.
- Work in tandem with product, design, and research teams to define what “good personalization” means in a creative context.
- Take systems from initial research and prototyping stages through to production, ongoing iteration, and enhancement.
Ando Technologies
Join Ando Technologies as a Machine Learning Engineer specializing in AI-native systems and forecasting. In this role, you will leverage cutting-edge machine learning algorithms to develop predictive models and enhance our AI-driven solutions. Collaborate with cross-functional teams to transform data into actionable insights and drive strategic decisions. Ideal candidates will have a passion for innovation and a strong understanding of AI technologies.
Matter Intelligence
Join Matter Intelligence as a Data and Machine Learning Infrastructure Engineer, where you will play a pivotal role in shaping the future of data-driven decision-making. You will be part of a dynamic team focused on building and optimizing infrastructure that supports innovative machine learning applications. Your expertise will help us enhance our data pipelines and ensure seamless integration of machine learning models into production.
Scale AI, Inc.
Join Scale AI's ML platform team (RLXF) as a Machine Learning Research Engineer, where you will play a pivotal role in developing our advanced distributed framework for training and inference of large language models. This platform is vital for enabling machine learning engineers, researchers, data scientists, and operators to conduct rapid and automated training, as well as evaluation of LLMs and data quality.

At Scale, we occupy a unique position in the AI landscape, serving as an essential provider of training and evaluation data along with comprehensive solutions for the entire ML lifecycle. You will collaborate closely with Scale's ML teams and researchers to enhance the foundational platform that underpins our ML research and development initiatives. Your contributions will be crucial in optimizing the platform to support the next generation of LLM training, inference, and data curation.

If you are passionate about driving the future of AI through groundbreaking innovations, we want to hear from you!
At Databricks, we are driven by a passion for empowering data teams to tackle the world’s most challenging problems, from transforming transportation to accelerating medical innovations. We achieve this by creating and maintaining the leading data and AI infrastructure platform, enabling our clients to leverage profound data insights for business enhancement.

Founded by engineers with a customer-first mentality, we eagerly embrace every opportunity to tackle complex technical challenges, ranging from the design of next-generation UI/UX for data interactions to scaling our services across millions of virtual machines. Our journey has just begun.

As a member of the Runtime team at Databricks, you will be instrumental in developing the next generation of distributed data storage and processing systems. These systems will surpass specialized SQL query engines in relational query performance while offering the programming abstractions necessary to support a variety of workloads, from ETL to data science.

Example projects include:
- Apache Spark™: Contribute to the de facto open-source standard framework for big data.
- Data Plane Storage: Develop reliable and high-performance services and client libraries for managing vast amounts of data within cloud storage backends like AWS S3 and Azure Blob Store.
- Delta Lake: Design a storage management system that merges the scalability and cost-effectiveness of data lakes with the performance and reliability of data warehouses, providing features like ACID transactions and time travel.
- Delta Pipelines: Simplify the orchestration and operation of numerous data pipelines, enabling clients to deploy, test, and upgrade pipelines effortlessly.
- Performance Engineering: Create the next-generation query optimizer and execution engine that is fast, scalable, and robust.
Applied Compute
About Us
At Applied Compute, we specialize in creating Specific Intelligence solutions for enterprises, developing agents that learn continuously from an organization’s processes, data, expertise, and objectives. We recognize a significant gap between the capabilities of AI models in isolation and their practical applications in real-world business contexts. These systems often fall short because they lack adaptability to feedback. To address this, we are building a continual learning infrastructure that captures context, memory, and decision-making processes throughout the enterprise, enabling specialized agents to effectively execute real tasks.

What Excites Us: We operate at a unique intersection where our product team constructs the platform that fuels a new generation of digital coworkers. Our research team pushes the boundaries of post-training and reinforcement learning, creating innovative product experiences. Our applied research engineers collaborate closely with clients to deploy models into production. This blend of strong product focus, deep research, and hands-on customer engagement is crucial for integrating AI into the enterprise. We are product-driven, research-informed, and actively engaged with our clients.

Our Team: Our diverse team consists of engineers, researchers, and operators, many of whom are former founders. We have built RL infrastructure at leading organizations like OpenAI and Scale AI, and developed systems at Together, Two Sigma, and Watershed. We proudly serve Fortune 50 clients alongside companies like DoorDash, Mercor, and Cognition. Our work is supported by renowned investors, including Benchmark, Sequoia, and Lux.

Who Thrives in Our Environment: We seek individuals eager to apply cutting-edge research and complex systems to tackle real-world challenges. You should be adept at quickly adapting to new environments, whether it’s a fresh codebase, a client’s data architecture, or an unfamiliar problem domain. A genuine enjoyment of customer interactions (listening, empathizing, and understanding how tasks are accomplished within their organizations) is essential. Those with entrepreneurial backgrounds, extensive side projects, or demonstrated end-to-end ownership typically excel in our company.
Join Cloudflare as a Distributed Systems Engineer within our dynamic Data Platform team, focusing on Analytics and Alerts. In this position, you will play a pivotal role in building and optimizing distributed systems that power our data analytics capabilities, providing real-time insights and alerts to enhance our customer experience.
Join Hilbert, a pioneering data science-driven growth engine that empowers B2C teams with predictive insights into user behaviors, revenue drivers, and sustainable growth strategies. Our innovative approach compresses lengthy decision-making processes into mere minutes.

Trusted by Fortune 10 enterprises and beloved brands like FreshDirect, Blank Street, and Levain Bakery, Hilbert is the backbone of their growth strategies. We are also collaborating with leading AI companies to push the boundaries of what’s possible.

We are seeking a talented Data Scientist who possesses a deep understanding of B2C business challenges, develops actionable models using real-world data, and delivers impactful analyses that facilitate significant growth outcomes, all with the initiative and urgency typical of a founder.

This is not a role where you simply receive tasks; you will take ownership of problems from start to finish, from problem framing and modeling to measuring impact, for enterprise clients where the stakes are high and feedback is rapid. If you understand the nuances of churn analysis for different sectors, can create effective recommendation systems from sparse data, and can clearly communicate your causal analysis to clients, we want to meet you.

ROLE OVERVIEW
You will closely collaborate with the founding team, engineering, product, and go-to-market teams to enhance the data science systems integral to Hilbert. Daily responsibilities include building models, conducting experiments, analyzing data, and producing analyses that influence key decisions. Our focus is B2C, and the challenges we tackle, such as demand forecasting, customer lifecycle management, personalization, and activation, require an individual who can translate business contexts into effective modeling choices. You will thrive in a high-autonomy, high-ambiguity environment where data is often messy, incomplete, or scarce.

Key Responsibilities:
- Develop ML models that enhance core product features: recommendation systems, search relevance, customer segmentation, demand forecasting, and activation optimization.
- Contribute to configurable, multi-tenant model architectures that adapt to various customer contexts and business needs, avoiding the need for custom solutions for each case.
- Build effective models using available data, leveraging limited, noisy, or sparse datasets while determining the appropriate level of complexity.
- Design and implement rigorous A/B tests and recognize when causal inference methods are necessary.
Cloudflare, Inc.
Join Cloudflare as a Distributed Systems Engineer focusing on our Data Platform, where you will play a pivotal role in developing analytics and alert systems that enhance our services. You will collaborate with a talented team to design scalable and efficient systems to manage and analyze vast amounts of data. Your work will directly impact the performance and reliability of our offerings, ensuring our customers have the best possible experience.
About Us
At XOXO AI, we are at the forefront of innovation, crafting intelligent interfaces that seamlessly integrate into everyday life. As a dynamic research lab of dedicated engineers, designers, and researchers, we tackle unique challenges that extend beyond the workplace.

Having achieved significant breakthroughs in infrastructure, architecture, and model layers, we are looking for passionate builders to help us realize our vision through the development of robust interface and application layers.

About the Role
We seek a talented Data/Machine Learning Engineer to establish our data infrastructure and production-ready ML systems, ensuring our product is responsive, dependable, and intelligent. This full-cycle role involves designing high-throughput pipelines, defining resilient data models, and deploying low-latency feature and model serving that can withstand real-world demands.

You will collaborate closely with our founders and the early engineering team to transition prototypes into production, transforming complex real-world signals into reliable datasets and real-time functionalities that enhance core product experiences.

What You’ll Do
- Develop and manage high-throughput batch and streaming pipelines for analytics, training, and product signals.
- Lead real-time feature pipelines and online feature serving for low-latency inference.
- Design and oversee dimensional data models, skillfully managing schema evolution to avoid disrupting downstream consumers.
- Optimize model serving infrastructure to meet stringent latency and reliability service level objectives (SLOs).
- Establish and enforce event schemas, telemetry standards, and data contracts across multiple teams.
- Collaborate with engineering, product, and research teams to translate ambiguous product requirements into measurable, sustainable systems.
Join Us in Building a Safer World.

At TRM Labs, we specialize in blockchain analytics and AI solutions aimed at assisting law enforcement, national security agencies, financial institutions, and cryptocurrency businesses in identifying, investigating, and preventing crypto-related fraud and financial crime. Our innovative platforms leverage blockchain intelligence and AI technology to trace funds, detect illicit activity, and construct comprehensive threat profiles. Trusted by leading organizations worldwide, TRM Labs is committed to enabling a safer and more secure environment for all.

Our AI Engineering Team is dedicated to pioneering next-generation AI applications, particularly in the realm of Large Language Models (LLMs) and agentic systems. Our goal is to develop resilient pipelines and high-performance infrastructure that facilitate the swift, safe, and scalable deployment of AI systems.

We manage extensive petabyte-scale pipelines, ensuring model serving with millisecond latency while providing the necessary observability and governance to make AI production-ready. Our team actively evaluates and integrates leading-edge tools in the LLM and agent space, including open-source stacks, vector databases, evaluation frameworks, and orchestration tools to accelerate TRM’s innovation pace.

As a Senior or Staff ML Systems Engineer – LLM, you will play a pivotal role in constructing and scaling our technical infrastructure for AI/ML systems. Your responsibilities will include:
- Creating reusable CI/CD workflows for model training, evaluation, and deployment, integrating tools such as Langfuse, GitHub Actions, and experiment tracking.
- Automating model versioning, approval processes, and compliance checks across various environments.
- Developing a modular and scalable AI infrastructure stack that encompasses vector databases, feature stores, model registries, and observability tools.
- Collaborating with engineering and data science teams to embed AI models and agents into real-time applications and workflows.
- Continuously assessing and incorporating state-of-the-art AI tools (e.g., LangChain, LlamaIndex, vLLM, MLflow, BentoML).
- Promoting AI reliability and governance while enabling experimentation, ensuring compliance, security, and continuous uptime.
- Enhancing AI/ML model performance and ensuring data accuracy and consistency, leading to improved model training and inference.
- Implementing infrastructure to facilitate both offline and online evaluation of LLMs and agents.