Research Engineer Infrastructure jobs in San Francisco – Browse 5,626 openings on RoboApply Jobs

Research Engineer Infrastructure jobs in San Francisco

Open roles matching “Research Engineer Infrastructure” in San Francisco. 5,626 active listings on RoboApply Jobs.


1 - 20 of 5,626 Jobs
Cognition
Full-time | On-site | San Francisco Bay Area

Join our dynamic team at Cognition as a Research Engineer specializing in Infrastructure. In this role, you will be at the forefront of cutting-edge research, contributing to innovative solutions that shape the future of our infrastructure projects. Your responsibilities will include conducting thorough research, analyzing data, and collaborating with cross-functional teams to implement effective strategies. We are looking for an individual who is passionate about technology and infrastructure, eager to solve complex problems, and ready to drive impactful results.

Apr 8, 2026
OpenAI
Full-time | On-site | San Francisco

OpenAI's research infrastructure group creates and maintains the backbone systems for advanced machine learning model training. This team often goes beyond conventional training methods, developing new infrastructure to support novel research at scale. Their work closely connects systems engineering with research progress, making it possible to run experiments that would otherwise be too slow or complex.

Role overview
The Research Infrastructure Engineer for Training Systems designs and improves the platforms that power large-scale ML training. This role bridges research concepts and the practical systems that make large model training possible. The work has a direct impact on model release timelines and requires building systems that perform reliably in demanding, real-world scenarios.

What you will do
- Build and maintain infrastructure for large-scale model training and experimentation
- Design APIs and interfaces to simplify complex training workflows and prevent misuse
- Enhance reliability, debuggability, and performance across training and data pipelines
- Troubleshoot issues involving Python, PyTorch, distributed systems, GPUs, networking, and storage
- Create tests, benchmarks, and diagnostic tools to catch regressions early

Requirements
- Interest in building systems that support new training methods, not just optimizing existing ones
- Strong instincts in systems engineering, especially regarding performance, reliability, and clean abstractions
- Experience designing APIs and interfaces for researchers and engineers
- Ability to work across ML research code and production infrastructure
- Enjoys evidence-based debugging using profiles, traces, logs, tests, and reproducible cases
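The "catch regressions early" work named above can be sketched as a minimal benchmark gate: compare the median step time of a candidate run against a stored baseline and flag slowdowns beyond a tolerance. The threshold and timing values are illustrative, not any team's actual tooling:

```python
import statistics

def detect_regression(baseline_times, current_times, tolerance=0.10):
    """Flag a performance regression when the median step time grows
    by more than `tolerance` relative to the baseline median."""
    base = statistics.median(baseline_times)
    cur = statistics.median(current_times)
    slowdown = (cur - base) / base
    return slowdown > tolerance, slowdown

# A run whose median step time grew 25% against baseline is flagged.
regressed, slowdown = detect_regression([1.00, 1.02, 0.98], [1.24, 1.26, 1.25])
```

Using the median rather than the mean makes the gate robust to a single noisy step, which matters when benchmarks share hardware.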

Apr 27, 2026
Thinking Machines Lab
Full-time | $350K/yr - $475K/yr | On-site | San Francisco

At Thinking Machines Lab, we are on a mission to empower humanity by advancing collaborative general intelligence. Our vision is to create a future where everyone has access to the knowledge and tools necessary to harness AI for their unique needs and objectives. We are a diverse team of scientists, engineers, and builders responsible for developing some of the most influential AI products on the market, such as ChatGPT and Character.ai. Our contributions extend to open-weight models like Mistral and popular open-source projects including PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role
We are seeking talented engineers to join our team and develop the libraries and tools that will accelerate research efforts at Thinking Machines. You will take charge of our internal infrastructure, creating evaluation libraries, reinforcement learning training libraries, and experiment tracking platforms, while building systems that enhance research velocity over time. This position emphasizes collaboration. You will work closely with researchers to identify bottlenecks and pain points, ensuring that they trust your systems to function seamlessly and find them enjoyable to use.

What You'll Do
- Design, build, and manage research infrastructure, including evaluation frameworks, RL training systems, experiment tracking platforms, visualization tools, and shared utilities.
- Develop high-throughput, scalable pipelines for distributed evaluation, reward modeling, and multimodal assessment.
- Establish systems for reproducibility, traceability, and robust quality control across research experiments and model training runs, implementing effective monitoring and observability.
- Collaborate directly with researchers to identify bottlenecks and unlock new capabilities, managing research tools like a product manager by proactively seeking feedback and tracking adoption.
- Work alongside infrastructure, data, and product teams to integrate tools across the technical stack.
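One building block of the experiment tracking and reproducibility work described above is giving every run a deterministic identity. A minimal sketch (the config keys are hypothetical) hashes a canonicalized config, so identical experiments always map to the same ID regardless of key order:

```python
import hashlib
import json

def run_id(config: dict) -> str:
    """Derive a deterministic run ID from an experiment config.
    Sorting keys canonicalizes the JSON so logically identical
    configs hash identically."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

cfg_a = {"lr": 3e-4, "batch_size": 256, "model": "demo-7b"}
cfg_b = {"model": "demo-7b", "batch_size": 256, "lr": 3e-4}  # same config, reordered
```

Real trackers also fold in code version and dataset hashes; this shows only the canonicalization idea.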

Feb 3, 2026
Thinking Machines Lab
Full-time | $350K/yr - $475K/yr | On-site | San Francisco

At Thinking Machines Lab, we are dedicated to empowering humanity by advancing collaborative general intelligence. Our vision is to create a future where everyone can leverage AI to meet their unique needs and aspirations. Our talented team comprises scientists, engineers, and innovators who have developed some of the most widely recognized AI products, including ChatGPT and Character.ai, alongside open-weight models like Mistral and popular open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Position
We are seeking a motivated Infrastructure Research Engineer to design, enhance, and scale the systems that underpin large AI models. Your contributions will significantly improve inference speed, cost-effectiveness, reliability, and reproducibility, allowing our teams to concentrate on enhancing model capabilities rather than dealing with bottlenecks. Our mission centers on delivering high-performance and efficient model inference to support real-world applications and accelerate research efforts. In this role, you will be responsible for the infrastructure that guarantees smooth operation for every experiment, evaluation, and deployment at scale.

Note: This is an evergreen role, kept open continuously for expressions of interest. We receive numerous applications and may not always have an immediate opening that aligns perfectly with your skills and experience. However, we encourage you to apply. We regularly review applications and reach out to candidates as new opportunities arise. Feel free to reapply as you gain more experience, but we kindly ask that you avoid applying more than once every six months. You may also notice postings for specific roles related to particular projects or teams, in which case you are welcome to apply directly in addition to this evergreen role.

What You Will Do
- Collaborate with researchers and engineers to transition cutting-edge AI models into production.
- Partner with research teams to ensure high-performance inference for innovative architectures.
- Design and implement new techniques, tools, and architectures that enhance performance, latency, throughput, and efficiency.
- Optimize our codebase and computing resources (e.g., GPUs) to maximize hardware FLOPs, bandwidth, and memory usage.
- Extend orchestration frameworks (e.g., Kubernetes, Ray, SLURM) for distributed inference, evaluation, and large-batch serving.
- Establish standards for reliability, observability, and reproducibility throughout the inference stack.
- Publish and share insights through internal documentation, open-source libraries, or technical reports that further the field of scalable AI infrastructure.
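The large-batch serving mentioned above usually involves grouping requests under a token budget so GPU time is amortized across prompts. A greedy sketch under assumed request IDs and token counts (real servers batch continuously and asynchronously):

```python
def batch_requests(requests, max_batch_tokens=8192):
    """Greedily pack (request_id, token_count) pairs into batches whose
    total token count stays under a budget; a new batch opens when the
    next request would overflow the current one."""
    batches, current, used = [], [], 0
    for req_id, n_tokens in requests:
        if current and used + n_tokens > max_batch_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(req_id)
        used += n_tokens
    if current:
        batches.append(current)
    return batches

batches = batch_requests([("r1", 4000), ("r2", 3000), ("r3", 3000), ("r4", 2000)])
```

The batch-token budget is the main throughput-versus-latency knob: a larger budget raises GPU utilization but delays the first request in each batch.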

Nov 27, 2025
Thinking Machines Lab
Full-time | $350K/yr - $475K/yr | On-site | San Francisco

At Thinking Machines Lab, we are committed to empowering humanity by advancing collaborative general intelligence. Our vision is to create a future where everyone has access to the knowledge and tools necessary to harness AI for their unique needs and aspirations. Our team comprises scientists, engineers, and builders who have developed some of the most utilized AI products, including ChatGPT and Character.ai, as well as open-weight models like Mistral. We also contribute to notable open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role
We are seeking a talented Infrastructure Research Engineer to enhance, scale, and fortify the systems supporting Tinker. This role will enable our internal teams and external clients to fine-tune models seamlessly, reliably, and cost-effectively. You will work at the intersection of large-scale training systems and product infrastructure, creating multi-tenant scheduling, storage, observability, and reliability features within a developer-friendly API. Your contributions will allow all Tinker users to concentrate on research and development without the burden of infrastructure concerns.

Note: This is an evergreen position that we keep open for ongoing interest. We receive numerous applications, and there may not always be a role that aligns perfectly with your skills and experience. We encourage you to apply, as we continuously review applications and will reach out as new opportunities arise. You are welcome to reapply after gaining more experience, but please refrain from applying more than once every six months. We also post specific roles for unique project or team needs, and you are welcome to apply directly to those in addition to this evergreen listing.

What You'll Do
- Design and implement distributed job orchestration, placement, preemption, and fair-share scheduling to enhance Tinker for multi-tenant workloads.
- Optimize GPU utilization, throughput, and reliability across clusters (including autoscaling, bin-packing, and quotas).
- Develop reusable frameworks and libraries to enhance Tinker's transparency, reproducibility, and performance.
- Collaborate with researchers and developer experience engineers to transform fine-tuning challenges into product features.
- Publish and disseminate insights through internal documentation, open-source libraries, or technical reports to advance the field of scalable AI infrastructure.
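The bin-packing mentioned above is classically approximated with first-fit-decreasing: place the largest GPU requests first, opening a new node only when no existing node has room. A minimal sketch with hypothetical job names and an assumed 8-GPU node size:

```python
def pack_jobs(gpu_requests, node_capacity=8):
    """First-fit-decreasing bin packing: assign each job (by GPU count)
    to the first node with enough free GPUs, opening nodes as needed.
    Returns {job: node_index} and the number of nodes used."""
    nodes = []       # free GPUs remaining on each open node
    placement = {}
    for job, need in sorted(gpu_requests.items(), key=lambda kv: -kv[1]):
        for i, free in enumerate(nodes):
            if free >= need:
                nodes[i] -= need
                placement[job] = i
                break
        else:
            nodes.append(node_capacity - need)
            placement[job] = len(nodes) - 1
    return placement, len(nodes)

placement, n_nodes = pack_jobs({"a": 6, "b": 4, "c": 2, "d": 4})
```

Sorting descending is what makes the heuristic effective: small jobs backfill the fragments left by large ones, so the 16 requested GPUs here fit on two nodes instead of three.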

Nov 27, 2025
Thinking Machines Lab
Full-time | $350K/yr - $475K/yr | On-site | San Francisco

At Thinking Machines Lab, our mission is to empower humanity by advancing collaborative general intelligence. We envision a future where everyone has access to the knowledge and tools necessary to harness AI for their unique needs and goals. Our team comprises scientists, engineers, and builders who have developed some of the most widely utilized AI products, such as ChatGPT and Character.ai, alongside open-weight models like Mistral, and popular open-source initiatives like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Position
We are seeking an Infrastructure Research Engineer to design and construct the foundational systems that facilitate the scalable and efficient training of large models for both deployment and research purposes. Your primary objective will be to streamline experimentation and training at Thinking Machines, enabling our research teams to concentrate on scientific advancements rather than system limitations. This role is a perfect match for an individual who possesses a strong blend of deep systems expertise and a keen interest in machine learning at scale. You will take full ownership of the training stack, ensuring that every GPU cycle contributes to scientific progress.

Note: This is an evergreen role that we keep open continuously for expressions of interest. We receive numerous applications, and there may not always be an immediate role that aligns perfectly with your experience and skills. However, we encourage you to apply. We regularly review applications and reach out to candidates as new opportunities arise. Feel free to reapply as you gain more experience, but please avoid applying more than once every six months. We may also post specific roles for individual projects or team needs, in which case you are welcome to apply directly alongside this evergreen role.

Key Responsibilities
- Design, implement, and optimize distributed training systems that scale across thousands of GPUs and nodes for extensive training workloads.
- Develop high-performance optimizations to maximize throughput and efficiency.
- Create reusable frameworks and libraries that enhance training reproducibility, reliability, and scalability for new model architectures.
- Establish standards for reliability, maintainability, and security, ensuring systems remain robust under rapid iteration.
- Collaborate with researchers and engineers to construct scalable infrastructure.
- Publish and disseminate findings through internal documentation, open-source libraries, or technical reports that contribute to the advancement of scalable AI infrastructure.

Nov 27, 2025
OpenAI
Full-time | On-site | San Francisco

About Our Innovative Team
Join the Workload team at OpenAI, where we are at the forefront of designing and managing the cutting-edge infrastructure that drives the training and inference of large language models (LLMs) at an unprecedented scale. Our systems are engineered to harmonize the complex processes of model training and serving, abstracting performance, parallelism, and execution across extensive GPU and accelerator networks. This robust foundation allows researchers to concentrate on elevating model capabilities, while we take care of the scalability, efficiency, and reliability needed to bring these advanced models to life.

Your Role and Responsibilities
We are seeking a talented engineer to design and implement the dataset infrastructure that will fuel OpenAI's next-generation training stack. Your primary focus will be on creating standardized dataset interfaces, scaling pipelines across thousands of GPUs, and proactively identifying and addressing performance bottlenecks. Collaboration with multimodal researchers and infrastructure teams will be key to ensuring that our datasets are unified, efficient, and user-friendly.

Key Responsibilities Include:
- Design and maintain standardized dataset APIs, including those for multimodal (MM) data that exceeds memory capacity.
- Develop proactive testing and validation pipelines for dataset loading at GPU scale.
- Work collaboratively to integrate datasets into training and inference pipelines, ensuring seamless user experiences.
- Document and maintain dataset interfaces to ensure they are discoverable, consistent, and easily adoptable by other teams.
- Establish validation systems to ensure datasets remain reproducible and unchanged once standardized.
- Identify and troubleshoot performance bottlenecks in distributed dataset loading, such as stragglers impacting global training speed.
- Create visualization and inspection tools to highlight errors, bugs, or bottlenecks in datasets.

Ideal Candidate Profile
- Possess strong engineering fundamentals and experience in distributed systems, data pipelines, or infrastructure.
- Have a proven track record in building APIs, modular code, and scalable abstractions, with a user-centric approach to design.
- Be adept at debugging performance issues across large-scale machine fleets.
- Demonstrate a passion for advancing data infrastructure to enhance research capabilities.
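The straggler problem named above arises because synchronous training advances at the pace of the slowest rank. A minimal detection sketch (rank IDs and timings are illustrative) flags any rank whose batch-load time far exceeds the fleet median:

```python
import statistics

def find_stragglers(load_times, factor=1.5):
    """Flag ranks whose data-load time exceeds `factor` times the fleet
    median; one slow loader stalls every rank at the next sync point."""
    med = statistics.median(load_times.values())
    return [rank for rank, t in load_times.items() if t > factor * med]

# Rank 3 is far slower than its peers and would gate global step time.
slow = find_stragglers({0: 0.21, 1: 0.20, 2: 0.22, 3: 0.95})
```

Comparing against the median rather than the mean keeps the threshold itself from being dragged upward by the straggler being hunted.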

Sep 18, 2025
Thinking Machines Lab
Full-time | $350K/yr - $475K/yr | On-site | San Francisco

At Thinking Machines Lab, our mission is to empower humanity by advancing collaborative general intelligence. We envision a future where everyone has access to the knowledge and tools necessary to make AI work for their individual needs and goals. Our team comprises scientists, engineers, and innovators who have developed some of the most widely adopted AI products, including ChatGPT and Character.ai, alongside open-weight models like Mistral, as well as popular open-source initiatives such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role
We are seeking a highly skilled infrastructure research engineer to architect and develop core systems that facilitate efficient large-scale model training, with a strong emphasis on numerics. You will enhance the numerical foundations of our distributed training stack, focusing on precision formats, kernel optimizations, and communication frameworks to ensure that training trillion-parameter models is stable, scalable, and fast. This position is perfect for an individual who excels at the intersection of research and systems engineering, a creator who comprehends both the mathematics of optimization and the practicalities of distributed computing.

Note: This is an evergreen role that remains open for ongoing expressions of interest. While we receive numerous applications and there may not always be an immediate opening that perfectly matches your skills and experience, we encourage you to apply. We continuously review applications and will contact applicants as new opportunities arise. You are welcome to reapply if you gain additional experience, but please refrain from applying more than once every six months. You may also notice postings for specific roles related to particular projects or teams; in those instances, you are welcome to apply for those positions in addition to the evergreen role.

What You'll Do
- Design and optimize distributed training infrastructure for large-scale LLMs, ensuring performance, stability, and reproducibility in multi-GPU and multi-node environments.
- Implement and assess low-precision numerics (e.g., BF16, MXFP8, NVFP4) to enhance efficiency while maintaining model quality.
- Develop kernels and communication primitives that leverage hardware-level support for mixed and low-precision arithmetic.
- Collaborate with research teams to co-design model architectures and training methodologies that align with new numeric formats and stability requirements.
- Prototype and benchmark scaling strategies, including data, tensor, and pipeline parallelism that integrate precision-adaptive computation and quantized communication.
- Contribute to the design of our internal orchestration and monitoring frameworks.
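Of the formats listed above, BF16 is the simplest to reason about: it keeps float32's 8-bit exponent but truncates the mantissa to 7 bits. A small sketch simulates the effect on CPU (real hardware typically rounds to nearest even rather than truncating, so this slightly overstates the error):

```python
import struct

def to_bf16(x: float) -> float:
    """Simulate BF16 by keeping only the top 16 bits of the float32 bit
    pattern (exponent intact, mantissa cut to 7 bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# BF16 preserves float32's dynamic range but only ~2-3 decimal digits
# of precision, which is why loss scaling matters less than with FP16.
pi_bf16 = to_bf16(3.14159)
```

Values like 1.0, whose low mantissa bits are already zero, pass through exactly; most others pick up a relative error of up to about 2^-8.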

Nov 27, 2025
Anthropic
Full-time | On-site | San Francisco, CA

Anthropic is hiring a Research Engineer focused on Reinforcement Learning Infrastructure and Reliability. This role is based in San Francisco, CA.

Role overview
This position centers on building and maintaining systems essential to AI research. The work supports Anthropic's reinforcement learning efforts, with an emphasis on infrastructure stability and performance.

What you will do
- Collaborate with a team of specialists to develop and support key systems for AI research.
- Improve the reliability and efficiency of infrastructure supporting reinforcement learning projects.
- Apply technical expertise to advance Anthropic's AI capabilities.

Team environment
Work alongside engineers and researchers dedicated to advancing AI reliability and performance. The team values collaboration and aims to enable new research while maintaining the stability of Anthropic's core systems.

Apr 23, 2026
Thinking Machines Lab
Full-time | $350K/yr - $475K/yr | On-site | San Francisco

At Thinking Machines Lab, our mission is to empower humanity by advancing collaborative general intelligence. We're dedicated to crafting a future where everyone can harness the power of AI to meet their unique needs and aspirations. Our team comprises scientists, engineers, and innovators who have developed some of the most widely utilized AI products, including ChatGPT and Character.ai, as well as open-weight models like Mistral, in addition to renowned open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role
We are seeking a talented Infrastructure Research Engineer to architect and develop the foundational systems that facilitate the scalable and efficient training of large models using reinforcement learning. This position exists at the crossroads of research and large-scale systems engineering, requiring a professional who not only comprehends the algorithms behind reinforcement learning but also appreciates the practicalities of distributed training and inference at scale. You will have a diverse set of responsibilities, from optimizing rollout and reward pipelines to enhancing the reliability, observability, and orchestration of systems. Collaboration with researchers and infrastructure teams will be essential to ensure reinforcement learning is stable, rapid, and production-ready.

Note: This is an evergreen role that we maintain on an ongoing basis for expressions of interest. Due to the high volume of applications we receive, there may not always be an immediate position that aligns perfectly with your skills and experience. We encourage you to apply, as we continuously review applications and reach out to candidates when new opportunities arise. You may reapply after gaining more experience, but please refrain from applying more than once every six months. Additionally, you may notice postings for specific roles that cater to unique project or team needs; in those circumstances, you are welcome to apply directly alongside this evergreen role.

What You'll Do
- Design, implement, and optimize the infrastructure that supports large-scale reinforcement learning and post-training workloads.
- Enhance the reliability and scalability of the RL training pipeline, including distributed RL workloads and training throughput.
- Create shared monitoring and observability tools to ensure high uptime, debuggability, and reproducibility of RL systems.
- Work closely with researchers to translate algorithmic concepts into production-quality training pipelines.
- Develop evaluation and benchmarking infrastructure to assess model performance based on helpfulness, safety, and factual accuracy.
- Publish and disseminate insights through internal documentation, open-source libraries, or technical reports that contribute to the advancement of scalable AI infrastructure.
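A core primitive inside any rollout-and-reward pipeline like the one described above is turning per-step rewards into discounted returns. A minimal sketch (rewards and gamma are illustrative) scans the rollout backwards so each step reuses its successor's return:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute per-step discounted returns for one rollout.
    Scanning backwards makes this O(n): each step's return is
    r_t + gamma * (return of the next step)."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

# A sparse reward at the final step propagates backwards, discounted.
rets = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
```

Production RL stacks layer batching, advantage estimation, and normalization on top, but this backward scan is the piece everything else vectorizes.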

Nov 27, 2025
Thinking Machines Lab
Full-time | $350K/yr - $475K/yr | On-site | San Francisco

At Thinking Machines Lab, our ambition is to enhance human potential by advancing collaborative general intelligence. We envision a future where individuals have the tools and knowledge to harness AI for their distinct requirements and aspirations. Our team comprises dedicated scientists, engineers, and innovators who have contributed to some of the most renowned AI products, including ChatGPT and Character.ai, along with open-weight models like Mistral, and influential open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role
We are seeking an Infrastructure Research Engineer to architect, optimize, and sustain the computational frameworks that facilitate large-scale language model training. You will create high-performance machine learning kernels (e.g., CUDA, CuTe, Triton), enable effective low-precision arithmetic operations, and enhance the distributed computing infrastructure essential for training expansive models. This position is ideal for an engineer who thrives in close collaboration with hardware and research disciplines. You will partner with researchers and systems architects to merge algorithmic design with hardware efficiency. Your responsibilities will include prototyping new kernel implementations, evaluating performance across various hardware generations, and helping to establish the numerical and parallelism strategies crucial for scaling next-generation AI systems.

Note: This is an evergreen role that remains open continuously for expressions of interest. We receive numerous applications, and there may not always be an immediate opportunity that aligns with your qualifications. However, we encourage you to apply, as we regularly assess applications and will reach out as new positions become available. You are also welcome to reapply after gaining additional experience, but please refrain from applying more than once every six months. Additionally, you may notice postings for specific roles catering to particular projects or team needs. In such cases, you are encouraged to apply directly alongside this evergreen listing.

What You'll Do
- Design and develop custom ML kernels (e.g., CUDA, CuTe, Triton) for key LLM operations such as attention, matrix multiplication, gating, and normalization, optimized for contemporary GPU and accelerator architectures.
- Conceptualize compute primitives aimed at alleviating memory bandwidth bottlenecks and enhancing kernel compute efficiency.
- Collaborate with research teams to synchronize kernel-level optimizations with model architecture and algorithmic objectives.
- Create and maintain a library of reusable kernels and performance benchmarks that serve as the foundation for internal model training.
- Contribute to the stability and scalability of our infrastructure, ensuring it meets the growing demands of AI development.
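The memory-bandwidth bottlenecks named above are usually diagnosed with a roofline-style check: a kernel's arithmetic intensity (FLOPs per byte moved) is compared against the hardware's compute-to-bandwidth ratio. A small sketch with illustrative hardware numbers:

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def is_memory_bound(flops, bytes_moved, peak_flops, peak_bw):
    """A kernel is memory-bound when its intensity falls below the
    machine balance point (peak FLOP/s divided by peak bytes/s)."""
    return arithmetic_intensity(flops, bytes_moved) < peak_flops / peak_bw

# Elementwise add: 1 FLOP per 12 bytes (two fp32 reads + one write).
# Illustrative machine: 100 TFLOP/s compute, 2 TB/s bandwidth.
mem_bound = is_memory_bound(flops=1, bytes_moved=12,
                            peak_flops=100e12, peak_bw=2e12)
```

This is why kernel fusion pays off: fusing elementwise ops with the matmuls around them raises intensity by removing round trips to memory.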

Nov 27, 2025
OpenAI
Full-time | On-site | San Francisco

About the Team
Join the innovative Frontier Systems team at OpenAI, where we design, implement, and maintain the world's largest supercomputers, essential for advancing our most groundbreaking model training initiatives. We transform data center blueprints into operational systems while crafting the software necessary for executing large-scale frontier model trainings. Our mission is to establish, stabilize, and ensure the reliability and efficiency of these hyperscale supercomputers throughout the training of our frontier models.

About the Role
We are seeking passionate engineers to manage the next generation of compute clusters that underpin OpenAI's frontier research. This position merges distributed systems engineering with practical infrastructure work across our expansive data centers. You will scale Kubernetes clusters to unprecedented levels, automate bare-metal setups, and create the software layer that simplifies the complexity of numerous nodes across various data centers. Your work will be at the crossroads of hardware and software, where speed and reliability are paramount. Be prepared to oversee dynamic operations, swiftly identify and resolve pressing issues, and constantly elevate the standards for automation and uptime.

In this role, you will:
- Provision and scale extensive Kubernetes clusters, including automation for deployment, bootstrapping, and lifecycle management
- Create software abstractions that integrate multiple clusters and provide a cohesive interface for training workloads
- Oversee node deployment from bare metal to firmware upgrades, ensuring rapid, repeatable setups at scale
- Enhance operational metrics by reducing cluster restart times (e.g., from hours to minutes) and expediting firmware and OS upgrade cycles
- Integrate networking and hardware health systems to ensure end-to-end reliability across servers, switches, and data center infrastructure
- Develop monitoring and observability systems to identify issues early and maintain cluster stability under high loads

You might thrive in this role if you:
- Have extensive experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environments
- Possess strong programming skills in languages relevant to cloud and infrastructure management
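Lifecycle work like the firmware and OS upgrade cycles described above is typically done in waves gated on fleet health: upgrade a few healthy nodes at a time, and pause if too much of the fleet is already degraded. A minimal sketch with hypothetical node names and thresholds:

```python
def next_upgrade_wave(node_health, wave_size=2, min_healthy=0.9):
    """Pick the next batch of healthy nodes to upgrade, but hold the
    rollout (return an empty wave) when the healthy fraction of the
    fleet is below a safety threshold."""
    healthy = [name for name, ok in sorted(node_health.items()) if ok]
    if len(healthy) / len(node_health) < min_healthy:
        return []  # too much of the fleet is degraded; do not add churn
    return healthy[:wave_size]

wave = next_upgrade_wave({"n1": True, "n2": True, "n3": True, "n4": True})
held = next_upgrade_wave({"n1": True, "n2": True, "n3": True, "n4": True, "n5": False})
```

The gate is what keeps an upgrade from compounding an outage: taking nodes down for maintenance is only safe when spare healthy capacity exists.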

Nov 7, 2024
Sygaldry Technologies
ML Infrastructure Engineer
Full-time | On-site | San Francisco

About Sygaldry Technologies
Sygaldry Technologies develops quantum-accelerated AI servers in San Francisco, focusing on faster AI training and inference. By combining quantum technology with artificial intelligence, the team addresses challenges in computing costs and energy efficiency. Their AI servers integrate multiple qubit types within a fault-tolerant system, aiming for a balance of cost, scalability, and speed. The company values optimism, rigor, and a drive to solve complex problems in physics, engineering, and AI.

Role Overview: ML Infrastructure Engineer
The ML Infrastructure Engineer joins the AI & Algorithms team, which includes research scientists, applied mathematicians, and quantum algorithm specialists. This role centers on building and maintaining the compute infrastructure that powers advanced research. The systems you build will support reliable GPU access, reproducible experiments, and scalable workloads, so researchers can focus on their core work without needing deep cloud expertise. Expect to design and manage compute platforms for a range of tasks, including quantum circuit simulation, large-scale numerical optimization, model training, tensor network contractions, and high-throughput data generation. These workloads span multiple cloud providers and on-premises GPU servers.

Key Responsibilities
- Develop compute abstractions for diverse workloads, such as GPU-accelerated simulations, distributed training, high-throughput CPU jobs, and interactive analyses using frameworks like PyTorch and JAX.
- Set up infrastructure to support experiment tracking and reproducibility.
- Create developer tools that make cloud computing feel local, streamlining environment setup, job submission, monitoring, and artifact management.
- Scale experiments from single-GPU prototypes to large, multi-node production runs.

Multi-Cloud GPU Orchestration
- Design orchestration strategies for workloads across multiple cloud providers, optimizing job routing for cost, availability, and capability.
- Monitor and improve cloud spending, keeping track of credit balances, burn rates, and expiration dates.
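The cost-and-availability routing described above reduces, in its simplest form, to picking the cheapest provider with enough free capacity. A minimal sketch (provider names, capacities, and prices are invented for illustration):

```python
def route_job(gpus_needed, providers):
    """Pick the cheapest provider that can host the job right now.
    Each provider is a dict with 'name', 'free_gpus', and
    'usd_per_gpu_hr'; returns None when nothing has capacity."""
    eligible = [p for p in providers if p["free_gpus"] >= gpus_needed]
    if not eligible:
        return None
    return min(eligible, key=lambda p: p["usd_per_gpu_hr"])["name"]

# cloud-b is cheapest but lacks capacity, so the job lands on-prem.
choice = route_job(16, [
    {"name": "cloud-a", "free_gpus": 64, "usd_per_gpu_hr": 2.40},
    {"name": "cloud-b", "free_gpus": 8,  "usd_per_gpu_hr": 1.10},
    {"name": "on-prem", "free_gpus": 32, "usd_per_gpu_hr": 1.80},
])
```

A production router would also weigh interconnect, data locality, and expiring credit balances, but capacity-then-cost is the usual base case.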

Apr 14, 2026
primeintellect
Full-time | On-site | San Francisco

Join primeintellect as a Research Engineer focused on Reinforcement Learning Infrastructure. In this role, you will be instrumental in advancing our cutting-edge AI technologies. You will collaborate with interdisciplinary teams to develop robust frameworks that enhance machine learning capabilities and drive innovation. As a key player in our engineering team, you will work on designing, implementing, and optimizing systems that support reinforcement learning algorithms. Your contributions will directly impact the efficiency and effectiveness of our AI solutions.

Mar 27, 2026
OpenAI
Full-time | On-site | San Francisco

Team and Platform Focus
The Compute Infrastructure team at OpenAI designs, builds, and maintains the systems that support AI research at scale. This work brings together accelerators, CPUs, networking, storage, data centers, orchestration software, agent infrastructure, developer tools, and observability. The aim is to create a reliable, unified experience for researchers and product teams across the company. Projects span the full stack: capacity planning, cluster lifecycle management, bare-metal automation, and distributed systems. The team manages Kubernetes scheduling, system optimization, high-performance networking, storage, fleet health, reliability, workload profiling, benchmarking, and improvements to the developer experience. Even small improvements in communication, scheduling, hardware efficiency, or debugging can significantly accelerate research. OpenAI matches engineers to areas within Compute Infrastructure that align with their skills and interests.

Role Overview
This Software Engineer role centers on building and evolving the compute platform that supports OpenAI's research and products. Candidates may bring expertise in low-level systems, high-performance computing, distributed infrastructure, reliability, CaaS, agent infrastructure, developer platforms, tooling, or infrastructure user experience. The most important qualities are strong analytical skills, the ability to write resilient code, and a collaborative approach that helps colleagues move faster and with more confidence.

What You Will Work On
- Working close to hardware or at the user interaction layer
- Developing CaaS and agent infrastructure
- Managing control and data planes that connect the system
- Bringing new supercomputing capabilities online
- Optimizing training workloads through profiler traces and benchmarks
- Improving NCCL and collective communication
- Analyzing GPUs, NICs, topology, firmware, thermal dynamics, and failure modes
- Designing abstractions to unify diverse clusters into a single platform

Areas of Expertise
No one is expected to cover every area listed. Some engineers focus on system performance, kernel or runtime behavior, large-scale networking protocols, RDMA, NCCL, GPU hardware, benchmarking, scheduling, or hardware reliability. Others improve the platform's usability through APIs, tools, workflows, and developer experience. The team values strong engineering judgment and a drive to advance the field.

Apr 27, 2026
Apply
Anthropic
On-site|San Francisco, CA | New York City, NY | Seattle, WA

Join Anthropic as an Infrastructure Engineer on our Sandboxing team, where you'll play a pivotal role in building and scaling secure execution environments for AI research. Your expertise will ensure that researchers can safely experiment with AI-generated code in isolated settings. As our models advance, the infrastructure that supports these environments becomes increasingly vital. Your contributions will help maintain security and reliability at scale, directly aligning with our mission to develop trustworthy and beneficial AI systems.
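The sandboxing idea described here, at its very simplest, is running untrusted code in a separate process with a hard wall-clock limit. The following is a toy illustration only, not Anthropic's sandbox; production isolation adds namespaces, seccomp filters, filesystem and resource limits, and the `run_sandboxed` name is hypothetical:

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout_s: float = 2.0) -> dict:
    """Run untrusted Python source in a child process with a timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env vars and user site dirs
            capture_output=True, text=True, timeout=timeout_s,
        )
        return {"stdout": proc.stdout, "returncode": proc.returncode, "timed_out": False}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "returncode": None, "timed_out": True}

print(run_sandboxed("print(2 + 2)"))
```

A process boundary plus a timeout contains runaway loops but not, say, network access or disk writes, which is why real sandboxes layer OS-level isolation on top.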

Jan 29, 2026
Apply
Anthropic
Full-time|On-site|San Francisco, CA

Join Anthropic as a Research Engineer focusing on Economic Research. In this role, you will leverage your analytical skills to conduct in-depth economic analysis and contribute to innovative projects aimed at enhancing our understanding of economic models and their implications.

Mar 12, 2026
Apply
OpenAI
Full-time|Hybrid|San Francisco

About the Team

Join the innovative Post-Training team at OpenAI, where we focus on refining and elevating pre-trained models for deployment in ChatGPT, our API, and future products. Collaborating closely with various research and product teams, we conduct crucial research that prepares our models for real-world deployment to millions of users, ensuring they are safe, efficient, and reliable.

About the Role

As a Research Engineer / Scientist, you will spearhead the research and development of enhancements to our models. Our work intersects reinforcement learning and product development, aiming to create cutting-edge solutions. We seek passionate individuals with robust machine learning engineering skills and research experience, particularly with innovative and powerful models. The ideal candidate will be driven by a commitment to product-oriented research.

This position is located in San Francisco, CA, and follows a hybrid work model requiring three days in the office each week. Relocation assistance is available for new employees.

In this role, you will:
Lead and execute a research agenda aimed at enhancing model capabilities and performance.
Work collaboratively with research and product teams to empower customers to optimize their models.
Develop robust evaluation frameworks to monitor and assess modeling advancements.
Design, implement, test, and debug code across our research stack.

You may excel in this role if you:
Possess a deep understanding of machine learning and its applications.
Have experience with relevant models and methodologies for evaluating model improvements.
Are adept at navigating large ML codebases for debugging purposes.
Thrive in a fast-paced and technically intricate environment.

About OpenAI

OpenAI is a pioneering AI research and deployment organization dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.
We are committed to pushing the boundaries of AI capabilities while prioritizing safety and human-centric values in our products. Our mission is to embrace diverse perspectives, voices, and experiences that represent the full spectrum of humanity, as we strive for a future where AI is a powerful ally for everyone.
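The evaluation frameworks this listing mentions reduce, at their core, to scoring a model function against labeled examples and tracking failures. A generic sketch with hypothetical names (`evaluate`, the toy model), not OpenAI's internal stack:

```python
def evaluate(model_fn, dataset):
    """Score model_fn on (prompt, expected) pairs; return accuracy and failures."""
    failures = []
    for prompt, expected in dataset:
        prediction = model_fn(prompt)
        if prediction != expected:
            # Keep the full triple so regressions are debuggable, not just counted.
            failures.append((prompt, expected, prediction))
    accuracy = 1.0 - len(failures) / len(dataset)
    return {"accuracy": accuracy, "failures": failures}

# Hypothetical "model": uppercase the prompt.
report = evaluate(str.upper, [("ab", "AB"), ("cd", "CD"), ("ef", "EF")])
print(report["accuracy"])  # → 1.0
```

Recording the failing cases alongside the aggregate score is what makes such a harness useful for monitoring modeling advancements rather than just ranking them.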

Dec 1, 2025
Apply
OpenAI
Full-time|Hybrid|San Francisco

About Our Team

Join the forefront of AI innovation with the RL and Reasoning team at OpenAI. Our team is dedicated to advancing reinforcement learning research and has pioneered transformative projects, including o1 and o3. We are committed to pushing the limits of generative models while ensuring their scalable deployment.

About the Role

As a Research Engineer/Research Scientist at OpenAI, you will play a pivotal role in enhancing AI alignment and capabilities through state-of-the-art reinforcement learning techniques. Your contributions will be essential in training intelligent, aligned, and versatile agents that power various AI models. We seek individuals with a solid foundation in reinforcement learning research, agile coding skills, and a passion for rapid iteration.

This position is located in San Francisco, CA, and follows a hybrid work model of three days in the office per week. We also provide relocation assistance for new hires.

You may excel in this role if:
You are enthusiastic about being at the cutting edge of RL and language model research.
You take initiative, owning ideas and driving them to fruition.
You value principled methodologies, conducting simple experiments in controlled environments to draw trustworthy conclusions.
You thrive in a fast-paced, complex technical environment where rapid iteration is essential.
You are adept at navigating extensive ML codebases to troubleshoot and enhance them.
You possess a profound understanding of machine learning and its applications.

About OpenAI

OpenAI is a pioneering AI research and deployment organization committed to ensuring that general-purpose artificial intelligence serves the greater good for humanity. We strive to push the boundaries of AI system capabilities while prioritizing safe deployment through our innovative products.
We recognize AI as a powerful tool that must be developed with safety and human-centric principles, embracing diverse perspectives to reflect the full spectrum of humanity.

We are proud to be an equal opportunity employer, welcoming applicants from all backgrounds without discrimination based on race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or any other legally protected characteristic.

May 14, 2025
Apply
OpenAI
Full-time|On-site|San Francisco

About Our Team

The Infrastructure Engineering team operates within the IT department, dedicated to the reliable construction, deployment, and management of critical on-premises and hybrid environments that empower our internal services and vital research and development projects. This newly established team is committed to implementing rigorous Site Reliability Engineering (SRE) practices in environments where uptime, safety, recoverability, and security are paramount. We aim to replace unique, one-off infrastructure with standardized infrastructure-as-code components that enhance reliability and operational efficiency as OpenAI continues to grow.

About This Role

We are in search of an Infrastructure Engineering Lead who will architect, build, and maintain reliable, secure, and scalable infrastructure that supports identity, access, endpoint, and shared platform services throughout the organization. You will take full ownership of infrastructure and identity systems from conceptual design and provisioning to policy enforcement, upgrades, recovery, and ongoing operations.
Your goal will be to develop robust, production-grade platforms that minimize operational hurdles, enforce security by default, and empower teams to work more effectively and confidently. This position is ideal for a senior engineer who excels in navigating ambiguity, relishes the challenge of overseeing complex systems from start to finish, and enhances reliability and security by transforming fragile implementations into standardized, repeatable infrastructure. This role is based at our San Francisco headquarters and requires in-office attendance.

Key Responsibilities:
Define and refine infrastructure patterns for on-prem and hybrid environments, including self-hosted platforms, vendor-supported systems, and lab settings.
Establish standardized, production-grade deployment and operational models that replace custom-built solutions.
Collaborate with IT, Security, Identity, and Network teams to ensure infrastructure is designed to meet reliability, security, and access standards.
Design and enhance the production architecture for Identity and Access Management (IAM) adjacent platforms, such as Microsoft Entra, utilizing SRE principles.
Develop common management protocols and shared resources within Azure subscriptions to ensure uniformity and policy compliance in operations.
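The infrastructure-as-code approach this listing describes boils down to declaring desired state and reconciling reality against it. A toy sketch of that reconcile loop, with hypothetical resource names and no ties to any specific tool or cloud API:

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Compute the actions needed to move actual resource state to desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            actions.append(("update", name, spec))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name, None))
    return actions

# Hypothetical fleet: vm-a matches, vm-b is missing, vm-c is orphaned.
plan = reconcile(
    {"vm-a": {"size": "small"}, "vm-b": {"size": "large"}},
    {"vm-a": {"size": "small"}, "vm-c": {"size": "small"}},
)
print(plan)
```

Because the same declaration always produces the same plan, this loop is idempotent: rerunning it against an already-converged fleet yields no actions, which is what makes standardized components safer than one-off manual changes.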

Jan 30, 2026
