Software Engineer, Frontier Clusters Infrastructure

OpenAI, San Francisco

On-site Full-time

Experience Level

Mid to Senior

Qualifications

Proven expertise in operating and scaling Kubernetes clusters or equivalent container orchestration systems in large-scale environments. Strong programming skills in relevant languages such as Python, Go, or similar. Experience with bare-metal provisioning and management. Familiarity with networking and data center infrastructure. Excellent problem-solving skills and the ability to work in fast-paced environments.

About the job

About the Team

Join the innovative Frontier Systems team at OpenAI, where we design, implement, and maintain the world's largest supercomputers, essential for advancing our most groundbreaking model training initiatives.

We transform data center blueprints into operational systems while crafting the software necessary for executing large-scale frontier model trainings.

Our mission is to establish, stabilize, and ensure the reliability and efficiency of these hyperscale supercomputers throughout the training of our frontier models.

About the Role

We are seeking passionate engineers to manage the next generation of compute clusters that underpin OpenAI’s frontier research.

This position merges distributed systems engineering with practical infrastructure work across our expansive data centers. You will scale Kubernetes clusters to unprecedented levels, automate bare-metal setups, and create the software layer that simplifies the complexity of numerous nodes across various data centers.

Your work will be at the crossroads of hardware and software, where speed and reliability are paramount. Be prepared to oversee dynamic operations, swiftly identify and resolve pressing issues, and constantly elevate the standards for automation and uptime.

In this role, you will:

Provision and scale extensive Kubernetes clusters, including automation for deployment, bootstrapping, and lifecycle management
Create software abstractions that integrate multiple clusters and provide a cohesive interface for training workloads
Oversee node deployment from bare metal to firmware upgrades, ensuring rapid, repeatable setups at scale
Enhance operational metrics by reducing cluster restart times (e.g., from hours to minutes) and expediting firmware and OS upgrade cycles
Integrate networking and hardware health systems to ensure end-to-end reliability across servers, switches, and data center infrastructure
Develop monitoring and observability systems to identify issues early and maintain cluster stability under high loads

You might thrive in this role if you:

Have extensive experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environments
Possess strong programming skills in languages relevant to cloud and infrastructure management

About OpenAI

At OpenAI, we are at the forefront of artificial intelligence research, dedicated to advancing technology for the benefit of humanity. Our Frontier Systems team is pivotal in pushing the boundaries of what's possible with supercomputing, creating scalable and efficient systems that empower our groundbreaking AI models.

Similar jobs

Software Engineer, Frontier Clusters Infrastructure

OpenAI

Full-time|On-site|San Francisco

Nov 7, 2024

Software Engineer, Frontier Systems

OpenAI

Full-time|On-site|San Francisco

About Our Team

The Frontier Systems team at OpenAI is at the forefront of technology, responsible for creating, deploying, and maintaining some of the world's largest supercomputers. These supercomputers are pivotal for training our most advanced AI models, pushing the boundaries of innovation.

We transform sophisticated data center designs into operational systems and develop the software infrastructure necessary for extensive frontier model training. Our goal is to ensure these hyperscale supercomputers operate reliably and efficiently, supporting groundbreaking AI research.

About the Role

As a key member of the Frontier Systems team, you will be instrumental in designing the critical infrastructure that ensures our supercomputers function seamlessly for pioneering AI research. In this role, you'll address system-level challenges and implement automation solutions that minimize disruptions during large-scale training processes.

Your responsibilities will encompass end-to-end ownership of your projects, allowing you to make significant contributions to our mission. This position is ideal for individuals who excel in diagnosing complex system issues and crafting automation strategies to proactively resolve problems across a vast network of machines.

Your Responsibilities Include:

Enhancing system health checks to maintain the stability of our hyperscale supercomputers during model training.
Conducting in-depth investigations into hardware failures and system-level bugs to uncover root causes.
Developing automation tools that monitor and resolve issues across thousands of systems, enabling uninterrupted research progress.

You May Be a Great Fit If You Possess:

7+ years of hands-on experience in software engineering.
Strong proficiency in Python and shell scripting.
Expertise in analyzing complex data sets using SQL, PromQL, Pandas, or other relevant tools.
Experience in creating reproducible analyses.
A solid balance of skills in both building and operationalizing systems.
Prior experience with hardware is not a prerequisite for this role.

Preferred Qualifications:

Familiarity with the intricacies of hardware components, protocols, and Linux tools (e.g., PCIe, Infiniband, networking, power management, kernel performance tuning).
Experience with system optimization and performance tuning.

May 9, 2025

Software Engineer, Frontier AI Infrastructure

Scale AI

Full-time|$138K/yr - $259.4K/yr|On-site|San Francisco, CA; St. Louis, MO; New York, NY; Washington, DC

Scale AI is on the lookout for an exceptionally talented and driven Software Engineer, Frontier AI Infrastructure to become an integral part of our innovative Public Sector Engineering team. In this role, you will take charge of the model inference layer, enabling cutting-edge AI models, troubleshooting the latest AI tools, managing networking tasks, addressing latency issues, and monitoring pricing and usage metrics for AI models. You will spearhead technical discussions with cloud vendors and clients to fulfill critical contracts and resolve platform challenges. Additionally, you will collaborate closely with Product teams to anticipate feature requirements, transitioning from reactive 'infra-only debugging' to proactive integration testing.

Your Responsibilities Include:

Designing and implementing secure, scalable backend systems tailored for Public Sector clients, utilizing Scale's advanced cloud-native AI infrastructure.
Owning services or systems while defining long-term health objectives and enhancing the health of related components.
Redesigning the architecture to operate in compliant or restrictive environments, which entails creating swappable components (authentication, storage, logging) to adhere to government and security regulations without compromising product integrity.
Collaborating with Product teams to develop integration tests that identify issues early, shifting focus from 'infra-only debugging' to preventing upstream failures.
Actively participating in customer engagements, liaising with stakeholders to comprehend requirements and deliver innovative solutions.
Contributing to the platform roadmap and product strategy for Scale AI's Public Sector division, playing a vital role in shaping the future trajectory of our offerings.

Mar 26, 2026

Software Engineer, Infrastructure

Sierra

Full-time|On-site|San Francisco, CA

About Us

At Sierra, we are revolutionizing the way businesses engage with their customers by building a cutting-edge platform that harnesses the power of AI. Our headquarters is located in the vibrant city of San Francisco, with additional offices expanding in Atlanta, New York, London, France, Singapore, and Japan.

Our company culture is deeply rooted in our core values: Trust, Customer Obsession, Craftsmanship, Intensity, and Family. These principles guide our actions and foster an environment where innovation thrives.

Sierra was co-founded by visionary leaders Bret Taylor, who currently serves as the Board Chair of OpenAI and has a rich history with Salesforce and Facebook, and Clay Bavor, who previously led Google Labs and spearheaded initiatives like Google Lens and Project Starline.

Your Role

As a Software Engineer focusing on Infrastructure at Sierra, you will play a pivotal role in designing, constructing, and maintaining the foundational systems that empower our AI platform. Your expertise will ensure that our infrastructure is not only secure and reliable but also scalable, allowing product teams to execute their work with agility and confidence.

Guarantee the reliability, scalability, and performance of our platform and LLM inference serving in response to increasing traffic demands.
Develop and oversee cloud infrastructure using Terraform to create secure, scalable, and reproducible environments.
Establish and manage a self-service infrastructure platform to empower engineering teams in deploying and operating services independently.
Take ownership of and improve CI/CD pipelines and release management processes, facilitating rapid and reliable deployments across Sierra’s platform.
Design and manage distributed systems utilizing distributed databases, retrieval systems, and machine learning models.
Develop and sustain core data serving abstractions along with essential authentication and security features (SSO, RBAC, authentication controls).
Effectively navigate and integrate our technology stack with enterprise customer environments in a scalable and maintainable manner.

Oct 15, 2025

Infrastructure Software Engineer

Exa

Full-time|On-site|San Francisco, California

At Exa, we are on a mission to create a cutting-edge search engine from the ground up, designed to cater to the diverse needs of AI applications. Our team is building a robust infrastructure that enables us to crawl the internet, train advanced embedding models for indexing, and develop high-performance vector databases using Rust. Additionally, we manage a significant $5M H200 GPU cluster that powers tens of thousands of machines.

The Infrastructure Team at Exa is responsible for developing the essential tools and infrastructure that support our entire system. We are looking for talented infrastructure engineers to help us scale our capabilities rapidly. Your work could involve orchestrating GPU clusters with Kubernetes, implementing map-reduce batch jobs on Ray, or creating top-tier observability tools that set industry standards.

Sep 3, 2025

Senior Software Engineer, Infrastructure

Serval

Full-time|On-site|San Francisco

Who We Are

Serval is an innovative AI-driven automation platform redefining operational efficiency for enterprises. Our intelligent agents seamlessly comprehend and execute real-world workflows, replacing outdated manual processes with adaptive, self-learning software. Since our inception in early 2024, we have garnered the trust of industry leaders such as General Motors, Notion, Perplexity, Vercel, Mercor, LangChain, and Verkada, streamlining high-volume operational tasks across their organizations.

At the heart of Serval is a cutting-edge agentic AI platform that transforms natural language into actionable workflows. Our agents not only respond to queries but also reason, act across various systems, and continuously enhance their performance. What started as a solution for operational tasks has rapidly expanded into a versatile AI automation layer utilized across IT, HR, Finance, Security, Legal, and Engineering sectors.

Our mission is to eradicate repetitive, manual tasks within enterprises, empowering teams through intelligent automation. In the long run, we aim to establish a universal AI operations layer, a system of agents that integrates across business functions, maintaining the momentum of modern companies.

We are proud to be backed by renowned investors including Sequoia Capital, Redpoint Ventures, Meritech, First Round, General Catalyst, and Elad Gil, and founded by seasoned product and engineering leaders from Verkada.

Role Overview

As a Senior Software Engineer in Infrastructure at Serval, you will be pivotal in developing and scaling the core systems that empower our AI agents and workflow automation platform. A crucial aspect of this role involves enabling and supporting self-hosted deployments for enterprise clients needing on-premises or private cloud environments. We are looking for engineers with profound expertise in distributed systems, infrastructure-as-code, production operations, and customer-facing support, who aspire to influence the technical architecture of a rapidly evolving platform.

What You'll Do

Design, implement, and operate large-scale distributed systems that power Serval's AI agents, workflow orchestration, and data pipelines.
Create and maintain Terraform modules to provision and manage cloud infrastructure across AWS, GCP, or Azure environments.
Develop and sustain deployment packages, installation scripts, and infrastructure templates, enabling customers to self-host Serval in their own environments.
Provide technical support and guidance to enterprise customers during installation and deployment phases.

Jan 29, 2026

Software Engineer, Infrastructure

Imprint

Full-time|On-site|San Francisco

About Us

At Imprint, we are revolutionizing the world of co-branded credit cards and innovative financial solutions, focusing on smarter, more rewarding, and brand-first experiences. We collaborate with renowned brands such as Crate & Barrel, Rakuten, Booking.com, H-E-B, Fetch, and Brooks Brothers to establish modern credit programs that enhance customer loyalty, unlock savings, and stimulate growth. Our robust platform integrates advanced payment technologies, intelligent underwriting, and a seamless user experience, enabling brands to offer impactful financial products without the complexities of becoming a bank.

Co-branded credit cards represent over $300 billion in U.S. annual spending, yet many are still managed by outdated banking systems. Imprint stands as the modern alternative: flexible, technology-driven, and tailored for today’s consumers. Supported by notable investors like Kleiner Perkins, Thrive Capital, and Khosla Ventures, we are assembling a world-class team dedicated to reshaping payment methods and driving brand growth. If you thrive in fast-paced environments, enjoy tackling complex challenges, and aspire to make a significant impact, we would be delighted to meet you.

Discover more about us on Imprint's Technology Blog.

The Team

The Tech Platform Engineering Team at Imprint is pioneering the democratization of access to advanced technologies, empowering teams across our organization to innovate and excel. Our commitment to redefining the Fintech landscape drives us to build secure, highly available infrastructures while equipping our engineers with comprehensive development tools, allowing them to rapidly create world-class products.

Your Role

Design, build, and manage cloud and web infrastructure with a strong emphasis on security, reliability, and scalability.
Implement and maintain infrastructure components across computing, networking, and data platforms.
Adhere to security best practices in cloud infrastructure, ensuring proper access control, network isolation, and secure communication between services.
Monitor system health and engage in incident response, root cause analysis, and reliability enhancements.
Collaborate with platform, security, and product engineers to deliver safe and efficient infrastructure solutions.

Jan 16, 2026

Backend & Infrastructure Software Engineer

vooma

Full-time|On-site|San Francisco Office

About the Role

Join our pioneering team at vooma as a Backend & Infrastructure Software Engineer, where you will play a critical role in shaping the technical infrastructure of a transformative company.

If you are passionate about creating not only resilient systems but also the foundational architecture of a groundbreaking enterprise from the outset, this position is ideal for you.

We are looking for someone who excels at crafting infrastructure that is elegant, dependable, and secure, even under high-demand scenarios. You thrive on the challenge of scaling systems that enable intelligent agents and take pride in establishing reliable foundations that others can rely on.

Your Key Responsibilities Include:

Design and maintain secure, scalable infrastructure tailored for AI-powered agents in production environments.
Deploy and optimize AI-driven services to meet high availability and performance standards.
Manage infrastructure as code, alongside cloud environments and CI/CD pipelines.
Implement monitoring, observability, and alerting systems to ensure the reliability of our infrastructure.
Contribute to infrastructure security and adhere to best practices.

You Should Have:

Experience in deploying and productionizing machine learning or AI-centric workloads.
Proficiency in developing secure, scalable infrastructures on platforms such as AWS, Azure, or GCP.
In-depth knowledge of backend systems, networking, and container orchestration technologies (e.g., Kubernetes).
Understanding of infrastructure security principles and compliance standards (e.g., SOC2).
A proactive and hands-on mindset, with a strong drive to solve challenges from start to finish.

Jul 1, 2025

Software Engineer - Infrastructure

Baseten

Full-time|$300K/yr - $300K/yr|On-site|San Francisco

ABOUT BASETEN

Join Baseten, where we drive mission-critical AI inference for leading companies like Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. Our unique blend of applied AI research, robust infrastructure, and intuitive developer tools empowers organizations at the forefront of AI innovation to deploy state-of-the-art models into production. Recently, we secured a $300M Series E funding round, backed by esteemed investors such as BOND, IVP, Spark Capital, Greylock, and Conviction. Be a part of our rapid growth and help shape the platform that engineers trust for launching AI products.

THE ROLE

As an Infrastructure Software Engineer at Baseten, you will play a pivotal role in developing and maintaining our ML inference platform that powers AI applications in production. Your contributions will enhance the core infrastructure, enabling developers to deploy, scale, and monitor machine learning models with exceptional performance.

EXAMPLE INITIATIVES

You will engage in innovative projects within our Infrastructure team, including:

Multi-cloud capacity management
Inference on B200 GPUs
Multi-node inference
Fractional H100 GPUs for efficient model serving

RESPONSIBILITIES

Design and develop infrastructure components for our ML inference platform, primarily using Python and Go.
Implement and maintain Kubernetes deployments for optimal model serving.
Contribute to the orchestration layer for model deployments.
Build and enhance monitoring systems to track model performance metrics effectively.
Develop efficient resource management solutions to optimize performance.

Mar 9, 2025

Infrastructure Software Engineer

Sift

Full-time|$150K/yr - $200K/yr|On-site|San Francisco, CA

At Sift, we are revolutionizing the way cutting-edge machines are constructed, tested, and managed. Our innovative platform provides engineers with real-time visibility into high-frequency telemetry, effectively removing bottlenecks and facilitating quicker, more dependable development.

Sift originated from our experience at SpaceX, contributing to projects like Dragon, Falcon, Starlink, and Starship, where the demands of scaling telemetry, debugging flight systems, and ensuring mission reliability necessitated a new kind of infrastructure. Founded by a talented team from SpaceX, Google, and Palantir, Sift is tailored for mission-critical systems where precision and scalability are imperative.

As one of the pioneering engineers at Sift, your role will extend beyond just coding: you will play a crucial part in defining the architecture, shaping the product, and influencing the culture of a company dedicated to addressing real engineering challenges. If you're eager to take on intricate technical obstacles and build foundational systems that support complex machines from the ground up, we would love to connect with you.

Oct 30, 2025

Infrastructure Software Engineer

Ivo

Full-time|On-site|San Francisco, California

Join Ivo's Engineering Team!

At Ivo, we are pioneers in the tech industry. Our engineers are innovators who have created groundbreaking solutions such as:

• An AI agent that seamlessly integrates with MS Word to enhance document editing [2023]
• Revolutionizing embedding models with agentic RAG technology [2023]
• Advanced LLM-based legal fact extraction capabilities [2024]
• A legal assistant designed to search extensive contract databases without compromising accuracy [2024]
• Clustering legal documents from the same lineage [2025]
• Automatic deviation analysis to uncover hidden risks in vast contract databases [2025]
• Merging contracts with their amendments to create a “composite” contract timeline that has moved our clients to tears [2025]

Role Overview

As an Infrastructure Engineer at Ivo, you will lay the groundwork for our platform's future. Your responsibilities will include:

• Designing and owning the future of our infrastructure, allowing you the freedom to innovate.
• Managing multiple customer deployments, ensuring each receives tailored containers, databases, and VPCs.
• Instrumenting our systems to identify performance bottlenecks and errors.
• Aggregating metrics and logs into visually appealing dashboards and setting up pager alerts.
• Leading infrastructure-related incidents and being on-call as necessary.
• Enhancing our CI/CD system to reduce deployment time from ~12 minutes.

If you're passionate about LLMs, you'll thrive in our engineering team, where you’ll have the opportunity to:

• Develop real-time LLM evaluations to monitor the accuracy of our responses.
• Collaborate with talented engineers to push the boundaries of DevOps.

Nov 20, 2025

Software Engineer - Infrastructure

Astranis

Full-time|On-site|San Francisco

Astranis is seeking a talented and motivated Software Engineer to join our Infrastructure team. In this role, you will be at the forefront of developing and maintaining critical software systems that support our innovative satellite technology. You'll collaborate with cross-functional teams to design, implement, and optimize our infrastructure solutions, ensuring high reliability and performance.

Apr 9, 2026

Infrastructure Software Engineer

Ivo Inc.

Full-time|On-site|San Francisco

About Engineering at Ivo Inc.

Ivo Inc. builds advanced legal technology from its San Francisco base. The engineering team has delivered several notable products, including:

An AI agent for Microsoft Word that edits documents automatically (2023)
Migration from traditional embedding models to agentic RAG methods (2023)
Large-scale legal fact extraction powered by LLMs (2024)
A legal assistant designed to search large contract databases with precision (2024)
Clustering related legal documents to improve organization (2025)
Automated deviation analysis to surface hidden risks in contract data (2025)
Combining contracts and amendments to create comprehensive contract time series (2025)

Role Overview: Infrastructure Software Engineer

The Infrastructure Software Engineer will help shape the core systems that power Ivo's platform. This role offers the chance to architect, optimize, and maintain the infrastructure supporting sensitive client data and high-performance legal applications.

What You Will Do

Own and influence the evolution of Ivo's infrastructure, with significant freedom to design systems due to a lean operational footprint.
Orchestrate customer deployments, managing containers, databases, and VPCs for each client to ensure data isolation and security.
Implement instrumentation to surface performance bottlenecks and errors across the stack.
Aggregate metrics, logs, and health checks into dashboards and alerting systems for clear visibility.
Participate in on-call rotations to lead and resolve infrastructure incidents.
Optimize CI/CD pipelines to reduce deployment times (current average: 12 minutes).

Opportunities to Advance DevOps and LLM Integration

Develop real-time LLM evaluations to track output accuracy.
Create autonomous agents that identify and troubleshoot production issues proactively.
Bring forward new ideas to improve infrastructure and operations.

Mission

Ivo's mission is to empower clients with advanced legal technology that boosts efficiency and accuracy.

Apr 14, 2026

Software Engineer, Cloud Infrastructure

OpenAI

Full-time|On-site|San Francisco

Join Our Innovative Team

The Applied Engineering team at OpenAI is dedicated to bridging the gap between research, engineering, product, and design, delivering cutting-edge AI technology to consumers and businesses alike.

As a pivotal member of our team, you will manage the core infrastructure that underpins products such as ChatGPT and our API. This includes overseeing our Kubernetes clusters, infrastructure deployment, networking stack, cloud abstractions, and more.

Our mission is to learn from our deployments and ensure the responsible and safe use of AI technology. We place a higher priority on safety than on unchecked growth.

About Your Role

As a vital contributor to the cloud infrastructure team, you'll be responsible for constructing and maintaining infrastructure abstractions that facilitate swift and scalable product delivery.

This position is based in our San Francisco, CA office.

Your Responsibilities:

Architect and develop robust development and production platforms that ensure reliability and security at scale.
Optimize our infrastructure for scalability to meet future demands.
Foster a diverse, equitable, and inclusive work culture that encourages open communication and challenges conventional thinking.
Participate in an on-call rotation to maintain the reliability of the systems we build and respond to critical incidents as necessary.

You Will Excel in This Position If You:

Possess over 5 years of experience in building core infrastructure.
Have extensive experience with orchestration systems such as Kubernetes at scale.
Are skilled in creating abstractions over cloud platforms.
Take pride in developing and managing scalable, reliable, and secure systems.
Thrive in environments characterized by ambiguity and rapid change.

This role is exclusively located at our San Francisco headquarters. We offer relocation assistance to qualified candidates.

Aug 4, 2025

Product Infrastructure Software Engineer

Netic

Full-time|On-site|San Francisco

Netic is revolutionizing the essential services sector with our AI-driven revenue engine, empowering the backbone of the American economy.

With $43M in funding from leading investors such as Founders Fund, Greylock, Hanabi, and Dylan Field, who spearheaded our Series B, we have enabled our clients to secure hundreds of thousands of jobs across various service industries in North America. Today, numerous companies thrive entirely on an AI-first model powered by Netic.

As a member of our team consisting of innovative builders from top organizations such as Scale, Databricks, HRT, Meta, MIT, Stanford, and Harvard, you will be at the forefront of integrating frontier AI into the physical economy, where challenges are complex, data is intricate, and impacts are immediate and substantial.

In the role of a founding Product Infrastructure Engineer, you will design and scale the crucial infrastructure that supports our autonomous AI agents, addressing real-world challenges with significant, tangible outcomes. You will work alongside a passionate team of builders to develop infrastructure and processes from scratch, utilizing state-of-the-art cloud and orchestration technologies. If you excel in dynamic, ambiguous settings and are eager to set new benchmarks in the agentic domain, this is your chance to make a lasting impact.

May 30, 2025

Infrastructure Software Engineer

Chai Discovery

Full-time|On-site|San Francisco office

About Chai Discovery

Chai Discovery specializes in developing cutting-edge AI models that revolutionize molecular design and redefine drug discovery processes. Our passionate team is dedicated to transforming the search for new cures and improving lives.

Our founding team comprises top researchers and Silicon Valley experts, having achieved significant milestones in AI for biology. With a history of co-inventing protein language modeling and creating advanced folding algorithms, our technology has been embraced by leading pharmaceutical companies. We are proud to be supported by prestigious investors including OpenAI, Thrive Capital, Dimension, Conviction, Lachy Groom, Amplify, and others.

About the Role

We are seeking a dedicated Infrastructure Software Engineer focused on crafting robust, streamlined infrastructure solutions. You will develop the foundational compute and infrastructure systems that support our product offerings, model inference processes, and evaluation frameworks. Collaboration with product engineers, researchers, and our commercial team will be key to your success.

You will have experience creating services that developers appreciate, successfully deploying and scaling AI/ML systems in production, and effectively anticipating potential challenges that may hinder the adoption of our platform by leading biopharmaceutical organizations.

As Chai's models advance from protein structure prediction into practical therapeutic engineering, this role presents a unique opportunity to bring state-of-the-art AI drug design models to market, working alongside a team that is both detail-oriented and optimistic about the future.

About You

You are motivated by a mission to establish the benchmark for impactful AI technology. We are looking for candidates who possess:

Software Experience:

A Bachelor’s degree or equivalent experience in Computer Science or a related field.
5+ years of experience in building production systems utilizing contemporary tools, collaborating with platform, security, and product teams.
A keen ability to foresee infrastructure challenges.
Comprehensive ownership of 24/7 infrastructure observability, alerting, and incident response.
Experience in both 0-to-1 buildouts and 1-to-n scale-ups, along with a rich repository of best practices and strategies.

Communication & Collaboration:

A passion for code pair-reviewing, documentation, and knowledge sharing with peers.

Nov 25, 2025

Infrastructure Software Engineer

Blockit

Full-time|On-site|San Francisco

About Blockit

At Blockit, we recognize that time is our most precious resource, yet the art of scheduling often feels antiquated. Our mission is to revolutionize this process through advanced AI technology that acts as an autonomous time agent, adeptly managing the complexities of scheduling (including time zones, group coordination, and logistical considerations) as though it were an ever-vigilant executive assistant.

As pioneers in the AI space, Blockit is at the forefront of developing one of the first multiplayer, stateful AI agents capable of facilitating interactions among multiple users, maintaining contextual continuity across conversations, and executing real-world actions. The more users integrate their calendars, the more robust our network becomes.

Join our dynamic team, supported by Sequoia, where we maintain a fast-paced environment, consistently ship innovative solutions, and uphold high standards of excellence. If you’re excited about building groundbreaking technology, we would love to connect with you.

To explore our team culture further, please visit our team page.

The Role

In this role, you will ensure that Blockit remains fast, reliable, and primed for scalability.

You will take ownership of our core infrastructure, which includes databases, asynchronous job processing, observability, and the systems that drive our AI agents, including the LLM gateway. You will architect solutions as we expand, whether that means integrating new systems or innovating entirely new approaches. Furthermore, you will be the go-to person for reliability and performance, ensuring our systems remain robust as usage increases.

This position is perfect for someone who is passionate about operational excellence and eager to lay the groundwork for a platform that orchestrates millions of calendars.

What You’ll Do

Manage and evolve our core infrastructure, including PostgreSQL, ClickHouse, and asynchronous processing pipelines.
Design and optimize our LLM infrastructure, which encompasses the LLM gateway, evaluation pipelines, and observability stack, to guarantee reliability, performance, and cost-effectiveness.
Develop comprehensive monitoring, alerting, and dashboard solutions to promptly identify issues.
Architect and implement new infrastructure as we scale, such as Redis, Kafka, or similar systems, making informed trade-offs along the way.
Enhance deployment pipelines and developer experiences to maintain rapid and safe shipping of updates.

Jan 21, 2026

Infrastructure Software Engineer

xdof

Full-time|Hybrid|San Francisco Hybrid

Join xdof, where innovation meets opportunity! As we stand at the forefront of robotics and AI technology, we are dedicated to addressing the critical need for high-quality training data. Our mission is to develop sophisticated data collection systems, operational capabilities, and expansive data warehouses that empower our partners to lead the field.

As an Infrastructure Engineer, you will be instrumental in creating a robust platform that supports our growing data collection initiatives.

Key projects you may work on include:

Developing an orchestration system for processing data upon ingestion.
Designing an internal platform that allows researchers to experiment with our datasets.
Managing a multi-tenant data lake to enhance data accessibility and collaboration.

Dec 10, 2025

Infrastructure Software Engineer

doppel

Full-time|On-site|San Francisco

Why Join Doppel?

At Doppel, we are dedicated to tackling one of the most significant threats posed by AI: mass-manufactured social engineering. With scams, deepfakes, and social engineering attacks proliferating across digital platforms such as websites, social media, advertisements, encrypted messaging apps, and mobile devices, our mission is both simple and ambitious: to enhance internet safety by outsmarting the fastest-evolving digital threats.

Supported by renowned investors like a16z and Bessemer, and trusted by industry leaders such as OpenAI, United Airlines, and Coinbase, Doppel is on a rapid growth trajectory. If you are passionate about addressing real-world challenges through innovative technology, we want to hear from you!

What We're Building

We are developing an AI-driven platform to combat social engineering on a large scale. This involves creating scalable systems that monitor billions of domains, social media accounts, applications, and dark web forums, utilizing AI agents to detect and neutralize digital threats effectively.

What We're Looking For

We are in search of a skilled backend engineer to enhance the infrastructure needed for our rapidly expanding engineering team. Recent projects include:

Developed a self-hosted Elasticsearch infrastructure on Kubernetes, facilitating real-time search capabilities across millions of alerts and associated metadata.
Established core infrastructure using Terraform (Infrastructure as Code), enabling reproducible, version-controlled environments and expediting onboarding for new engineers.
Implemented a dedicated staging environment, which enhances safety during releases, feature validation, and automated integration testing prior to production deployments.
Introduced observability and tracing mechanisms (metrics, logging, distributed tracing), significantly improving our capacity to debug performance issues and sustain reliability at scale.

What We Offer

A mission-driven culture emphasizing low ego, high accountability, deep customer focus, and exceptional talent density.
Complimentary lunch and dinner in the office.
Flexible Paid Time Off (PTO).
Quarterly team offsites.

Sep 12, 2025

Infrastructure Software Engineer

cognition

Full-time|On-site|San Francisco Bay Area

About the Role

Cognition is looking for an Infrastructure Software Engineer in the San Francisco Bay Area. This role focuses on designing, building, and maintaining infrastructure that supports high-performance systems. Collaboration with teams across the company is central to the work.

What You Will Do

Develop and maintain scalable infrastructure solutions.
Work closely with colleagues from different disciplines to support application needs.
Help ensure systems remain reliable and efficient as they grow.

Apr 16, 2026
