
Engineering Manager Compute Infrastructure jobs in San Francisco

Open roles matching “Engineering Manager Compute Infrastructure” in San Francisco. 8,358 active listings on RoboApply Jobs.


1 - 20 of 8,358 Jobs
Databricks
Full-time|$190K/yr - $253.8K/yr|On-site|Mountain View, California; San Francisco, California

At Databricks, we are dedicated to empowering data teams to tackle some of the most challenging problems in the world—from revolutionizing transportation to fast-tracking medical innovations. We achieve this by developing and managing the foremost data and AI infrastructure platform, enabling our clients to leverage profound data insights to enhance their enterprises. Founded by engineers with a customer-centric approach, we seize every chance to resolve technical challenges, from crafting next-generation UI/UX for data interactions to scaling our services and infrastructure across millions of virtual machines. And we’re just getting started.

Within Databricks, the Compute Infrastructure organization is responsible for building and operating the essential framework that supports all Data, AI, and stateful workloads across major cloud platforms. Our system launches tens of millions of VMs daily, manages thousands of Kubernetes clusters, and must deliver exceptional elasticity, reliability, and cost-effectiveness. We are in search of an Engineering Manager to lead a team focused on pivotal components of this platform. Your contributions will significantly impact product delivery speed, customer satisfaction, and our company's scalability.

The impact you will have:
• Own and enhance the compute platform to support all Databricks workloads, enabling engineers to create top-tier products with high velocity and superior performance.
• Recruit exceptional engineers and nurture their development through guidance, feedback, and career advancement opportunities.
• Elevate the technical and operational standards through robust design practices, rigorous testing, and a culture of engineering excellence and platform thinking.
• Collaborate with engineering and product leadership to establish long-term strategies and roadmaps.
• Lead cross-functional initiatives encompassing both product and infrastructure domains.
• Influence architectural decisions that extend beyond your immediate team.

Feb 13, 2026
Andromeda Cluster
Full-time|Remote|Global Remote / San Francisco, CA

Location: North America Remote / San Francisco · Full-Time

About Andromeda
Founded by Nat Friedman and Daniel Gross, Andromeda Cluster is on a mission to democratize access to advanced AI infrastructure for early-stage startups. Initially starting with a single managed cluster, we rapidly expanded our capabilities to build a robust orchestration layer that enhances global AI infrastructure accessibility. We collaborate with prominent AI labs, data centers, and cloud providers to ensure compute resources are efficiently delivered where and when they are most required. Our platform optimizes the routing of training and inference jobs globally, enhancing flexibility and operational efficiency in one of the most dynamic markets around. Our vision is to establish the liquidity layer for global AI compute, and we are continually seeking exceptional talent in AI infrastructure, research, and engineering.

The Opportunity
We are in search of an Infrastructure Manager to enhance the alignment of supply and demand on our platform. This is an individual contributor position, reporting directly to the Head of Infrastructure. The Infrastructure team forms the backbone of our operations, focusing on acquiring and managing compute resources in collaboration with our compute providers, sales, and technical teams. As we scale our operations, we aim to broaden our network and liquidity while deepening our service offerings and accelerating growth.

What You'll Do
• Align incoming leads from the sales team with both internal and external compute capacities.
• Optimize the utilization of our compute resources.
• Identify and onboard new compute suppliers globally.
• Source capacity tailored to customer requirements and market trends.
• Address customer and supplier challenges in a fast-paced, dynamic environment.
• Analyze technical and commercial differences among suppliers to refine our capacity strategies.
• Formulate a proactive compute strategy driven by market insights.
• Negotiate costs with suppliers and vendors.
• Design and implement capacity planning processes.

Mar 25, 2026
OpenAI
Full-time|On-site|San Francisco

Team and Platform Focus
The Compute Infrastructure team at OpenAI designs, builds, and maintains the systems that support AI research at scale. This work brings together accelerators, CPUs, networking, storage, data centers, orchestration software, agent infrastructure, developer tools, and observability. The aim is to create a reliable, unified experience for researchers and product teams across the company. Projects span the full stack: capacity planning, cluster lifecycle management, bare-metal automation, and distributed systems. The team manages Kubernetes scheduling, system optimization, high-performance networking, storage, fleet health, reliability, workload profiling, benchmarking, and improvements to the developer experience. Even small improvements in communication, scheduling, hardware efficiency, or debugging can significantly accelerate research. OpenAI matches engineers to areas within Compute Infrastructure that align with their skills and interests.

Role Overview
This Software Engineer role centers on building and evolving the compute platform that supports OpenAI’s research and products. Candidates may bring expertise in low-level systems, high-performance computing, distributed infrastructure, reliability, CaaS, agent infrastructure, developer platforms, tooling, or infrastructure user experience. The most important qualities are strong analytical skills, the ability to write resilient code, and a collaborative approach that helps colleagues move faster and with more confidence.

What You Will Work On
• Working close to hardware or at the user interaction layer
• Developing CaaS and agent infrastructure
• Managing control and data planes that connect the system
• Bringing new supercomputing capabilities online
• Optimizing training workloads through profiler traces and benchmarks
• Improving NCCL and collective communication
• Analyzing GPUs, NICs, topology, firmware, thermal dynamics, and failure modes
• Designing abstractions to unify diverse clusters into a single platform

Areas of Expertise
No one is expected to cover every area listed. Some engineers focus on system performance, kernel or runtime behavior, large-scale networking protocols, RDMA, NCCL, GPU hardware, benchmarking, scheduling, or hardware reliability. Others improve the platform’s usability through APIs, tools, workflows, and developer experience. The team values strong engineering judgment and a drive to advance the field.

Apr 27, 2026
Databricks
Full-time|On-site|San Francisco, California

Databricks is looking for a Senior Software Engineer focused on Compute Infrastructure in San Francisco, California. This position centers on building and improving compute architecture to support greater performance and scalability across Databricks' platform.

What you will do
• Develop and optimize compute infrastructure to handle demanding data processing and analytics workloads.
• Work closely with teams from different disciplines to deliver reliable, high-quality solutions for customers.

Impact
Your contributions will help define how data processing and analytics evolve at Databricks. The work directly supports customers’ ability to scale and perform complex tasks in the cloud.

Who we’re looking for
• Strong background in cloud technologies and compute systems.
• Enjoys tackling complex technical challenges.
• Collaborative approach to problem-solving with cross-functional teams.

Apr 28, 2026
OpenAI
Full-time|Hybrid|San Francisco

About Our Team
The Compute Infrastructure Team at OpenAI manages a robust fleet of GPUs and extensive compute clusters that support the models powering ChatGPT and our API. This team also accommodates the training demands for our upcoming models. We specialize in operating a state-of-the-art GPU fleet, offering a cohesive platform for various OpenAI teams to effortlessly execute production-level Applied AI and research training tasks. Our mission is to harness the potential of AI responsibly, ensuring its benefits are shared while prioritizing safety over unrestrained growth.

Role Overview
As a Technical Program Manager on our engineer-centric TPM team, you will take charge of the comprehensive delivery of large-scale GPU clusters, collaborating closely with engineers to initiate clusters across external providers and partners. You will manage a diverse portfolio that encompasses hardware, networking, power, and cooling—steering execution, risk management, and establishing clear alignment from operational teams to leadership, all aimed at delivering scalable, production-ready capacity. This position is located in San Francisco, CA, operating under a hybrid work model requiring three days in the office weekly. We also provide relocation assistance for new hires.

Key Responsibilities
• Oversee the complete delivery of new Compute SKUs and large-scale GPU clusters within an external partner network while aiding capacity planning for both training and inference workloads.
• Drive multi-threaded program initiatives involving hardware, networking, power, and cooling—taking ownership of plans, interdependencies, and critical pathways.
• Collaborate with chip providers to mitigate risks associated with long-term onboarding to new hardware platforms, engaging with teams across kernels, communications, hardware, and scheduling.
• Develop and implement program mechanisms such as roadmaps, milestones, risk registers, and runbooks to ensure predictable delivery at scale.
• Work alongside engineering teams to enhance cluster turn-up reliability, repeatability, and automation, thereby decreasing the time-to-serve for new capacities.
• Facilitate cross-functional readiness involving security, finance, operations, and product/research stakeholders to ensure the launch of production-ready compute capabilities.
• Manage integrations and transitions among teams and partners to guarantee seamless execution, transparent communication, and prompt issue resolution.
• Identify operational bottlenecks and systemic deficiencies, driving sustainable improvements across tooling, processes, and partner interactions.

Mar 12, 2026
Sift
Full-time|$150K/yr - $200K/yr|On-site|San Francisco, CA

At Sift, we are revolutionizing the way cutting-edge machines are constructed, tested, and managed. Our innovative platform provides engineers with real-time visibility into high-frequency telemetry, effectively removing bottlenecks and facilitating quicker, more dependable development.

Sift originated from our experience at SpaceX, contributing to projects like Dragon, Falcon, Starlink, and Starship, where the demands of scaling telemetry, debugging flight systems, and ensuring mission reliability necessitated a new kind of infrastructure. Founded by a talented team from SpaceX, Google, and Palantir, Sift is tailored for mission-critical systems where precision and scalability are imperative.

As one of the pioneering engineers at Sift, your role will extend beyond just coding—you will play a crucial part in defining the architecture, shaping the product, and influencing the culture of a company dedicated to addressing real engineering challenges. If you're eager to take on intricate technical obstacles and build foundational systems that support complex machines from the ground up, we would love to connect with you.

Oct 30, 2025
fal
Full-time|$180K/yr - $250K/yr|On-site|San Francisco

Join our innovative team at fal as a Staff Software Engineer specializing in large-scale computation platforms. We are seeking a seasoned software engineer with extensive experience in developing backend systems that efficiently orchestrate workloads and manage resource constraints. Your expertise in foundational cloud infrastructure and Linux provisioning will be crucial as you work towards achieving high reliability and scalability with minimal operational overhead.

Dec 16, 2025
Netic
Full-time|On-site|San Francisco

Netic is revolutionizing the essential services sector with our AI-driven revenue engine, empowering the backbone of the American economy.

With $43M in funding from leading investors such as Founders Fund, Greylock, Hanabi, and Dylan Field, who spearheaded our Series B, we have enabled our clients to secure hundreds of thousands of jobs across various service industries in North America. Today, numerous companies thrive entirely on an AI-first model powered by Netic.

As a member of our team of innovative builders from top organizations such as Scale, Databricks, HRT, Meta, MIT, Stanford, and Harvard, you will be at the forefront of integrating frontier AI into the physical economy, where challenges are complex, data is intricate, and impacts are immediate and substantial.

In the role of a founding Product Infrastructure Engineer, you will design and scale the crucial infrastructure that supports our autonomous AI agents, addressing real-world challenges with significant, tangible outcomes. You will work alongside a passionate team of builders to develop infrastructure and processes from scratch, utilizing state-of-the-art cloud and orchestration technologies. If you excel in dynamic, ambiguous settings and are eager to set new benchmarks in the agentic domain, this is your chance to make a lasting impact.

May 30, 2025
Crusoe
Full-time|On-site|San Francisco, CA - US

Join Crusoe as a Senior Engineering Manager in Compute, where you will play a pivotal role in leading cutting-edge engineering teams. You will be responsible for overseeing the development and execution of our innovative computing solutions, ensuring performance and reliability across various platforms.

Your leadership will guide teams toward achieving engineering excellence, fostering a collaborative environment, and driving strategic initiatives. This is an opportunity to make a significant impact within a rapidly growing company at the forefront of technology.

Feb 25, 2026
Glacier
Full-time|$175K/yr - $250K/yr|Hybrid|San Francisco Office

Join our innovative team at Glacier! This hybrid role requires in-office presence on Tuesdays and Thursdays.

At Glacier, we're on a mission to address one of the most pressing challenges of our time: waste management. Did you know that over half of recyclables in the U.S. end up in landfills? We're committed to changing that narrative. Our efforts not only aim to enhance recycling practices, but also to mitigate carbon emissions, conserve energy, and protect our natural resources. We develop advanced sorting robots tailored to efficiently separate recyclables, combined with AI-driven business analytics that empower recyclers to optimize their operations and promote a more circular economy. Our technology has garnered the trust of major clients, including Colgate, Amazon, and municipal recycling facilities, as we turn recycling data into actionable insights. Our innovations have been featured in TIME's Best Inventions, a TIME documentary, and various leading publications like TechCrunch, Fortune, and CBS.

The Role:
We are seeking a dynamic and experienced technical leader to spearhead Glacier's Computer Vision strategy and manage our engineering team. This is a hands-on leadership position overseeing a distributed team across the U.S. and globally. Computer Vision is integral to our product offerings and significantly influences our company's achievements. This position will report directly to the co-founder and CTO.

What You'll Do:
• Drive the vision, strategy, and execution of Glacier's Computer Vision roadmap.
• Lead and cultivate our distributed Computer Vision engineering team through hiring, onboarding, and performance management.

Feb 19, 2026
Avala AI
Full-time|$130K/yr - $190K/yr|On-site|San Francisco

Job Category: AI & Robotics

About Avala AI
Avala AI is a pioneering AI Data Infrastructure company at the forefront of real-world AI and its integration with the labor economy. We excel in delivering high-quality data labeling, comprehensive dataset management, and insightful data visualization, providing 4D labeling solutions tailored for autonomous vehicles, humanoid robots, and drone applications. Our mission is to empower AI-driven sectors—ranging from AV companies to robotics innovators and drone enterprises—by equipping them with the essential data infrastructure to propel the next generation of intelligent systems while offering dignified digital employment opportunities globally.

The Role
In your capacity as a 3D Computer Vision Engineer at Avala AI, you will be responsible for designing and implementing cutting-edge solutions for both offline and online 3D reconstruction and scene understanding, ensuring robustness, accuracy, and performance. You will collaborate on a world-class spatial computing platform deployed extensively in autonomous vehicles, advanced robotic systems, and drone technologies. Your contributions will advance the capabilities of real-world AI while utilizing the latest advancements in deep learning and 3D computer vision techniques.

What You’ll Do
• Spatial Computing & Reconstruction: Innovate through the application of NeRFs, Diffusion Models, Gaussian Splatting, Multiview Stereo, TSDF Fusion, Structure from Motion, and SLAM methodologies.
• Mission-Critical Perception: Develop robust 3D perception systems and scene understanding frameworks that enhance safety and operational performance across various robotics and AV applications.
• 4D Data Labeling & Visualization: Work collaboratively with cross-functional teams to enhance and expand Avala’s 4D labeling platform for automobiles, humanoid robots, and drones.
• Software Engineering Best Practices: Apply strong coding, testing, and deployment methodologies to ensure rapid, safe, and efficient development of innovative solutions.
• Boundary-Pushing Innovation: Actively explore new methodologies and technologies that advance the field of 3D vision, neural rendering, and large-scale data processing.

Jan 1, 2025
Anthropic
Full-time|On-site|San Francisco, CA

Join Anthropic as a Strategic Deals Lead focused on our Compute & Infrastructure initiatives. In this pivotal role, you will spearhead the development of strategic partnerships and enhance our infrastructure capabilities. You will work closely with cross-functional teams to optimize operational efficiency while ensuring that our technical solutions are scalable and robust. Your leadership and vision will be crucial in navigating complex negotiations and driving successful outcomes for our organization.

Apr 2, 2026
AeroVect
Full-time|On-site|San Francisco

Join our dynamic engineering team at AeroVect as an Infrastructure Engineer, where you'll play a vital role in the operational success of our advanced autonomy stack. Your expertise will be crucial in developing and optimizing our cloud computing and storage systems, while managing our provisioning backbone to support innovative engineering projects. This is a unique opportunity to shape the future of our development workflow, collaborating with engineers and customers to bring our autonomous driving technology to life at some of the world's busiest airports.

In this role, you will engage with cutting-edge techniques and established technologies to accelerate the development and continuous integration of our perception, localization, motion planning, and control systems tailored for various airport driving scenarios. You will work directly with our co-founders and the Autonomy Lead to create a market-leading enterprise product that merges autonomous vehicle technology with a robotics-as-a-service (RaaS) model.

Your Responsibilities Include:
• Leading the hands-on creation of reliable data pipelines and DevOps infrastructure, dedicating approximately 80% of your time to development and 20% to integration with our software stack.
• Establishing best practices for deploying and maintaining software with exceptional reliability and minimal downtime.
• Managing data pipelines to efficiently process extensive datasets from the largest airports globally.
• Enhancing developer efficiency by optimizing workflows, including build systems.
• Designing systems for large-scale simulations and addressing bottlenecks as they arise.
• Monitoring AeroVect software on deployed vehicles and devising solutions to any encountered challenges.
• Collaborating with the engineering team and customers on current and future deployments.

Jan 16, 2022
Sygaldry Technologies

ML Infrastructure Engineer

Full-time|On-site|San Francisco

About Sygaldry Technologies
Sygaldry Technologies develops quantum-accelerated AI servers in San Francisco, focusing on faster AI training and inference. By combining quantum technology with artificial intelligence, the team addresses challenges in computing costs and energy efficiency. Their AI servers integrate multiple qubit types within a fault-tolerant system, aiming for a balance of cost, scalability, and speed. The company values optimism, rigor, and a drive to solve complex problems in physics, engineering, and AI.

Role Overview: ML Infrastructure Engineer
The ML Infrastructure Engineer joins the AI & Algorithms team, which includes research scientists, applied mathematicians, and quantum algorithm specialists. This role centers on building and maintaining the compute infrastructure that powers advanced research. The systems you build will support reliable GPU access, reproducible experiments, and scalable workloads, so researchers can focus on their core work without needing deep cloud expertise. Expect to design and manage compute platforms for a range of tasks, including quantum circuit simulation, large-scale numerical optimization, model training, tensor network contractions, and high-throughput data generation. These workloads span multiple cloud providers and on-premises GPU servers.

Key Responsibilities
• Develop compute abstractions for diverse workloads, such as GPU-accelerated simulations, distributed training, high-throughput CPU jobs, and interactive analyses using frameworks like PyTorch and JAX.
• Set up infrastructure to support experiment tracking and reproducibility.
• Create developer tools that make cloud computing feel local, streamlining environment setup, job submission, monitoring, and artifact management.
• Scale experiments from single-GPU prototypes to large, multi-node production runs.

Multi-Cloud GPU Orchestration
• Design orchestration strategies for workloads across multiple cloud providers, optimizing job routing for cost, availability, and capability.
• Monitor and improve cloud spending, keeping track of credit balances, burn rates, and expiration dates.

Apr 14, 2026
Databricks
Full-time|$153K/yr - $210.4K/yr|On-site|San Francisco, California

At Databricks, we are dedicated to empowering data teams to tackle some of the world's most challenging issues — from transforming transportation to fast-tracking medical advancements. Our mission is to create and maintain the premier data and AI infrastructure platform that enables our clients to harness deep data insights for business enhancement. Founded by engineers with a customer-first mindset, we embrace every challenge, whether it's designing cutting-edge UI/UX for data interaction or scaling our services across millions of virtual machines. And we're just getting started.

We are seeking a strategic, customer-centric, and results-driven Senior Product Manager for our Compute Platform. This platform is essential for powering Databricks workloads, offering various compute options for diverse tasks — including Classic Compute, SQL Warehouses, and Serverless Compute. It plays a critical role in many of our products and features.

This position will involve extensive collaboration across teams and requires a robust technical background. You will coordinate product initiatives from conception through execution, engage with large enterprise clients to gauge their requirements, develop long-term product strategies, define product roadmaps, collaborate with engineering to create these products, and interact with various internal stakeholders (both pre- and post-launch) to ensure the success of our offerings.

Feb 1, 2026
Roboflow Inc.
Full-time|$4K/mo - $4K/mo|Remote|NY, SF or Remote

About Us
At Roboflow, our mission is to empower developers to make the world programmable through advanced artificial intelligence solutions. We believe that vision is a fundamental way we comprehend our environment, and soon, this understanding will be reflected in the software we utilize.

We are dedicated to creating tools, fostering community, and providing resources that simplify the development and deployment of computer vision models. With over 1 million developers, including teams from half of the Fortune 100, leveraging Roboflow's open-source and hosted machine learning tools, we are on a mission to enhance various industries—from accelerating cancer research through cell counting to improving construction site safety, digitizing floor plans, preserving coral reef ecosystems, guiding drone operations, and much more.

Our compact team is driven by a culture of collaboration, where we believe that our users' success is our success. One of our team members aptly described us as a company of

Feb 27, 2026
Eventual Computing
Full-time|On-site|San Francisco

About Eventual
At Eventual, we are reimagining how AI applications process vast amounts of data, from images to complex datasets. Traditional data platforms are not equipped to handle the petabytes of multimodal data essential for AI, causing teams to struggle with inadequate infrastructure. Founded in 2022, our mission is to simplify data querying, making it as intuitive as working with tables while ensuring scalability for production workloads.

Our open-source engine, Daft, is specifically designed for real-world AI systems. It efficiently manages external APIs and GPU clusters, and addresses failures that traditional engines cannot handle. Daft is already integral to operations at leading companies such as Amazon, Mobileye, Together AI, and CloudKitchens.

We pride ourselves on our exceptional team, which includes talents from Databricks, AWS, Nvidia, Pinecone, GitHub Copilot, Tesla, and others. We have quadrupled our team size in just a year, supported by Series A and seed funding from notable investors like Felicis, CRV, Microsoft M12, and Y Combinator. We are now eager to expand further. Join us—Eventual is just getting started.

We are seeking passionate individuals who are excited to collaborate in a close-knit team environment, working together four days a week in our San Francisco Mission district office.

Your Role:
As a Software Engineer, you will take charge of developing Eventual's core products and architecture. You’ll deliver features that our customers will use immediately and collaborate with a dedicated team that values open communication and cross-functional teamwork. Our fast-paced environment is focused on solving a variety of complex technical and product challenges. While our experienced team is here to provide guidance and mentorship, we appreciate engineers who can independently identify and tackle challenging technical issues.

Key Responsibilities:
• Design and develop highly reliable and resilient products and features.
• Collaborate closely with cross-functional product and customer-facing teams to understand requirements and deliver thoughtful solutions.
• Write high-quality, extensible, and maintainable code.
• Create and build scalable applications and components.
• Architect and manage Kubernetes clusters optimized for our needs.

Sep 22, 2025
Abridge
Full-time|On-site|SF Office

About Abridge
Abridge, established in 2018, is dedicated to enhancing the understanding of healthcare through advanced AI technology. Our platform is specifically designed for medical conversations, streamlining clinical documentation processes and allowing healthcare professionals to prioritize patient care.

Our robust technology converts patient-clinician dialogues into structured clinical notes in real-time, integrating seamlessly with electronic medical records (EMR). With our unique Linked Evidence approach and auditable AI framework, we are the sole entity that aligns AI-generated summaries with verified ground truths, fostering trust among healthcare providers. As leaders in generative AI within the healthcare sector, we are committed to setting benchmarks for the ethical implementation of AI across health systems.

Our dynamic team comprises practicing MDs, AI researchers, PhDs, creative thinkers, technologists, and engineers, all collaborating to empower individuals and enhance the healthcare experience. We have offices located in San Francisco's Mission District, New York's SoHo, and Pittsburgh's East Liberty.

The Role
As a Senior Machine Learning Infrastructure Engineer at Abridge, you will be essential in constructing and refining the core infrastructure that supports our machine learning models. Your contributions will be crucial in boosting the scalability, efficiency, and performance of our AI solutions. You will collaborate with the Infrastructure and Research teams to build, deploy, optimize, and orchestrate our AI models.

What You'll Do
• Design, deploy, and maintain scalable Kubernetes clusters for AI model training and inference.
• Develop, optimize, and maintain high-performance ML serving and training infrastructure, ensuring minimal latency.
• Work alongside ML and product teams to enhance backend infrastructure for AI-driven applications, focusing on model deployment and efficiency.
• Improve compute-intensive workflows and maximize GPU utilization for ML tasks.
• Create a robust orchestration system for model APIs.
• Partner with leadership to formulate and execute strategies for scaling infrastructure as the company expands, guaranteeing sustained efficiency and performance.

Aug 25, 2025
Julius
Full-time|On-site|San Francisco, CA

Julius operates as an applied AI lab, developing advanced coding agents for a broad user base. The platform executes about 1 million lines of code every 36 hours, serves over 1 million users, and generates more than 3 million visualizations. All code runs in tightly managed, isolated sandboxes. Julius is a revenue-generating business backed by AI Grant, YCombinator, Bessemer Venture Partners, and founders from leading technology companies.

Role overview
This mid to senior level Software Engineer - Infrastructure role focuses on designing and scaling the code-execution sandboxes that form the backbone of Julius. The infrastructure spans cloud platforms such as AWS and GCP, orchestrating over 500,000 containers each month. The main priorities are reliability, performance, and security in a multi-tenant compute environment.

What you will do
• Design and maintain secure, multi-tenant container infrastructure with rapid startup and intelligent autoscaling.
• Deploy and manage cloud resources using Helm and Terraform, including SSO, network controls, and audit logging.
• Enhance observability through metrics, traces, and logs; define SLOs and lead incident response efforts.
• Optimize container images, scheduling, networking, and costs.
• Develop and enforce fair-use and rate-limiting policies.

Requirements
• Hands-on experience with production Kubernetes and container internals (Docker or containerd), as well as strong networking skills.
• Familiarity with cloud services (AWS, GCP, or Azure) and Infrastructure as Code tools such as Terraform and Helm.
• Proficiency with monitoring and logging tools like Prometheus, Grafana, OpenTelemetry, ELK, or Vector.
• Understanding of security best practices for containerized, multi-tenant systems.

Preferred qualifications
• Experience with technologies such as gVisor, Kata, Firecracker, Cilium, eBPF, GPU scheduling, or serverless autoscaling frameworks (KEDA, Knative, Karpenter).
• Interest in AI projects, especially those involving large language models (LLMs).

Benefits and compensation
• Competitive base salary
• Substantial equity options
• Comprehensive health and dental coverage
• Gym reimbursement
• Daily team meals
• Commuter assistance

Julius offers the chance to work in San Francisco, CA, alongside a small and highly skilled team tackling large-scale infrastructure challenges. The systems here operate at significant scale and complexity, providing opportunities to solve demanding technical problems in a collaborative setting.

Apr 23, 2026
Reflection AI
Full-time|On-site|San Francisco

Reflection AI builds open weight models for a wide range of users, including individuals, businesses, and governments. The team brings together talent from organizations such as DeepMind, OpenAI, Google Brain, Meta, Character.AI, and Anthropic, all working to advance open superintelligence.

Role overview
The AI Compute and Infrastructure Counsel acts as the main legal advisor to Reflection AI’s Strategy and Operations teams on complex infrastructure initiatives. Based in San Francisco, this attorney leads negotiations and manages agreements that support the company’s growing AI infrastructure. The work spans collaborations with hardware manufacturers, cloud capacity deals, and contracts related to data centers, utilities, and new facility builds. This position is designed for a commercial lawyer with experience at the intersection of advanced AI and infrastructure. The role provides autonomy, the opportunity to establish legal frameworks for a new function, and a direct impact on the company’s AI systems.

What you will do
• Negotiate compute and cloud capacity agreements with hyperscalers, neoclouds, and new vendors, covering terms like capacity reservations, service-level commitments, portability, and exit rights.
• Manage hardware partnerships with vendors in chips, accelerators, servers, and networking.
• Oversee legal support for data center and AI facility projects, including master agreements for colocation and hosting, ground leases, build-to-suit leases, construction contracts, interconnection agreements, and power purchase agreements.
• Structure and negotiate power arrangements, such as power purchase agreements, tolling agreements, utility service contracts, behind-the-meter generation, and long-term energy deals.
• Lead legal work on strategic infrastructure transactions, including joint ventures, site acquisitions, and custom financing models for the AI factory roadmap.
• Develop scalable playbooks, templates, and delegation systems to help commercial and infrastructure teams operate efficiently and maintain high standards.
• Collaborate with Security, Privacy, and Policy teams on matters like tenant isolation, customer data handling, and sovereign compute requirements.

Apr 28, 2026
