Qualifications
Ideal candidates will possess a strong background in machine learning infrastructure, particularly with experience in distributed systems. Familiarity with GPU utilization and optimization techniques is essential. Proficiency in programming languages such as Python and experience with ML frameworks like PyTorch or JAX are required. A collaborative mindset and a commitment to improving ML processes are vital for success in this role.
About the job
Join our team at Mind Robotics as a Machine Learning Infrastructure Engineer, where you'll play a pivotal role in developing the systems that facilitate effective large-scale model training. This position is ideal for individuals who thrive in high-scale environments—overseeing distributed training, managing core ML infrastructure, and leveraging rapid iteration loops across hundreds of GPUs. If you have experience building or managing large training systems in frameworks like PyTorch or JAX and have a passion for optimizing processes such as sharding, parallelism, and performance, you'll find a welcoming environment here. Collaborate closely with researchers to minimize friction, enhance reliability, and streamline the processes for training, evaluating, and deploying models that integrate into real-world applications.
About Mind Robotics
Mind Robotics is at the forefront of innovation in artificial intelligence and machine learning. Our mission is to develop cutting-edge technologies that transform industries and improve the everyday lives of people around the globe. We foster a dynamic and inclusive workplace that values creativity, collaboration, and excellence.
At Ricursive Intelligence, we are pioneering advancements in artificial intelligence by creating self-improving systems with a focus on innovative chip design. Our mission is to revolutionize chip development, effectively bridging the gap between AI and the hardware that supports it, thereby accelerating the journey towards artificial superintelligence. We are seeking exceptional engineers who are passionate about tackling a wide range of challenges in scaling, low-level optimization, and the fundamental infrastructure necessary for large language model training and inference.
About Us
Hippocratic AI is at the forefront of generative AI in healthcare, possessing the only system capable of conducting safe, autonomous clinical conversations with patients. Our proprietary LLMs, part of the Polaris constellation, boast an impressive accuracy rate exceeding 99.9%.
Why Join Our Team
Be a part of a healthcare revolution with AI centered on safety. We are developing the first healthcare-exclusive, safety-driven LLM, a groundbreaking platform aimed at significantly improving patient outcomes globally. This is an opportunity to shape a new category in healthcare technology.
Collaborate with pioneers in the field. Co-founded by CEO Munjal Shah and a talented group of physicians, hospital leaders, AI experts, and researchers from prestigious institutions such as El Camino Health, Johns Hopkins, and Stanford, among others.
Supported by top-tier investors in healthcare and AI. We have successfully raised $126M in Series C funding, achieving a valuation of $3.5B, with total funding reaching $404M from notable participants including Avenir Growth, CapitalG, and a16z.
Work alongside industry leaders. Join a team of experts dedicated to enhancing care, advancing scientific knowledge, and developing transformative technologies that ensure our platform remains powerful and reliable.
Location Requirement
This role requires in-office presence at our Palo Alto facility five days a week to foster collaboration and a strong team culture.
Senior AI Engineer – LLM and Retrieval-Augmented Generation
Bright.AI is a rapidly growing Physical AI enterprise that is revolutionizing the way organizations engage with the physical environment through advanced intelligent automation. Our AI platform analyzes visual, spatial, and temporal data from countless real-world events captured via edge devices, mobile sensors, and cloud infrastructures to facilitate intelligent decision-making on a large scale. We are seeking a Senior AI Engineer specializing in LLM and RAG to spearhead the development of Retrieval-Augmented Generation (RAG) systems. These systems utilize large language models (LLMs) combined with real-time knowledge sources to create next-generation intelligent assistants that aid technicians and operators in diagnosing and resolving intricate issues in industrial contexts. Your role will place you at the intersection of Natural Language Processing (NLP), foundational models, and real-time information systems, designing intelligent tools that convert manuals, technician notes, and sensor data into actionable conversational guidance for real-world applications.
Full-time|$100K/yr - $200K/yr|On-site|Palo Alto, California, United States
Join the innovative team at OPPO US Research Center as a full-time AI/LLM Test Engineer. In this pivotal role, you'll assess the performance, reliability, and safety of Large Language Models (LLMs) within real-world applications while testing comprehensive generative AI solutions. Your efforts will directly enhance user experiences with AI-driven features by ensuring their robustness, accuracy, and alignment with product goals. This is a unique chance to lead the development of testing methodologies for groundbreaking AI technologies.
Additionally, we are looking for a Contractor LLM Evaluation & QA Engineer to assist in the evaluation and validation of LLM-powered applications. You will play a key role in implementing testing strategies, executing evaluation workflows, and validating model performance across various generative AI scenarios.
This contract position is perfect for individuals with hands-on experience in AI/ML evaluation, QA engineering, or data analysis who are eager to expand their knowledge of generative AI systems.
Ricursive Intelligence is at the forefront of AI innovation, dedicated to developing self-enhancing systems with a strong emphasis on chip design. Our mission is to transform chip development and create a seamless connection between artificial intelligence and the hardware that powers it, thereby accelerating the journey towards artificial superintelligence. We are on the lookout for top-tier researchers to engage in groundbreaking AI research, tackling a diverse array of challenges associated with LLM modeling, training, data management, evaluation, and beyond. As a dynamic startup, our team is highly collaborative and hands-on; researchers are empowered to design and execute large-scale experiments and to build and deploy models in a production environment.
Role Overview
Pylon is hiring an Infrastructure Engineer, Foundation, based in Palo Alto. This role focuses on designing, implementing, and maintaining infrastructure that supports the company’s core products and services. The work directly supports operational reliability and technical growth across the organization.
What You Will Do
Design and build infrastructure solutions to support Pylon’s main offerings
Maintain and improve existing systems to ensure reliability and performance
Work with teams across engineering and other functions to integrate and support infrastructure needs
Identify opportunities to optimize system performance and scalability
Who We’re Looking For
Proactive approach to problem solving and infrastructure development
Interest in building scalable systems and improving performance
Comfort working closely with cross-functional teams
Join Mashgin as an Infrastructure Software Engineer and be a part of our innovative team dedicated to enhancing the efficiency of our cutting-edge technology. You will play a critical role in designing, developing, and maintaining robust infrastructure systems that power our products and services. Your expertise will help us streamline operations, improve performance, and ensure reliability across our platforms.
Full-time|On-site|Palo Alto, California, United States
Role overview
BitGo is looking for a Senior Infrastructure Engineer in Palo Alto, California. This role focuses on building and maintaining the company's infrastructure to support reliable, high-performing services.
What you will do
Work with teams across the company to design, implement, and improve infrastructure systems
Ensure systems remain highly available and deliver strong performance
Apply cloud technologies and infrastructure as code practices to support and enhance services
What we look for
Experience with cloud platforms
Strong background in infrastructure as code
Ability to collaborate with engineers from different disciplines
Full-time|$145K/yr - $192K/yr|Hybrid|Palo Alto, California - Hybrid/Remote - United States
Senior Software Engineer, Infrastructure
About Ladder
At Ladder, we identified a significant issue in the life insurance sector: the lengthy application process, the excessive paperwork, and the numerous in-person meetings with agents. Motivated by personal loss, our CEO, Jamie, set out to simplify the process of obtaining essential coverage for families. We innovated real-time underwriting using AI, transforming the months-long life insurance application into a matter of minutes. Our user-friendly digital experience ensures instant decisions and has garnered exceptional user reviews, with over $74 billion in coverage issued.
About the Role
We are in search of a Senior Software Engineer who will enhance developer productivity within Ladder's engineering team. You will take charge of modernizing our CI/CD pipelines, build systems, and developer tools, while also contributing to the robustness of our cloud infrastructure and data platform. The ideal candidate will possess a proven track record in software engineering, demonstrate leadership qualities, and have a thorough understanding of engineering infrastructure. This position is remote, available in any of the 22 states where Ladder is hiring: AZ, CA, CO, CT, FL, GA, KS, MA, MD, MN, NC, NH, NJ, NV, NY, OH, OR, PA, TX, VA, WA, WI. Please note that Ladder is not sponsoring or transferring OPT or H1-B visas at this time.
How You’ll Make a Difference
As a senior engineer on our team, your responsibilities will extend beyond coding; you will influence our platform strategy. Your contributions will include:
Enhancing developer velocity across the engineering organization by measuring and optimizing the developer workflow, which encompasses build times, test parallelization, deployment speeds, and daily tooling.
Shaping the architecture of Ladder’s production infrastructure by evaluating design trade-offs and making impactful technical decisions, such as transitioning from custom monitoring tools to native cloud provider integrations or redefining data pipeline rebuild processes in response to upstream logic changes. You will have the insight to see the holistic view across systems and determine where to allocate engineering resources effectively.
Engaging in incident response for infrastructure issues, leading retrospectives, and ensuring actionable follow-through on resolutions.
Join our innovative team at Xai as an Infrastructure Security Engineer. In this role, you will be instrumental in safeguarding our infrastructure, ensuring that our systems are secure against evolving threats. You will collaborate with cross-functional teams to implement security measures and best practices.
About Rhoda AI
Rhoda AI is building the next generation of humanoid robotics, combining high-performance, software-defined hardware with advanced foundational and video world models. Our robots are designed as adaptable generalists, able to handle complex real-world settings and new challenges. The team works closely with leading researchers from Stanford, Berkeley, Harvard, and other institutions. With over $400M raised, Rhoda AI is investing heavily in research, hardware, and scaling up manufacturing to bring these robots to life.
Role Overview
The Cloud Infrastructure Engineer will design and manage the systems that power Rhoda AI's robotics and AI platform. This role covers infrastructure for training data collection, robot fleet maintenance, and model training and evaluation. High reliability and low latency are essential throughout. The systems built here form the backbone of product delivery.
What You Will Do
Design, build, and maintain cloud infrastructure for data pipelines, robot operations, and model training and evaluation.
Keep critical infrastructure components (databases, data warehouses, object storage) reliable, available, and fast.
Develop and manage backend services and APIs that deliver infrastructure capabilities to internal and external users.
Troubleshoot and resolve performance bottlenecks in the data and compute stack to meet strict latency and throughput goals.
Work with research teams to translate model training and evaluation needs into scalable infrastructure solutions.
Partner with robotics teams to ensure field operations have dependable, low-latency backend support.
Create observability tools, including metrics, logging, and alerting systems, to spot and address infrastructure issues early.
Set and follow best practices for infrastructure security, cost efficiency, and scalability.
Take part in on-call rotations and contribute to incident response and retrospective reviews.
What We Look For
4+ years of experience in cloud infrastructure engineering or a closely related area.
Skilled in designing and implementing scalable cloud architectures.
Hands-on experience with databases, data warehouses, and object storage systems.
Strong programming ability in Python, Java, or Go.
Excellent problem-solving skills and attention to detail.
Collaborative approach and ability to work well in a fast-moving team.
Location
Palo Alto
Full-time|$137.9K/yr - $240K/yr|On-site|Palo Alto, California, United States
Senior Software Engineer, Cloud & Infrastructure | Software Engineering
Palo Alto, CA (on-site)
At 1X, we are at the forefront of innovation, developing humanoid robots that collaborate with humans to address labor shortages and foster abundance.
In this pivotal role, you will spearhead the design and implementation of sophisticated software that bridges the physical and digital realms of our global robotic operations. From deployment tools and fleet management solutions to customer interfaces and internal operational platforms, your goal is to develop systems that can scale from hundreds to tens of thousands of robots. You will take charge of architectural decisions, construct core components, and mentor engineers across the technology stack, ensuring reliability, simplicity, and performance throughout.
Full-time|On-site|San Francisco, CA, US; Palo Alto, CA, US
About the Role
Pinterest is looking for an Engineering Manager II to guide the Infrastructure team. This group builds and maintains the systems that keep Pinterest running smoothly and reliably at scale.
What You Will Do
Lead a team of engineers focused on infrastructure projects
Shape technical strategy and direction for core systems
Work to ensure high availability and strong performance across services
Location
This role is based in San Francisco, CA or Palo Alto, CA.
Full-time|$137.9K/yr - $240K/yr|On-site|Palo Alto, California, United States
Product Security Engineer, Cloud & Infrastructure
Palo Alto, CA (on-site)
About 1X
At 1X, we are pioneering the future of work by developing humanoid robots that collaborate with humans to address labor shortages and foster abundance.
The Role
As a Product Security Engineer specializing in cloud and infrastructure, your role will be crucial in designing and maintaining secure architectures, building cloud-native security services, and safeguarding CI/CD pipelines. You will play a key part in ensuring secure communications and protecting data across 1X’s robotics platforms.
Your Responsibilities
Collaborate closely with engineering teams to develop and maintain security-critical cloud services.
Implement infrastructure-as-code security practices for consistent and secure deployments.
Design and manage identity and access systems enforcing just-in-time and least-privilege access.
Secure CI/CD pipelines against poisoning and credential theft while ensuring artifact integrity.
Create and manage cloud-native security services for device authentication, data protection, and secure communication.
Architect secure cloud networks through VPC segmentation, traffic filtering, and access control.
Configure and oversee cloud security posture management (CSPM) tools to identify misconfigurations and support incident response across environments.
About Us
Odyssey is at the forefront of artificial intelligence innovation, dedicated to developing general-purpose world models. These models represent a cutting-edge form of multimodal intelligence that paves the way for transformative applications across consumer, enterprise, and intelligence sectors. Our groundbreaking work, including advancements showcased in Odyssey-2 Pro, positions us as leaders in the next major frontier of AI.
Position Overview
We seek a passionate Infrastructure Engineer who excels in creating the systems that enable pioneering research and product development. You possess a systems-oriented mindset, are driven by performance, and thrive on converting theoretical limitations into practical, efficient solutions. Your mission will be to design and maintain an infrastructure that supports Odyssey's world models, facilitating real-time imagination, action, and interaction.
Key Responsibilities
Develop and manage a low-latency model inference platform, ensuring optimal availability, scalability, and resource efficiency for Odyssey’s world models.
Engineer and expand our core data processing infrastructure (e.g., Flyte, Ray with Kubernetes) to manage petabyte-scale datasets effectively.
Design, construct, and maintain large-scale, GPU-based training clusters for deep learning, emphasizing usability, throughput, and reliability.
Automate infrastructure provisioning and monitoring using Infrastructure as Code (IaC) principles.
Enhance performance tuning, cost management, and reliability across the technology stack.
Work collaboratively with researchers and product developers to comprehend their needs, streamline workflows, and elevate platform usability.
Full-time|$180K/yr - $440K/yr|On-site|Palo Alto, CA
About xAI
At xAI, we are driven by our mission to develop AI systems that profoundly understand the universe and assist humanity in its quest for knowledge. Our team is composed of passionate individuals who thrive on challenges and curiosity, emphasizing engineering excellence. We maintain a flat organizational structure where every member is expected to actively contribute to our mission. Leadership is earned through initiative and consistent delivery of excellence, fostering a strong work ethic and prioritization skills. Effective communication is essential, enabling team members to share insights and knowledge clearly.
About the Role
The Compute Infrastructure team at xAI is tasked with the design, construction, and management of extensive clusters and orchestration platforms that facilitate cutting-edge AI training, inference, and agent workloads at an unprecedented scale. In this role, you will redefine container orchestration beyond current systems like Kubernetes, manage exascale computing resources, optimize for high-performance training runs and production services, and work closely with research and systems teams to deliver reliable, ultra-scalable infrastructure that powers xAI's next-generation models and applications.
Responsibilities
Construct and oversee large-scale clusters to host, persist, train, and serve AI workloads with exceptional reliability and performance.
Design, develop, and enhance an in-house container orchestration platform that surpasses off-the-shelf solutions in scalability, isolation, resource efficiency, and fault-tolerance.
Collaborate with research teams to architect and optimize compute clusters tailored for extensive training runs, inference services, and real-time applications.
Profile, debug, and resolve intricate system-level performance bottlenecks, resource contention, scheduling dilemmas, and reliability issues across the entire stack.
Take ownership of end-to-end infrastructure initiatives employing first-principles design, rigorous testing, automation, and continuous optimization to meet the demands of frontier AI compute.
Full-time|On-site|Pittsburgh, PA, Palo Alto, CA, Detroit, MI
Role Overview
Latitude is looking for a Senior Software Engineer focused on Test Infrastructure. This role centers on strengthening testing frameworks to help deliver reliable software. The position is available in Pittsburgh, PA, Palo Alto, CA, or Detroit, MI.
What You Will Do
Work closely with teams across engineering, product, and QA to support development efforts.
Design, build, and maintain test infrastructure that supports software quality.
Help improve and extend frameworks used for automated and manual testing.
Role overview
Speechify seeks a Software Engineer specializing in Data Infrastructure and Acquisition at its Palo Alto, CA office. This position focuses on building and refining the data pipelines and backend systems that support Speechify’s text-to-speech products.
What you will do
Design, develop, and improve data pipelines to meet product and business requirements
Collaborate with engineering, product, and data teams to maintain reliable data flows
Contribute to systems that support data-driven decisions and ongoing product improvements
At Rhoda AI, we are pioneering the development of a comprehensive technology stack for the future of humanoid robotics. Our focus ranges from high-performance, software-defined hardware to cutting-edge foundational models and video world models that govern these systems. Our robots are engineered as versatile generalists, adept at navigating complex, real-world scenarios that extend beyond conventional training environments. Collaborating at the forefront of large-scale learning, robotics, and systems, our research team comprises distinguished experts from renowned institutions such as Stanford, Berkeley, and Harvard. With an impressive funding of over $400 million, we are committed to substantial investments in research and development, hardware innovation, and the scaling of manufacturing processes to bring our vision to life.
Position Overview
We are currently seeking a Senior ML & Data Infrastructure Engineer to take ownership of and enhance our data model training pipeline. This role encompasses the entire lifecycle, from raw data ingestion and storage to sophisticated indexing, retrieval, and throughput optimization at an unprecedented scale.
Key Responsibilities
Design, develop, and scale a robust data infrastructure capable of processing and managing billions of video clips while ensuring reliability, low latency, and cost-effectiveness.
Create and optimize large-scale storage solutions, including cloud object storage and databases, tailored for multimodal datasets.
Develop high-performance indexing and retrieval systems to facilitate rapid dataset querying, filtering, and iteration for both research and production applications.
Establish observability frameworks for data pipelines that encompass monitoring, alerting, failure recovery, and performance enhancements.
Implement intelligent workload distribution and throughput enhancements across distributed compute and storage infrastructures.
Oversee data artifacts, versioning, and lineage to guarantee reproducibility and traceability throughout training cycles.
Create user-friendly internal interfaces and lightweight tools that empower researchers and engineers to explore, query, and analyze extensive datasets efficiently.
Facilitate the integration and scalable deployment of vision-language models (VLMs) within data pipelines for purposes such as screening, enrichment, or metadata generation.
Qualifications
A minimum of 5 years of experience in data infrastructure, distributed systems, machine learning infrastructure, or a closely related field.
Proven expertise in developing and managing large-scale data pipelines and storage solutions.
Strong programming skills in languages such as Python, Java, or Scala, and proficiency with data processing frameworks.
Experience with cloud-based storage solutions and databases, as well as knowledge of multimodal data management.
Ability to work collaboratively in a fast-paced, innovative environment.
Mar 10, 2026