Infrastructure Engineer For Large Scale AI Training jobs in San Jose – Browse 641 openings on RoboApply Jobs

Infrastructure Engineer For Large Scale AI Training jobs in San Jose

Open roles matching “Infrastructure Engineer For Large Scale AI Training” with location signals for San Jose. 641 active listings on RoboApply Jobs.

Hark
Full-time|On-site|San Jose

About Hark
Hark is at the forefront of artificial intelligence, dedicated to developing advanced and personalized systems that are proactive, multimodal, and capable of seamless interaction through speech, text, vision, and persistent memory. We aim to revolutionize the interface between humans and machines by integrating advanced intelligence with next-gene…

Apr 30, 2026
OKX
Full-time|On-site|San Jose, California, United States

Join OKX as a Staff AI Engineer specializing in Model Post-Training and Alignment. In this pivotal role, you will lead initiatives that enhance the performance and alignment of AI models. You will work with cutting-edge technologies and collaborate with cross-functional teams to drive innovation in AI solutions.

Mar 18, 2026
Etched
Full-time|On-site|San Jose

About Etched
Etched is pioneering the world's first AI inference system specifically designed for transformers, offering over 10x greater performance while significantly reducing costs and latency compared to traditional options like the B200. With Etched ASICs, you can create products that were previously unattainable with GPUs, such as real-time video generation models and highly complex reasoning agents. Supported by substantial investments from leading investors and a team of top engineers, Etched is revolutionizing the infrastructure layer for one of the fastest-growing industries globally.

Job Summary
As an Infrastructure Software Engineer, you will be essential in developing state-of-the-art model-specific ASICs by constructing custom infrastructure and toolchains. This role focuses on ensuring ultra-fast, reliable, and scalable development from simulation to silicon. At Etched, we approach infrastructure development with the same best practices that we apply to our products, incorporating rigorous design discipline and high-quality standards in our testing processes.

You will spearhead the creation and adoption of next-generation infrastructure tools, empowering our ASIC, Software, and Platform engineers to accelerate iterations, increase reliability, and expand the frontiers of AI performance. Responsibilities include building and optimizing our hybrid high-performance compute (HPC) cluster for extensive parallel CI, EDA workflows, emulation, and hardware-aware job execution.

Additionally, you will design and implement an advanced observability stack featuring LLM integration, focusing on health and performance telemetry, log aggregation, distributed tracing, insight generation, synthetic testing, and intelligent alerting across CI pipelines, simulation clusters, and service endpoints.

This role demands a robust software engineering mindset, quality orientation, and a comprehensive understanding of systems. It involves not just writing scripts, but creating infrastructure code with precision, repeatability, and purpose.

Key Responsibilities
Architect and Scale Distributed Compute Systems: Design and build the orchestration layers driving our hybrid high-performance clusters, facilitating simulation, synthesis, and continuous integration of AI ASICs at an unprecedented scale.
Build Infrastructure-as-Code Systems: Develop and maintain a fully programmable infrastructure control plane to guarantee reproducibility, auditability, and swift iteration throughout the entire stack.

Jan 20, 2026
ChipStack
Full-time|On-site|San Jose

About Us
At ChipStack, we stand at the forefront of the technological revolution, aiming to transform the design of silicon chips which are pivotal in today's tech-centric landscape. While their complexity has surged due to rising performance requirements from applications such as AI, the methods used for their design have remained stagnant for decades. We are determined to innovate and lead this change.

Our team is dynamic, highly skilled, and operates with agility. We are comprised of experts with experience at leading technology companies including Qualcomm, Nvidia, Google, Meta, and the Allen Institute for AI. Supported by premier investors such as Khosla Ventures, Cerberus, and Clear Ventures, we have already partnered with over 10 pioneering clients, ranging from Fortune 100 giants to ground-breaking AI silicon startups.

About This Role
This position offers an exceptional opportunity to join the founding team at ChipStack, where we are redefining the approach to modern silicon chip design. Collaborating with seasoned chip designers, machine learning scientists who have successfully trained large language models (LLMs) at scale, and top-tier infrastructure and software engineers, you will leverage your expertise in ML and data infrastructure to tackle some of the most challenging problems in chip design.

About You
You thrive in a startup environment, drawn by the energy and dynamism it provides. You are committed to delivering outstanding customer experiences, willing to go the extra mile to ensure satisfaction. Self-motivated and driven, you possess a strong sense of urgency and the ability to work independently with minimal guidance. You welcome complex problems and relish the opportunity to explore uncharted territories.

This Role
We seek a skilled and experienced ML Infrastructure Engineer to join our founding team. The ideal candidate will have a solid background in designing and scaling ML infrastructure and training pipelines. Your primary responsibility will be to construct the foundational infrastructure that supports the training, fine-tuning, evaluation, and deployment of LLMs in both cloud and on-premise environments. Your contributions will significantly enhance our product capabilities and accelerate our iteration processes.

Jun 22, 2025
Hark
Full-time|On-site|San Jose

About Hark
Hark is at the forefront of artificial intelligence, dedicated to creating sophisticated, personalized intelligence systems. Our technology is proactive, multimodal, and designed to interact seamlessly with the world through speech, text, vision, and persistent memory. We are revolutionizing the AI landscape by integrating this intelligence with cutting-edge hardware, establishing a universal interface between humans and machines. Unlike traditional AI, which relies on outdated chat interfaces and devices, Hark is pioneering the future: agentic systems that communicate naturally with individuals and their environments. To achieve this vision, we are collaboratively developing multimodal models alongside next-generation AI hardware, engineered from inception as a cohesive interface for the new era of intelligent systems.

About the Role
We are seeking a Mid-Level Technical Staff Member specializing in AI training to spearhead the development of innovative training strategies that effectively connect pre-training and post-training phases. This role is crucial in shaping how our models gain advanced reasoning, planning, and tool-utilization skills at scale. You will be instrumental in the core of model capability development, determining how data, algorithms, and systems converge to unlock the next level of agent behavior.

Responsibilities
Design and implement mid-training strategies to enhance agent capabilities such as reasoning, planning, tool usage, and long-term decision-making.
Scale synthetic data generation pipelines (e.g., coding, agent trajectories, multimodal data) and optimize data mixtures to boost downstream reinforcement learning performance.
Construct and optimize distributed training pipelines for large models, ensuring efficiency, stability, and scalability across GPU clusters.
Develop and refine evaluation frameworks to assess model capabilities (e.g., task success, reasoning quality, tool usage accuracy) and guide training refinements.
Conduct comprehensive experimentation and ablations to clarify training dynamics, scaling behavior, and identify bottlenecks.
Collaborate across functions with pre-training, post-training, and product teams to synchronize model development with practical agent applications.
Lead technical innovations in areas such as long-context learning, data distillation, and training efficiency, contributing to the overall model roadmap.

Apr 30, 2026
Etched
Internship|On-site|San Jose

About Etched
At Etched, we are pioneering the development of the world’s first AI inference system specifically designed for transformers, achieving over ten times higher performance along with significantly reduced costs and latencies compared to traditional B200 systems. Our innovative ASIC technology empowers the creation of groundbreaking products, including real-time video generation models and advanced reasoning agents capable of deep and parallel chain-of-thought processes. Supported by substantial investments from top-tier VCs and a team of expert engineers, Etched is transforming the infrastructure landscape of the fastest-growing industry in history.

Job Summary
As an Infrastructure Intern, you will play a crucial role in the evolution and implementation of next-generation infrastructure tools. Your contributions will enable our ASIC, software, and platform engineers to innovate more swiftly, enhance reliability, and expand the frontiers of AI performance. Your responsibilities will include working on hybrid-cloud high-performance computing (HPC) clusters, executing massively parallel continuous integration processes, implementing Infrastructure-as-Code, developing a scalable observability platform integrated with large language models, and creating high-quality tools that engineers will enjoy using.

Feb 7, 2026
Etched
Full-time|On-site|San Jose

About Etched
Etched builds AI inference systems designed specifically for transformer models. The company’s ASICs offer over 10x higher performance with lower costs and latency than traditional solutions such as the B200. These advances support real-time video generation and complex reasoning agents. Etched is backed by major investors and a team of experienced engineers, working to reshape infrastructure for one of the fastest-growing technology sectors.

Role Overview
The Technical Program Manager for Infrastructure leads programs from initial planning through to delivery. This role coordinates efforts across internal teams, manages vendor relationships, and keeps all dependencies, risks, and milestones visible to stakeholders. Setting key performance indicators and establishing operational rhythms are central parts of the job, along with ongoing process improvements that help the Infrastructure organization work more efficiently.

What You Will Do
Drive cross-functional programs from start to finish, ensuring alignment across engineering, leadership, and external partners.
Translate complex technical concepts into clear updates for stakeholders at all levels.
Identify risks early and facilitate solutions to challenges during ASIC design, hardware integration, and software deployment.
Establish and track KPIs to measure progress and maintain program momentum.
Adapt plans and guide teams through shifting priorities without losing pace.
Develop and refine program management frameworks to support high-performing engineering teams.

What We Look For
Strong technical background, with the ability to understand and communicate about complex infrastructure projects.
Exceptional organizational and communication skills.
Experience coordinating across multiple teams and external vendors.
Proactive approach to identifying and addressing risks.
Comfort working in an environment where priorities can change quickly.

Location
This role is based in San Jose.

Apr 17, 2026
Efficient Computer
Full-time|$160K/yr - $220K/yr|On-site|San Jose, CA OR Pittsburgh, PA OR Austin, TX

Efficient Computer is pioneering the development of the world's most energy-efficient general-purpose computer processor. Our innovative, patented technology consumes 100 times less energy than the leading ultra-low-power processors available on the market. With the capability to be programmed using standard high-level programming languages and AI/ML frameworks, our groundbreaking efficiency allows IoT devices to operate AI/ML continuously on a single AA battery for 5-10 years. This exceptional performance empowers devices to intelligently gather and curate first-party data, driving a new computing revolution.

We are currently looking for a CAD Hardware-Software Infrastructure Engineer to take ownership of and expand our hardware design and software build/CI infrastructure. This hybrid role merges CAD infrastructure management with DevOps responsibilities. You will ensure the smooth operation of software systems for our compiler and runtime teams while overseeing the hardware toolchain essential for ASIC development, including PDK installations, EDA licensing, and third-party IP integration.

If you thrive on enhancing engineer productivity, enjoy diagnosing and solving intricate build issues, and are not deterred by the occasional 2 AM encounter with a vendor license server, then this opportunity is perfect for you. Join us in shaping the future of computing at the edge and beyond!

Feb 24, 2026
Western Digital Corporation
Full-time|On-site|San Jose

Western Digital Corporation seeks a Senior Sales Representative in San Jose to focus on AI infrastructure and high-performance computing (HPC). This role centers on expanding the customer base and supporting organizations that depend on advanced technology solutions.

Key responsibilities
Drive sales efforts for AI infrastructure and HPC products
Build and sustain relationships with clients and partners
Contribute to Western Digital’s commitment to delivering advanced technology solutions

Location
This position is based in San Jose.

Apr 28, 2026
Roku, Inc.
Full-time|$280K/yr - $380K/yr|On-site|San Jose, California

Collaboration Fuels Innovation. Join Roku in Revolutionizing Television.

As the leading TV streaming platform across the U.S., Canada, and Mexico, Roku is on a mission to enhance every television experience worldwide. We pioneered the streaming revolution, connecting viewers with the content they adore, empowering publishers to engage vast audiences, and offering advertisers unique opportunities to reach consumers effectively.

From day one at Roku, you will play a crucial role in our success. We're a rapidly growing public company where every individual is empowered to contribute. Here, you will have the chance to delight millions of TV streamers globally while gaining invaluable experience across various disciplines.

About the Team
Our DevOps/SRE team operates a robust, multi-cloud platform across AWS and GCP, ensuring that our mission-critical systems remain highly available, secure, and optimized for performance at internet scale. We prioritize reliability and automation, engineering systems that excel under pressure and evolve continuously. Our engineers take ownership of outcomes from start to finish, managing priorities, maintaining clear communication with both technical and non-technical stakeholders, and delivering impactful results organization-wide. If you thrive on architecting solutions at scale, automating processes, and converting complex infrastructure into reliable systems, you'll find a welcoming home here.

About the Role
We are on the lookout for an accomplished DevOps/SRE (Site Reliability Engineering) Senior Software Engineer to strengthen our dynamic team. The ideal candidate will possess a solid background in DevOps practices, cloud infrastructure management, and automation, along with strong team leadership capabilities. If you have a proven history of designing and constructing large-scale systems, relish tackling challenging system issues at internet scale, and have a blend of learning, organizing, and building skills, this position could be an exceptional fit for you!

For California Only: The estimated annual salary for this position ranges from $280,000 to $380,000. Compensation packages vary based on individual candidate factors, including expertise and experience.

Mar 5, 2026
WeRide
Full-time|On-site|San Jose, CA

Founded in 2017, WeRide (NASDAQ: WRD) stands at the forefront of the autonomous driving revolution, developing cutting-edge technologies across Levels 2 to 4. As the only technology firm globally with driverless permits in China, the UAE, Singapore, and the United States, WeRide conducts extensive R&D, testing, and operations in over 30 cities across 10 countries. With a proven track record, WeRide has maintained a self-driving fleet for more than 2,200 days.

WeRide.ai is seeking a talented Senior Software Engineer to join our team in building robust and efficient cloud infrastructure platforms for autonomous driving. This includes developing PaaS solutions such as Big Data, AI, Simulation, and IaaS platforms including Kubernetes and storage services.

Aug 2, 2023
Archer
Full-time|$170K/yr - $215K/yr|On-site|San Jose, California, United States

Archer is a pioneering aerospace company located in San Jose, California, dedicated to developing an innovative all-electric vertical takeoff and landing aircraft. Our mission is to enhance the advantages of sustainable air mobility through cutting-edge design, manufacturing, and operation of aircraft that can transport four passengers with minimal noise impact.

We aspire for greatness, tackling challenging issues head-on, and firmly believe that a diverse workforce enhances our intelligence, drives superior insights, and ultimately leads us all towards success. We are committed to fostering an equitable and inclusive workplace that values our differences and supports every team member.

What you’ll do:
Design, develop, and sustain Continuous Integration and Continuous Deployment (CI/CD) workflows and automated testing pipelines for the GNC and Vehicle Simulation teams using tools such as GitHub Actions and TeamCity.
Assist in creating and maintaining automated toolchains for generating, compiling, and deploying Simulink and MATLAB code across various environments, including desktop simulations, Hardware-in-the-Loop (HIL) setups, and flight computers for both subscale and full-scale vehicles.
Architect new, and oversee existing, High-Performance Computing (HPC) infrastructure to facilitate extensive Monte Carlo simulations, batch regression testing, and parallelized simulations and evaluations.
Oversee key software interfaces between the "Virtual Vehicle" (Simulink models) and downstream platforms, including managing Interface Control Documents (ICDs), data dictionaries, and signal mapping to embedded middleware.
Create C/C++ wrapper functions, S-functions, and harnesses to ensure smooth integration of compiled code and external libraries within the MATLAB/Simulink framework.
Act as a cross-functional engineering multiplier: Utilize your expertise in aerospace principles to contribute directly to vehicle simulation modeling, physics tooling, or GNC integration tasks as required.
Serve as the primary technical liaison between the Flight Dynamics & Control (FD&C) team and the Platform embedded software team, ensuring the accurate translation of complex algorithms into real-time execution.

Feb 27, 2026
Spec
Full-time|$160K/yr - $180K/yr|On-site|San Jose, CA

Be a part of a pioneering team that's transforming how major brands combat online fraud. Our innovative technology proactively identifies cyber threats by analyzing live internet traffic, helping online ticket vendors, retailers, and marketplaces detect fraudulent users and bots. Spec is on the lookout for a Senior Software Engineer - Cloud Infrastructure to help us build the cloud platforms and infrastructure that are changing the landscape of online fraud prevention.

As a member of our DevOps Engineering team, you will play a crucial role in shaping the strategy, architecture, implementation, and operation of Spec's platform. You will collaborate closely with product development and customer success teams to design the tools, systems, and processes that empower us to safeguard the largest enterprises on the internet. Join a tight-knit team of professionals to align on strategic vision and execute tactical plans for our groundbreaking technology platform.

Mar 31, 2025
OKX
Full-time|On-site|San Jose, California, United States

Join OKX as a Principal Engineer specializing in Agent Infrastructure and Memory Architecture. In this pivotal role, you will lead cutting-edge projects aimed at enhancing our infrastructure and optimizing memory architecture, ensuring the best performance for our operations. Collaborate with cross-functional teams to innovate and implement solutions that drive efficiency and scalability.

Mar 16, 2026
Archer
Full-time|$144K/yr - $175K/yr|On-site|San Jose, California, United States

Archer, headquartered in San Jose, California, develops all-electric vertical takeoff and landing aircraft designed for sustainable air mobility. The company’s aircraft are built to carry four passengers and operate with minimal noise. Archer encourages ambitious problem-solving and values a diverse, inclusive workplace where every team member is supported.

Role overview
The Senior Platform Engineer joins the platform engineering group and plays a key role in accelerating product development and streamlining operations. This position focuses on integrating development, infrastructure, and AI/ML projects. Major responsibilities involve automation, creating self-service infrastructure, and improving workflow efficiency.

What you will do
Integrate development, infrastructure, and AI/ML initiatives to support product and operational goals
Automate processes and build tools for self-service infrastructure
Identify and optimize workflows to increase efficiency across teams

Requirements
Extensive engineering experience, especially in platform or infrastructure roles
Strong interest in supporting colleagues and solving complex, cross-disciplinary challenges

Apr 22, 2026
OKX
Full-time|On-site|San Jose, California, United States

Join OKX as a Principal AI Engineer specializing in AI Agent Development. In this pivotal role, you will spearhead the design and implementation of cutting-edge AI agents that drive automation and enhance user experience. Collaborating with cross-functional teams, you will leverage your expertise in artificial intelligence to innovate and propel our technology forward.

Mar 16, 2026
Software Mind
Full-time|On-site|San Jose

Join our innovative team at Software Mind as a Senior Fullstack AI Engineer. In this pivotal role, you will leverage your expertise in both front-end and back-end development to create cutting-edge AI solutions. Collaborating with a talented group of engineers and data scientists, you will drive the development of scalable applications that harness the power of artificial intelligence.

Apr 30, 2026
Lumilens
Full-time|On-site|San Jose

About Lumilens
At Lumilens, we are pioneering the vital photonics infrastructure that powers the future of AI supercomputing. From chip-to-chip optical interconnects to scalable photonic engines, we are ushering in a new era of computing that is faster, cooler, and significantly more efficient.

As a well-funded startup supported by Mayfield, we are led by industry veterans who have successfully built and scaled transformative technologies. At Lumilens, we are focused on developing high-speed photonics products specifically designed to enhance the future of AI infrastructure and high-performance computing.

This is not just incremental innovation; it represents a ground-floor opportunity to fundamentally rethink the optical layer from the silicon level. You will collaborate with a team of world-class engineers tackling some of the most challenging problems in scaling optical systems. Every line of code, every design decision, and every breakthrough you contribute will help shape the infrastructure of tomorrow.

If you are seeking a mission-driven environment, rapid momentum, and the opportunity to make a significant impact, join us on this exciting journey. We’re just getting started.

About the Role – Optical Product Test Engineer
In your role as an Optical Product Test Engineer, you will be essential in validating and ensuring the quality of Lumilens’ photonic components, modules, and systems. This position merges hands-on production testing with the development of testing protocols and reporting, supporting both new product introductions (NPI) and volume manufacturing.

You will closely collaborate with optical design, hardware, and manufacturing teams to perform calibration, characterization, and troubleshooting of next-generation photonics technologies that are set to drive the future of AI and high-performance computing infrastructure.

Primary Duties & Responsibilities
Characterize and test fiber optic components and modules
Analyze test results to ensure adherence to performance specifications and quality standards
Maintain precise and detailed records of testing procedures and outcomes
Generate detailed test reports for both internal and external stakeholders
Assist in troubleshooting and root cause analysis to resolve product issues
Design, plan, and conduct optical and fiber testing on components, devices, and systems
Collaborate with cross-functional engineering teams to support NPI and the ramp-up of manufacturing

Oct 2, 2025
Etched
Full-time|On-site|San Jose

About Etched
Etched is pioneering the first AI inference system specifically designed for transformers, offering over 10x greater performance and significantly reduced cost and latency compared to traditional solutions like the B200. Our advanced ASICs empower developers to create groundbreaking products that were previously unattainable with GPUs, including real-time video generation models and highly complex reasoning agents. Supported by substantial investment from prestigious investors and a team of top-tier engineers, Etched is reshaping the infrastructure landscape for the most rapidly evolving industry in history.

Job Summary
Join our Core Engineering team, where you will play a critical role in developing the products that will drive a significant portion of global AI inference. Collaborating closely with our founders, you will swiftly identify key organizational challenges and implement innovative solutions, often achieving results in days rather than weeks. Your responsibilities will be diverse; one week, you may create internal tools to support a hardware team, and the next, you could be prototyping a deployment for a client or establishing a new workflow.

This position is ideal for engineers who thrive in dynamic environments, prioritize speed without compromising quality, and derive motivation from solving urgent challenges. If you are eager to develop and deploy production-level code regularly, this role is a perfect fit for you.

The role demands exceptional full-stack engineering skills, a strong product sense, and a proactive approach to building solutions from the ground up. At Etched, we are in the process of scaling our operations, and we are searching for engineers ready to tackle our most formidable challenges. The issues we address are urgent and critical to our business. Be prepared for a role that goes beyond the traditional 9-5; the perfect candidate will find excitement in this fast-paced environment.

Jul 26, 2025
Tessera Labs
Full-time|Remote|San Jose Office

Location: San Jose, CA or New York City
Remote: Considered; travel required

About Tessera Labs: Tessera Labs is at the forefront of transforming how enterprises implement and leverage Artificial Intelligence. Supported by Foundation Capital and driven by a top-tier founding team, we develop multi-agent AI systems that streamline complex business workflows across major platforms such as SAP, Salesforce, Workday, Snowflake, MuleSoft, and beyond.

Our Mission: Our goal is to deliver genuine AI automation to enterprises with speed, precision, and measurable outcomes. We prioritize agility, take ownership, and innovate at the cutting edge of applied AI.

Why This Role is Essential: This position empowers Forward Deployment Engineers (FDEs) to facilitate swift and secure AI-driven ERP modernization. You will play a vital role in enhancing migration speed, ensuring operational continuity, and enabling data-informed decision-making. Your work will lay the groundwork for enterprise-scale AI and analytical solutions in complex environments, positioning you at the forefront of enterprise AI and ERP transformation.

Role Overview: As a Data Engineer, you will collaborate closely with Forward Deployment Engineers (FDEs) to drive rapid ERP modernization and AI transformation for our enterprise clients. Your primary focus will be on data harmonization, cross-system integration, and developing data pipelines, ensuring that AI solutions and enterprise workflows are supported by high-quality, reliable, and well-structured data.

This role requires expertise in ETL processes, relational schema modeling, data mapping, data cleaning, and pipeline logic for structured/tabular data. A lightweight MLOps component may be involved, focusing on structured datasets and potentially requiring distributed processing with PySpark or ML data engineering techniques. Note that there are no downstream responsibilities concerning model training, serving, or deployment.

Candidates should bring a deeper understanding of ERP-centric data compared to typical ML data engineering roles, along with robust generalist engineering skills to construct scalable, production-grade pipelines. Ideal candidates will have expertise in SAP data coupled with modern data engineering or machine learning enablement experience; strengths in one area with the willingness to learn the other are acceptable.

Key Responsibilities:
Data Harmonization: Integrate, reconcile, and standardize structured data across ERP, CRM, finance, and analytical systems.

Dec 15, 2025
