What You Will Do
- Design, develop, and enhance the Browserbase Core platform, creating robust and scalable distributed backends with user-friendly APIs.
- Collaborate closely with the Engineering team to gather insights and provide exceptional support, enabling everyone to build on Core with confidence.
- Define, scope, and assess key projects; prioritize the roadmap; and sequence deliverables to drive platform advancement.
- Establish and promote best practices for development, operations, and reliability.
- Continuously improve the platform to accommodate the growing customer base and demand.
- Investigate, troubleshoot, and resolve operational challenges that occur in production.
- Document your processes and share knowledge with the team.

Required Technical Skills
- Extensive experience in building and scaling distributed systems, managing hundreds or thousands of instances.
- Strong proficiency in Go or TypeScript; familiarity with Firecracker or similar Virtual Machine Monitors (VMMs) is a plus.
- Knowledge of CI/CD pipelines, Kubernetes, Docker, message queues, relational databases, automated testing, performance optimization, and zero-downtime multi-region deployments.
- A systems-thinking mindset, understanding how infrastructure decisions impact customer experience.
About the job
At Browserbase, we revolutionize web browsing for AI agents and applications. Our innovative headless browser infrastructure automates interactions with websites, simplifies form filling, and replicates user actions seamlessly.
Having successfully raised a $40M Series B last year, we are on an accelerated growth trajectory. Supported by esteemed investors such as Kleiner Perkins, CRV, and Notable Capital, our dynamic team is committed to realizing our CEO's vision for empowering the best AI tools and transforming web automation.
Our Core Infrastructure team is essential for maintaining the efficiency of our operations. This group tackles significant distributed systems challenges, ensuring our platform's speed, reliability, and scalability.
About Browserbase
Browserbase is at the forefront of web automation technology, providing innovative solutions for AI-driven applications. With substantial backing from prominent investors and a clear vision for the future, we are dedicated to making web interactions seamless and efficient.
About Lumafield: Established in 2019, Lumafield has pioneered the development of the world's first accessible X-Ray CT scanner specifically designed for engineers. Our intuitive scanner, combined with cloud-based software, empowers engineers to gain unparalleled insights into their projects at a remarkably affordable cost. Engineers face high-stakes decision…
At Gimlet Labs, we are pioneering the first heterogeneous neocloud tailored for AI workloads. As AI technology evolves, the industry confronts critical limitations in power, capacity, and cost linked to the traditional homogeneous, vertically integrated infrastructure. Gimlet addresses these challenges by decoupling AI workloads from the fundamental hardware, intelligently partitioning them into components and orchestrating each to the hardware that best meets its performance and efficiency needs. This innovative approach facilitates heterogeneous systems across diverse vendors and generations of hardware, including the latest emerging accelerators, resulting in significant improvements in performance and cost efficiency at scale.

Building upon this platform, Gimlet is developing a production-grade neocloud for agentic workloads. Our customers can deploy and manage their workloads through stable, production-ready APIs without the complexities of hardware selection, placement, or low-level performance optimization.

Gimlet collaborates with foundational labs, hyperscalers, and AI-native companies to enable real production workloads designed to scale to gigawatt-class AI datacenters.

We are currently in search of a Technical Staff Member specializing in distributed systems. In this role, you will be instrumental in developing the core platform responsible for scheduling, routing, and managing AI workloads reliably at production scale. You will engage with systems that coordinate execution across thousands of nodes, provide stable production APIs, and guarantee predictable workload performance under real-world conditions of load and failure.

This position is ideal for engineers passionate about building foundational infrastructure, grasping end-to-end systems, and operating at scale.
About Us:
Aurelius Systems is a venture capital-backed startup at the forefront of defense technology, specializing in the development of autonomous, edge-deployed robotic systems utilizing directed energy for counter-unmanned aerial systems (UAS). Our innovative approach involves creating laser systems designed to neutralize drones.

With a dedicated team of approximately 10 engineers, former U.S. military personnel, and industry experts, we are committed to advancing America's capabilities in directed energy technology, delivering the first cost-effective and reliable laser weapon systems.

Inspired by the philosophy of Marcus Aurelius, we emphasize consistent effort and accountability in our work, embodying a culture of high output without excuses. Following in the footsteps of pioneers like Henry Ford, we embrace innovation and action within our small but impactful team.

In addition to our San Francisco headquarters, we are proud to operate a manufacturing hub in Detroit and conduct field tests weekly on our expansive private range. If you thrive on seeing your engineering contributions directly in action rather than being confined to a lab, we encourage you to explore this opportunity.

The Position & Your Contribution:
As a Robotics Software Systems Engineer, your primary responsibility will be to ensure that all subsystems function seamlessly and efficiently together. Our system comprises a complex array of subsystems including sensing, computer vision, machine learning inference, control systems, power management, and mechanical actuation. Achieving minimal processing time and inter-process latency is crucial for successfully targeting our nimble and evasive UAS.

The key area we are looking to fill is real-time systems performance at the hardware interface.
You should possess a deep understanding of how software execution impacts physical system behavior, how latency accumulates across CPU, GPU, memory, and I/O, and how bandwidth limitations influence sensor data processing. We need an engineer who is detail-oriented, considering microseconds, memory bandwidth, cache behavior, and system determinism.

In our tight-knit team of around 10 engineers, you will have the opportunity to take ownership of systems that are field-tested. The success of our tests is binary: either the system is effective or it isn't, and your role will involve iterative improvement based on real-world outcomes.

Your Responsibilities:
- Manage the latency budget for the entire platform, from data sensing to actuation.
- Profile and mitigate latency across CPU, GPU, memory, and I/O interfaces.
- Develop and optimize kernels for high-throughput, low-latency operations.
- Adjust memory access patterns for optimal performance.
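The latency-budget responsibility above can be made concrete with a small sketch. The stage names and budget numbers below are illustrative assumptions for a sense-to-actuate pipeline, not details from the posting:

```python
import time

# Hypothetical per-stage latency budget, in microseconds, for a
# sense -> detect -> track -> actuate pipeline. Names and numbers
# are invented for illustration.
BUDGET_US = {"sense": 500, "detect": 2000, "track": 800, "actuate": 300}

def profile(stage_fns: dict) -> dict:
    """Time each stage with a monotonic clock; return microseconds."""
    out = {}
    for name, fn in stage_fns.items():
        t0 = time.perf_counter()
        fn()
        out[name] = (time.perf_counter() - t0) * 1e6
    return out

def check_budget(measured_us: dict) -> list:
    """Return the stages whose measured latency exceeds the budget."""
    return [s for s, us in measured_us.items() if us > BUDGET_US[s]]
```

In practice this bookkeeping would sit on top of hardware counters and tracing rather than wall-clock timing, but the discipline is the same: every stage owns a slice of the end-to-end budget, and regressions are flagged per stage.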
About Our Team
Join our dynamic Core Services team, where we play a pivotal role in developing and overseeing essential services that underpin our operations. We connect core infrastructure, such as compute, storage, and networking, with product engineering teams, empowering them to innovate quickly, build with dependability, and scale effectively.

Role Overview
As a Software Engineer on our Core Services team, you will be instrumental in designing and managing vital backend platforms, including caching systems, workflow orchestration, metadata storage, and file services. Your focus will be on crafting highly reliable, scalable, and efficient systems that serve as the foundation of our product offerings.

We seek individuals who are passionate about developing infrastructure that empowers product teams, relish tackling challenges in distributed systems, and take pride in creating well-architected APIs and abstractions that streamline development processes.

This position is based in San Francisco, CA, utilizing a hybrid work model of three days in the office each week, with relocation assistance provided to new hires.

Your Responsibilities Will Include:
- Designing, building, and maintaining shared infrastructure services, including caching layers, workflow orchestration (e.g., Temporal), metadata stores, and file storage solutions.
- Collaborating with product teams to deliver scalable and reliable primitives that simplify the complexities of distributed systems.
- Enhancing the performance, resilience, and scalability of core services that underpin our customer-facing applications.

You Will Excel in This Role If You:
- Possess experience with distributed systems, caching frameworks (e.g., Redis, Memcached), metadata storage solutions (e.g., FoundationDB), or workflow orchestration tools (e.g., Temporal, Cadence).
- Have experience managing containerized services within cloud environments and integrating them into automated CI/CD workflows.
- Comprehend the trade-offs involved in consistency models, replication strategies, and performance optimization for multi-region systems.
- Exhibit strong communication skills and thrive in collaborative environments, with a fervent commitment to delivering exceptional customer outcomes.

About OpenAI
OpenAI is a pioneering organization in AI research and deployment, dedicated to ensuring that artificial intelligence serves the greater good of humanity. We strive to make transformative technologies accessible and beneficial for all.
Full-time|On-site|San Francisco, CA; Sunnyvale, CA; Seattle, WA
Join DoorDash as a Staff Software Engineer specializing in Data Engineering, where you will play a critical role in designing and implementing data solutions that drive business insights and enhance operational efficiency. You will collaborate with cross-functional teams to create robust data pipelines and leverage cutting-edge technology to manage large-scale datasets.
Full-time|$216.2K/yr - $270.3K/yr|On-site|San Francisco, CA; New York, NY
Join Scale AI's innovative team as an Infrastructure Software Engineer for our Enterprise Generative AI Platform (SGP). In this dynamic role, you will help design and enhance our enterprise-grade AI platform, which offers robust APIs for knowledge retrieval, inference, evaluation, and more. We're seeking an exceptional engineer who thrives in fast-paced environments and is eager to contribute to the scaling of our core infrastructure. The ideal candidate will possess a solid foundation in software engineering principles and extensive experience with large-scale distributed systems. Your role will involve implementing solutions across various cloud providers (GCP, Azure, AWS) for clients in highly regulated sectors, including healthcare, telecommunications, finance, and retail.
Zyphra is a cutting-edge artificial intelligence firm located in the heart of San Francisco, California, dedicated to advancing technology across various modalities.

About the Position:
We are seeking a Data Engineer - Multimodal Systems to play a pivotal role in the enhancement and expansion of Zyphra's datasets and data pipelines. This position offers a unique opportunity to collaborate with diverse teams and contribute to innovative data solutions. You will engage in the collection of extensive datasets and the development and optimization of high-performance parallel data pipelines.

Your Responsibilities Will Include:
- Executing large-scale data collection across multiple modalities, including text, audio, and image.
- Designing and implementing highly efficient, parallelized data processing pipelines that integrate various modalities.
- Conducting rigorous experimental ablations to evaluate the effectiveness of new data enhancements.

Candidate Requirements:
- Proven ability in implementation and prototyping.
- Capability to transform ideas into experimental frameworks swiftly.
- Strong collaborative skills, thriving in a dynamic research environment.
- Eagerness to learn and apply new concepts effectively.
- Exceptional communication and teamwork skills, capable of contributing to both research and large-scale engineering projects.

Preferred Qualifications:
- Experience in the collection, management, and processing of large datasets.
- Familiarity with parallel programming frameworks in Python, such as Dask.
- In-depth understanding of state-of-the-art dataset curation practices.
- A detail-oriented mindset with a passion for data integrity and verification.
- Strong foundation in experimental methodologies for conducting thorough ablation studies and hypothesis testing.
- Knowledge of, and interest in, large-scale, highly parallel data processing systems.
- Proficiency in PyTorch and Python.
- Experience with large, complex codebases and the ability to quickly become productive within them.
- Published research in respected machine learning venues.
- A postgraduate degree in a relevant field is a plus.
Full-time|$130.6K/yr - $235K/yr|On-site|San Francisco, CA; Sunnyvale, CA
About Our Team
At DoorDash, data drives our success. Our Data Engineering team is pivotal in building robust database solutions tailored for diverse applications, including reporting, product analytics, marketing optimization, and financial reporting. By architecting pipelines, data structures, and data warehouse environments, we enable data-driven decision-making across the organization.

About the Role
We are seeking a talented Software Engineer II to join our team as a technical leader, responsible for scaling our data infrastructure, enhancing automation, and developing tools to support our expanding business needs.

What You Will Do
- Collaborate with business partners and stakeholders to gather and understand data requirements.
- Work alongside engineering, product teams, and external partners to ensure seamless data collection.
- Design, develop, and implement high-performance data models and pipelines for our Data Lake and Data Warehouse.
- Establish and execute data quality checks, conduct thorough QA, and implement monitoring routines.
- Enhance the reliability and scalability of our ETL processes.
- Manage a suite of data products that deliver accurate and trustworthy data.
- Support and onboard new engineers as they join our team.

What We Are Looking For
- 3+ years of professional experience in data engineering, business intelligence, or a related field.
- Proficiency in programming languages such as Python and Java.
- 3+ years of experience with ETL orchestration and workflow management tools, including Airflow, Flink, Oozie, and Azkaban, using AWS/GCP platforms.
- Strong understanding of database fundamentals, SQL, and distributed computing.
- 3+ years of experience with distributed data ecosystems (e.g., Spark, Hive, Druid, Presto) and streaming technologies like Kafka and Flink.
- Experience with Snowflake, Redshift, PostgreSQL, and/or other database management systems.
- Excellent communication skills with a proven ability to liaise with both technical and non-technical teams.
- Familiarity with reporting tools such as Tableau, Superset, and Looker.
- Ability to thrive in a fast-paced and dynamic environment.
Location: San Francisco, CA (Hybrid: 4 days onsite/week). Relocation assistance available.

About Our Team:
At OpenAI, we are at the forefront of technology, creating foundational platform software that ensures our consumer products are reliable, secure, and high-performing. Our team collaborates across various system layers, working closely with engineering partners to deliver exceptional capabilities from initial concept to final launch.

Role Overview:
We are looking for a passionate Systems Software Engineer to lead the design, implementation, and debugging of critical platform components and the pipelines that build and update system images. Your focus will span operating system layers, emphasizing performance optimization, security enhancements, and in-depth system debugging to deliver production-grade systems that exceed expectations.

Key Responsibilities:
- Design and develop robust system-level components and services within both kernel and user spaces.
- Configure and maintain essential OS platform services (init, services, networking, security policies) and related tools.
- Build and manage image and update pipelines, ensuring their reliability, reproducibility, and rollback safety.
- Instrument system performance through profiling and tracing; enhance CPU, memory, I/O, and energy efficiency.
- Oversee platform observability and reliability, including logging, crash capture, watchdogs, and diagnostics.
- Collaborate with cross-functional teams to define interfaces and deliver comprehensive end-to-end features.
- Establish and promote strong engineering practices such as code reviews, continuous integration, reproducible builds, and effective release management.
- Work alongside external vendors to support builds and deployments.

You Will Excel in This Role If You:
- Have successfully launched production systems software on modern operating systems.
- Possess proficiency in C/C++ and a scripting language, with a strong understanding of OS internals including concurrency, memory management, filesystems, networking, and power management.
- Demonstrate exceptional systems debugging skills utilizing debuggers, tracers, profilers, and logs across kernel/user-space boundaries.
- Comprehend the configuration of platform services and interfaces, effectively translating requirements into stable, well-documented APIs.
- Are knowledgeable about user-space foundations including service management, IPC, networking, packaging, and automation.
- Have experience collaborating with external partners to deliver high-quality software solutions.
About Condor
At Condor, we are transforming the financial infrastructure of clinical development. While substantial investments are made annually to discover and develop new therapies, the processes behind these advancements often remain outdated and disconnected. Our mission is to bridge this gap, creating a cohesive system that integrates clinical operations, vendor activities, and financial data into a real-time intelligence layer. This empowers R&D and finance teams with the insights they need to make informed decisions.

Our AI-driven, pharma-native infrastructure is designed to scale industry standards that we have helped shape alongside major partners. We facilitate prediction, control, and execution in some of the most complex R&D environments globally. As we continue to gain the trust of enterprise teams, we are now focused on the critical task of scaling our operations in a high-stakes environment.

Condor is a rapidly growing company, backed by leading institutional investors such as Felicis and 645 Ventures, collaborating with top 200 biopharma companies. This is a unique opportunity to contribute to the infrastructure that influences how new therapies reach patients.

The Role
We are seeking a Senior Backend and Data Platform Engineer to play a key role in developing the foundational data infrastructure for Condor's financial intelligence platform. This position is pivotal in turning complex clinical and financial data into actionable intelligence that enterprise biopharma teams can rely on.

In this role, you will be responsible for designing and managing the core data foundations that underpin Condor's financial engine and AI capabilities. Your work will involve modeling intricate, high-stakes data, constructing reliable data pipelines and services, and ensuring that product features and intelligence workflows function with precision, consistency, and scalability. The systems you develop will directly support critical finance and operational applications.

This hands-on, senior engineering position provides you with significant ownership. You will engage with backend services, data pipelines, and APIs, bringing features from concept to production. You will define necessary data schemas, transformations, and architectural patterns that become essential as our platform evolves. Although your primary focus will be on backend and data engineering, you will also be encouraged to work across the stack to ensure seamless integration of data and intelligence.
Role Overview
sfcompute is hiring a Software Engineer focused on ETL and Data in San Francisco, CA. This position centers on building and maintaining data pipelines that turn raw data into actionable insights.

What You Will Do
- Design and implement ETL processes to move and transform data efficiently.
- Work with teams across the company to improve data quality and accessibility.
- Support data-driven decision-making by ensuring reliable and accurate information is available.
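As a rough illustration of the extract-transform-load shape this role describes, here is a minimal sketch using only the Python standard library. The field names and normalization rules are invented for the example, not sfcompute's schema:

```python
import csv
import io

def extract(raw_csv: str) -> list:
    """Extract: parse raw CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list) -> list:
    """Transform: normalize fields and drop rows missing an id."""
    out = []
    for r in rows:
        if not r.get("id"):  # skip malformed rows
            continue
        out.append({"id": int(r["id"]), "name": r["name"].strip().lower()})
    return out

def load(rows: list, sink: list) -> int:
    """Load: append transformed rows to a destination (a list stands in
    for a warehouse table here); return the number of rows loaded."""
    sink.extend(rows)
    return len(rows)
```

A real pipeline would swap the list sink for a warehouse writer and add idempotency and monitoring, but the three-stage separation is the core of the pattern.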
Full-time|$191K/yr - $225K/yr|On-site|United States
Founded in 2007, Airbnb began its journey when two hosts welcomed three guests into their San Francisco home. Today, we boast a thriving community of over 5 million hosts who have welcomed more than 2 billion guest arrivals across nearly every country worldwide. Our hosts provide exceptional stays and unique experiences, enabling guests to connect with local communities in a genuine and meaningful way.

Join Our Community:
At Airbnb, reliable data across all business sectors is essential to driving insight and innovation. To achieve this, we focus on understanding business needs, securing appropriate data sources, designing effective data models, and establishing robust and dependable data pipelines.

We are currently recruiting for the following teams:

The Data Stewardship Team: A dedicated group of data enthusiasts with diverse expertise in analytics, data modeling, governance, compliance, and scalable data quality. Our mission is to ensure that Airbnb meets its compliance obligations within our data ecosystem while enabling data consumers to easily find the best data suited for their needs. As part of the overall Data Infrastructure organization, we manage the online and offline data infrastructure and oversee the processes that facilitate data transitions between these environments.

The Users and Contextualization Data & AI Team: A crucial component of Marketplace Data & AI, this team focuses on developing foundational data systems that provide deeper insights into essential domains. Specifically, we concentrate on user data (Guests & Hosts) to create high-quality, well-governed user data and insights. These insights are vital for crafting personalized and context-aware experiences that enhance trip quality both on and off the Airbnb platform, ultimately enabling Airbnb to better understand and serve its users throughout their journey.

Your Impact:
Data Stewardship is integral to Airbnb's operations. High-quality data is imperative for our business decisions and the future of our AI initiatives. We are responsible for the overall strategy regarding data quality, identifying critical data and its provenance, measuring the effectiveness of our internal data products, and collaborating with our core catalog team to provide optimal data solutions.
Full-time|$180K/yr - $250K/yr|On-site|San Francisco
About Unto Labs
At Unto Labs, we are a dedicated team of low-level engineers pushing the boundaries of distributed systems to harness the full potential of modern hardware. Our mission is to revolutionize blockchain technology, innovating from fundamental consensus mechanisms to finely tuned networking stacks, serving applications that scale globally.

About the Role
We are on the lookout for a talented Systems Engineer to tackle the most complex challenges in distributed, high-performance computing. In this role, you will design and develop critical components of a base-layer blockchain, aiming to extract maximum performance from standard hardware. As part of our dynamic team, you will engage with the entire modern blockchain stack, including cryptography, consensus, distributed systems, and core kernels.

Key Responsibilities
- Design robust systems ensuring effective process isolation.
- Develop and test high-performance subsystems from the ground up in C, with a focus on cryptography, consensus algorithms, and networking.
- Implement sophisticated error correction and data recovery solutions.
- Contribute to and enhance a suite of custom benchmarking and performance testing tools.

Qualifications
- Extensive knowledge of systems programming languages (C, C++, Rust) with an emphasis on performance optimization.
- Strong understanding of contemporary CPU architectures, memory hierarchies, and low-level optimization strategies.
- Expertise in high-performance networking and protocol design.
- Familiarity with distributed systems and consensus algorithms.

Why Join Us?
- Be at the cutting edge of blockchain technology, working on next-generation performance enhancements within a small, agile, and collaborative team.
- Collaborate with seasoned experts in high-frequency trading and the development of scalable ecosystems.
- Contribute to the architecture of a system capable of processing millions of transactions per second.
- Gain access to state-of-the-art hardware and development tools.

Compensation and Benefits
- Competitive salary ranging from $180,000 to $250,000+ USD.
- Significant equity and potential for growth.
- A flexible working environment.
About Our Team
The Frontier Systems team at OpenAI is at the forefront of technology, responsible for creating, deploying, and maintaining some of the world's largest supercomputers. These supercomputers are pivotal for training our most advanced AI models, pushing the boundaries of innovation. We transform sophisticated data center designs into operational systems and develop the software infrastructure necessary for extensive frontier model training. Our goal is to ensure these hyperscale supercomputers operate reliably and efficiently, supporting groundbreaking AI research.

About the Role
As a key member of the Frontier Systems team, you will be instrumental in designing the critical infrastructure that ensures our supercomputers function seamlessly for pioneering AI research. In this role, you'll address system-level challenges and implement automation solutions that minimize disruptions during large-scale training processes. Your responsibilities will encompass end-to-end ownership of your projects, allowing you to make significant contributions to our mission. This position is ideal for individuals who excel in diagnosing complex system issues and crafting automation strategies to proactively resolve problems across a vast network of machines.

Your Responsibilities Include:
- Enhancing system health checks to maintain the stability of our hyperscale supercomputers during model training.
- Conducting in-depth investigations into hardware failures and system-level bugs to uncover root causes.
- Developing automation tools that monitor and resolve issues across thousands of systems, enabling uninterrupted research progress.

You May Be a Great Fit If You Possess:
- 7+ years of hands-on experience in software engineering.
- Strong proficiency in Python and shell scripting.
- Expertise in analyzing complex data sets using SQL, PromQL, Pandas, or other relevant tools.
- Experience in creating reproducible analyses.
- A solid balance of skills in both building and operationalizing systems.

Prior experience with hardware is not a prerequisite for this role.

Preferred Qualifications:
- Familiarity with the intricacies of hardware components, protocols, and Linux tools (e.g., PCIe, InfiniBand, networking, power management, kernel performance tuning).
- Experience with system optimization and performance tuning.
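The fleet-wide health-check responsibility above can be sketched at toy scale. The check names and rollup shape below are assumptions for illustration, not OpenAI's tooling:

```python
def summarize_health(checks: dict) -> dict:
    """Aggregate per-node health checks into a fleet summary.

    `checks` maps node name -> {check name -> bool (True = passing)}.
    Returns the unhealthy nodes and a failure count per check: the
    kind of rollup an automated remediation loop would act on.
    """
    unhealthy = sorted(n for n, c in checks.items() if not all(c.values()))
    failures = {}
    for node_checks in checks.values():
        for name, ok in node_checks.items():
            if not ok:
                failures[name] = failures.get(name, 0) + 1
    return {"unhealthy_nodes": unhealthy, "failures_by_check": failures}
```

At real scale the inputs would come from a metrics system rather than a dict, but the two outputs (which nodes to drain, which check is failing fleet-wide) are the signals that keep a training run moving.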
Company Overview:
Specter is revolutionizing how businesses perceive their physical environments by developing a software-defined control plane. Our mission is to enhance the security of American enterprises by providing them with comprehensive visibility over their physical assets.

We are pioneering a connected hardware-software ecosystem that leverages multi-modal wireless mesh sensing technology, reducing the deployment costs and time for sensors by a factor of ten. Our platform aims to be the perception engine for a company's physical presence, facilitating real-time visibility of perimeters and enabling autonomous operational management.

Founded by passionate innovators from Anduril, Tesla, Uber, and the U.S. Special Forces, our co-founders, Xerxes and Philip, are dedicated to empowering our partners in the rapidly evolving landscape of physical AI and robotics.
Join Cloudflare as a Systems Engineer specializing in Data, where you will play a critical role in enhancing our infrastructure and ensuring the reliability of our services. You will collaborate with cross-functional teams to design, implement, and maintain systems that handle vast amounts of data efficiently and securely. Your contributions will be pivotal in optimizing performance and delivering exceptional user experiences.
Join our innovative team at Crusoe as a Staff Software Engineer, where you will leverage your expertise in systems engineering to develop cutting-edge software solutions. In this dynamic role, you will collaborate with cross-functional teams to design, implement, and optimize systems that drive our mission forward. Your contributions will be pivotal in enhancing our technology stack and ensuring the seamless operation of our systems.
At NerdWallet, we are committed to empowering individuals to make informed financial decisions. Our team comprises exceptional individuals who thrive in an inclusive, flexible, and candid environment. Whether you choose to work remotely or in the office, we prioritize your well-being, professional development, and the impact you can make. We believe that when one of us elevates our skills, the whole team benefits.

As part of NerdWallet's Platform team, you will oversee the systems that serve as the backbone of our consumer experience. This includes management of our centralized product data platform, partner ingestion pipelines, publishing and click-tracking infrastructure, GraphQL gateway operations, and our high-traffic, headless WordPress CMS. These platforms deliver precise, compliant, and high-performance product and content experiences to millions of users on both web and mobile platforms. We are searching for a Senior Engineering Manager to lead this team in modernizing legacy services into scalable and reliable systems while advancing our vision of a decoupled, adaptable platform that facilitates quicker publishing, enhanced observability, and future growth.

In the role of Senior Engineering Manager for Platform Systems, you will guide and support a team of engineers in delivering high-quality, scalable, and secure software that aligns with NerdWallet's product and business objectives. You will collaborate closely with Product Managers and other cross-functional partners to define the roadmap, prioritize tasks, and eliminate obstacles, while nurturing strong engineering practices and a culture of continuous improvement. Your responsibilities will include ensuring technical quality, team well-being, and daily operations, while mentoring engineers, making strategic technical decisions, and balancing immediate deliverables with long-term sustainability, compliance, and reliability. This position reports to the Director of Engineering.

Opportunities for Impact:
- Lead, mentor, and develop a high-performing engineering team responsible for NerdWallet's platform systems, including the Content Platform, CMS, and Product Data Platform.
- Collaborate with Product Managers and cross-functional teams to strategize, prioritize, and execute the product roadmap.
- Champion consistent adherence to software development best practices, including code quality, testing, documentation, and operational excellence.
- Influence and guide technical and architectural decisions to ensure solutions are scalable, secure, reliable, and compliant with regulatory standards.
- Balance immediate project needs with long-term project vision and maintainability.
Full-time|On-site|CA - San Francisco; WA - Seattle; UT - Cottonwood Heights
Join SoFi as a Senior Software Engineer on our Data Foundations team, where you will play a pivotal role in shaping our data architecture and enhancing our data-driven capabilities. You will work closely with cross-functional teams to develop robust data solutions that empower our business decisions and improve customer experiences.

As a Senior Software Engineer, you will leverage your expertise in data engineering, software development, and cloud technologies to build scalable data pipelines and maintain high-quality data infrastructure. Your contributions will directly impact our ability to deliver innovative financial solutions.
About Liquid AI
Originating from MIT CSAIL, Liquid AI specializes in the development of general-purpose AI systems designed to operate seamlessly across various platforms, including data center accelerators and on-device hardware. Our focus is on delivering low latency, efficient memory usage, privacy, and reliability. We collaborate with organizations in diverse sectors such as consumer electronics, automotive, life sciences, and financial services. As we experience rapid growth, we seek outstanding talent to join our mission.

The Opportunity
The Training Infrastructure team is at the forefront of building the distributed systems that empower our next-generation Liquid Foundation Models. As our operations expand, we aim to innovate, implement, and enhance the infrastructure crucial for large-scale training. This role is centered around high ownership of training systems, emphasizing runtime, performance, and reliability rather than a typical platform or SRE function. You will collaborate within a small, agile team, creating vital systems from the ground up instead of working with pre-existing infrastructure. While San Francisco and Boston are preferred, we are open to other locations.

What We're Looking For
We are seeking an individual who:
- Embraces the complexity of distributed systems: Our team is dedicated to maintaining stability during extensive training runs, troubleshooting training failures across GPU clusters, and enhancing overall performance.
- Is passionate about building: We value team members who take pride in developing robust, efficient, and reliable infrastructure.
- Excels in uncertain environments: Our systems are designed to support evolving model architectures. You will be making decisions based on incomplete information and rapidly iterating.
- Aligns with team goals and delivers results: The best engineers on our team align with collective priorities while providing data-driven feedback when challenges arise.

The Work
- Design and develop core systems that ensure quick and reliable large training runs.
- Create scalable distributed training infrastructure for GPU clusters.
- Implement and refine parallelism and sharding strategies for evolving architectures.
- Optimize distributed efficiency through topology-aware collectives, communication/compute overlap, and straggler mitigation.
- Develop data loading systems to eliminate I/O bottlenecks for multimodal datasets.
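The data-loading concern mentioned above (hiding I/O latency from training) is commonly addressed with read-ahead. Here is a minimal sketch using a bounded queue and a background thread; it is a toy illustration under our own assumptions, not Liquid AI's loader:

```python
import queue
import threading

def prefetch(iterable, depth: int = 4):
    """Wrap an iterable in a background thread that reads ahead,
    overlapping upstream I/O with downstream compute. `depth` bounds
    how many items may be buffered, providing backpressure."""
    q = queue.Queue(maxsize=depth)
    _END = object()  # sentinel marking exhaustion of the source

    def worker():
        for item in iterable:
            q.put(item)  # blocks when the buffer is full
        q.put(_END)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is _END:
            return
        yield item
```

Production loaders layer sharding, multi-process decoding, and pinned-memory transfers on top, but the bounded read-ahead queue is the basic mechanism for keeping accelerators fed.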