Experience Level
Entry Level
Qualifications
The ideal candidate should possess a strong foundation in distributed systems, with experience in designing and implementing scalable applications. Proficiency in programming languages such as Go, Python, or Java is essential. A background in data analytics or real-time processing systems will be a significant advantage.
About the job
Join Cloudflare as a Distributed Systems Engineer within our dynamic Data Platform team, focusing on Analytics and Alerts. In this position, you will play a pivotal role in building and optimizing distributed systems that power our data analytics capabilities, providing real-time insights and alerts to enhance our customer experience.
About Cloudflare, Inc.
Cloudflare is a global leader in web performance and security, helping to build a better Internet. We protect and accelerate any Internet application without adding hardware, installing software, or changing a line of code.
About Our Team
Join the innovative Database Systems team at OpenAI, where we specialize in high-performance distributed databases. We are the architects behind Rockset, a cutting-edge real-time search, analytics, and vector database that powers all vector search and retrieval-augmented generation (RAG) at OpenAI. Rockset underpins core functionalities across…
Join Cloudflare as a Distributed Systems Engineer focusing on our Data Platform, specifically in the area of Logs and Audit Logs. In this pivotal role, you will drive the design and implementation of scalable distributed systems that enhance our data processing capabilities.

As part of our innovative team, you will collaborate with cross-functional partners to optimize data flow and ensure the integrity of logs across our infrastructure. Your expertise will contribute to building a robust platform that supports analytics and monitoring for our diverse customer base.
Role Overview
Join our innovative team as a Distributed Systems Engineer at Archil, where you will play a pivotal role in developing cutting-edge storage solutions. You will work across the entire technology stack, tackling challenges as they arise and significantly shaping our product's technical and strategic direction.

Your responsibilities will include:
- Being on-call for our production systems to assist customers promptly in case of issues.
- Innovating and implementing unprecedented features in our storage services.
- Designing interactions within distributed systems to ensure atomicity and idempotency.
- Deploying and standardizing infrastructure across various cloud environments.
- Navigating evolving customer requirements amidst ambiguity.
Full-time|$227.2K/yr - $417K/yr|Hybrid|San Francisco, CA; Los Angeles, CA; New York, NY (Hybrid); USA - Remote
About the Role
Join our dynamic ML Infrastructure team as a Software Engineer, where you'll collaborate closely with the Machine Learning and Product teams to build top-tier machine learning inference platforms. These cutting-edge platforms drive vital services such as personalized recommendations, search functionalities, and content comprehension at Tubi.

Your primary focus will be on the development and maintenance of low-latency ML model serving systems that cater to Deep Learning, LLM, and Search models. This will include the creation of self-service infrastructure and critical components such as the inference engine, feature store, vector store, and experimentation engine.

In this role, you'll enhance our service deployment and operational processes, with opportunities to contribute to open-source projects. Enjoy architectural freedom to explore innovative frameworks, spearhead significant cross-functional projects, and elevate the capabilities of our ML and Product teams.

We are currently hiring for two positions:
- Staff Software Engineer
- Principal Software Engineer

Additional Details: As a Principal Engineer, you will serve as a technical leader and visionary, guiding the advancement of our machine learning platform. You'll address complex technical challenges, shape architectural decisions, and mentor senior engineers, fostering a culture of excellence and continuous improvement. Your contributions will impact millions of users.
Role Overview
Join Archil as a Senior Distributed Systems Engineer, where you will play a critical role in developing our innovative storage solutions. You'll engage with technologies across the entire stack to tackle challenges and contribute to building Archil volumes, significantly influencing both technical design and product strategy.

Key Responsibilities
- Provide on-call support for our production systems, ensuring customer satisfaction in case of issues.
- Innovate and implement unprecedented capabilities within our storage services.
- Design interactions in distributed systems focusing on atomicity and idempotency.
- Deploy and generalize infrastructure across multiple cloud environments.
- Adapt to evolving customer needs amidst ambiguity.
- Lead engineering teams through complex decisions and provide insightful PR feedback.
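The idempotency requirement above is a general distributed-systems pattern: a retried request must not apply its side effect twice. A minimal sketch of the idempotency-key technique (illustrative only; the service name and fields are hypothetical, not Archil's API):

```python
import uuid

class ChargeService:
    """Toy service showing idempotency keys; all names are hypothetical."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first call
        self.balance = 0

    def charge(self, idempotency_key, amount):
        # A replayed key returns the stored result instead of
        # applying the side effect a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balance += amount  # the side effect happens exactly once
        result = {"balance": self.balance}
        self._results[idempotency_key] = result
        return result

svc = ChargeService()
key = str(uuid.uuid4())
first = svc.charge(key, 100)
retry = svc.charge(key, 100)  # e.g. a client retry after a timeout
assert first == retry and svc.balance == 100
```

In a real system the key-to-result map would live in durable shared storage so that any replica can answer the retried request.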
Join Krea's Innovative Team
At Krea, we are at the forefront of developing next-generation AI creative tools. Our commitment lies in making AI an intuitive and controllable medium for creatives. We aspire to create tools that enhance human creativity rather than replace it.

We view AI as a transformative medium that enables expression across diverse formats—text, images, video, sound, and even 3D. Our focus is on creating smarter, more adaptable tools that leverage this medium effectively.

The Role of Supercomputing and AI Infrastructure at Krea
Our team is responsible for building and managing the foundational infrastructure that supports Krea's research and inference processes. This includes distributed training systems, 1000+ GPU Kubernetes clusters, and petabyte-scale data pipelines. Much of our work involves creating bespoke solutions, such as custom distributed datastores, job orchestration systems, and advanced streaming pipelines, designed to handle modern AI workloads efficiently.

Key Projects You Will Contribute To:
- Distributed Data Systems: Design and implement multi-stage pipelines to transform petabytes of raw data into clean, annotated datasets; run classification models across billions of images; deploy and integrate large language models to caption extensive multimedia data.
- GPU Infrastructure: Manage distributed training and inference across 1000+ GPU Kubernetes clusters; address orchestration and scaling challenges for large-scale GPU job processing; optimize research workflows across multiple datacenters.
- Distributed Training: Profile and enhance dataloaders streaming thousands of images per second; troubleshoot InfiniBand networking during extensive training runs; develop fault-tolerance systems for large-scale pretraining; collaborate with researchers to refine reinforcement learning infrastructure.
- Applied ML Pipelines: Identify clean scenes in millions of videos using distributed shot-boundary detection; tailor and train models to sift through billions of images for specific queries; construct systems that link raw cluster capacity with research outcomes.
About Our Team
The Platform Systems team at OpenAI is at the forefront of innovation, merging advanced AI technologies with large-scale distributed systems. We are tasked with creating the engineering and research infrastructure essential for training OpenAI's premier models on some of the most powerful, custom-built supercomputers globally.

Our team is dedicated to developing the core software for model training, delving deep into the technology stack. This encompasses collective communication, compute efficiency, parallelism strategies, fault tolerance, failure detection, and observability. The systems we design are pivotal to enhancing OpenAI's research capabilities, facilitating reliable and efficient training at the leading edge of technology.

We work in close partnership with researchers across the organization, continuously integrating insights from various OpenAI projects to advance our training platform.

About the Role
As a Software Engineer specializing in Platform Systems, you will architect and develop distributed systems that enhance visibility into large-scale training operations, ensuring their dependable operation at scale. Your responsibilities will include designing systems for failure detection, tracing, and observability that pinpoint slow or malfunctioning nodes, identify performance bottlenecks, and assist engineers in optimizing extensive distributed training tasks. This infrastructure is integral to the functionality of OpenAI's training stack and is continuously evolving to accommodate new use cases and increasingly intricate workloads.

This position is central to our training infrastructure, merging systems engineering, performance analysis, and large-scale debugging.

Key Responsibilities
- Design and develop distributed failure detection, tracing, and profiling systems tailored for large-scale AI training jobs.
- Create tools to identify slow, faulty, or errant nodes and deliver actionable insights into system behavior.
- Enhance observability, reliability, and performance across OpenAI's training platform.
- Troubleshoot and resolve issues within complex, high-throughput distributed systems.
- Collaborate effectively with systems, infrastructure, and research teams to advance platform capabilities.
- Adapt and expand failure detection and tracing systems to support new training paradigms and workloads.

Ideal Candidate Profile
- A deep passion for performance, stability, and observability in distributed systems.
- Proficiency in systems engineering and performance analysis.
- Experience debugging high-throughput distributed systems.
- Strong collaboration skills and a track record of working with cross-functional teams.
- Adaptability and eagerness to embrace new technologies and methodologies.
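The failure-detection work described above commonly starts from a heartbeat timeout. As a rough, generic sketch (not OpenAI's implementation; node names and the threshold are made up):

```python
# Flag nodes whose last heartbeat is older than `timeout` seconds.
# A real detector would also handle clock skew, flapping, and the
# distinction between slow nodes and dead ones.
def stale_nodes(last_heartbeat, now, timeout=30.0):
    return sorted(node for node, ts in last_heartbeat.items()
                  if now - ts > timeout)

beats = {"node-a": 100.0, "node-b": 70.0, "node-c": 99.5}
assert stale_nodes(beats, now=105.0) == ["node-b"]  # 35 s without a beat
```

Tracing and profiling layers then attach to the flagged nodes to explain why they fell behind.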
About Granica
Granica is a pioneering AI research and infrastructure company dedicated to creating reliable and steerable representations of enterprise data. We build trust through Crunch, a policy-driven health layer designed to keep extensive tabular datasets efficient, reliable, and reversible. From this foundation, we are developing Large Tabular Models—systems that learn cross-column and relational structures to provide trustworthy answers and automation, complete with built-in provenance and governance.

Our Mission
The current limitations of AI are not solely due to model design but also to the inefficiencies of the data that supports it. At scale, every redundant byte, poorly organized dataset, and inefficient data path contributes to significant costs, latency, and energy waste. Granica's mission is to eliminate these inefficiencies. We leverage cutting-edge research in information theory, probabilistic modeling, and distributed systems to create self-optimizing data infrastructure that continuously enhances how information is represented and utilized by AI.

Our engineering team collaborates closely with the Granica Research group led by Prof. Andrea Montanari of Stanford University, merging advancements in information theory and learning efficiency with large-scale distributed systems. We believe that the next major breakthrough in AI will stem from innovations in efficient systems, rather than simply larger models.

What You Will Create
- Global Metadata Substrate: Design and refine the global metadata and transactional substrate that enables atomic consistency and schema evolution across exabyte-scale data systems.
- Adaptive Engines: Architect systems that self-optimize, reorganizing and compressing data according to access patterns, achieving unprecedented efficiency improvements.
- Intelligent Data Layouts: Innovate new encoding and layout strategies that challenge the theoretical limits of signal per byte read.
- Autonomous Compute Pipelines: Spearhead the development of distributed compute platforms that scale predictively and maintain reliability even under extreme load and failure conditions.
- Research to Production: Partner with Granica Research to transform advances in compression and probabilistic modeling into production-ready, industry-leading systems.
- Latency as Intelligence: Propel systems forward by optimizing for latency as a key aspect of intelligence.
Join Cloudflare as a Principal Systems Engineer, Data, where you will lead innovative projects that enhance our data processing capabilities. You will work collaboratively with cross-functional teams to design, implement, and optimize systems that efficiently handle large-scale data. This role requires a deep understanding of systems engineering principles, strong analytical skills, and a passion for leveraging data to drive decisions. Your contributions will be pivotal in shaping the future of our data infrastructure.
About Our Team
At OpenAI, our Storage Infrastructure team is at the forefront of enabling data accessibility, placement, and lifecycle management through advanced APIs. We prioritize scalability, reliability, security, and usability to meet the demands of our pioneering AI research.

Role Overview
We are seeking a talented Software Engineer to join our Storage Infrastructure team, where you will architect and maintain exascale systems designed to efficiently and reliably manage research data across multiple regions. The ideal candidate will have extensive experience in distributed systems, particularly in developing exascale data management solutions or distributed filesystems.

Your Responsibilities
- Design and develop software solutions to manage exascale data, ensuring accessibility for researchers.
- Enhance the reliability, predictability, and cost efficiency of our storage systems.
- Collaborate with researchers to understand and address diverse data use cases.
- Implement robust security measures to protect our critical datasets.

Ideal Candidate Profile
- Strong foundation in distributed systems principles with a proven ability to design and implement scalable, reliable, and secure storage architectures.
- Proficiency in programming languages relevant to storage systems development.
- Experience with cloud platforms, particularly Azure.
- Familiarity with AI/ML data access patterns.
- A proactive approach and adaptability in a fast-paced, dynamic environment.

About OpenAI
OpenAI is a cutting-edge AI research and deployment organization committed to ensuring that general-purpose artificial intelligence benefits all of humanity. We strive to push the boundaries of AI capabilities while ensuring safety and human-centric development.
We strive to embrace diverse perspectives, voices, and experiences that reflect the full spectrum of humanity, and we are proud to be an equal opportunity employer, committed to fostering an inclusive workplace where all individuals are respected and valued.
Overview
Pluralis Research is at the forefront of innovation in Protocol Learning, specializing in the collaborative training of foundational models. Our approach ensures that no single participant ever has or can obtain a complete version of the model. This initiative aims to create community-driven, collectively owned frontier models that operate on self-sustaining economic principles.

We are seeking experienced Senior or Staff Machine Learning Engineers with over 5 years of expertise in distributed systems and large-scale machine learning training. In this role, you will design and implement a groundbreaking substrate for training distributed ML models that function effectively over consumer-grade internet connections.
Company Background
At Specter, we are pioneering a software-defined control plane that enhances the safety and visibility of physical assets for American businesses. Our innovative approach leverages a robust hardware-software ecosystem built on advanced multi-modal wireless mesh sensing technology, drastically reducing the costs and time associated with sensor deployment.

Our platform is set to revolutionize the way companies perceive their physical environments, enabling real-time visibility and management of operations. With a passionate team hailing from prestigious organizations such as Anduril, Tesla, Uber, and the U.S. Special Forces, we are committed to leading the charge in the rapidly evolving realm of physical AI and robotics.

Role + Responsibilities
We are looking for a Backend Software Engineer to join our engineering team. In this role, you will be instrumental in designing, deploying, and scaling distributed systems that power our mother node base stations and cloud services. Your contributions will ensure the reliability, scalability, and performance of Specter's sensing and perception platform across both edge devices and cloud infrastructure.

Key responsibilities include:
- Architecting and operating distributed, Linux-based systems across edge and cloud environments.
- Designing scalable pipelines for streaming, storing, and processing vast amounts of video and sensor data.
- Building containerized services (Docker/Kubernetes) for the deployment of perception and inference workloads at scale.
- Developing observability systems for monitoring, logging, and alerting across our fleet of edge nodes and cloud infrastructure.
- Optimizing queue-based video ingestion, transcoding, and upload pipelines for reliability in bandwidth-constrained environments.
- Implementing highly available infrastructure for real-time alerting, APIs, and customer-facing event systems.
- Collaborating cross-functionally with hardware, perception, and application teams to ensure infrastructure supports rapid iteration and scalability.
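Retries with exponential backoff are the standard way to keep an upload pipeline reliable when bandwidth is constrained, as in the ingestion work above. A minimal sketch (a generic pattern, not Specter's code; the base and cap values are arbitrary):

```python
def backoff_schedule(attempts, base=1.0, cap=60.0):
    """Exponential backoff delays in seconds, capped at `cap`.

    In practice each delay would also get random jitter so that many
    edge nodes do not retry in lockstep after a network outage.
    """
    return [min(cap, base * (2 ** i)) for i in range(attempts)]

assert backoff_schedule(5) == [1.0, 2.0, 4.0, 8.0, 16.0]
assert backoff_schedule(8)[-1] == 60.0  # later retries hit the cap
```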
Full-time|On-site|San Francisco, Seattle, New York, Toronto
Join Stripe as a Staff Software Engineer in our Stream Compute team, where you will play a pivotal role in building scalable solutions that power the financial infrastructure of the internet. As a member of our innovative engineering team, you will leverage your expertise to design and implement robust software solutions that enhance the performance and reliability of our streaming data capabilities.
About Us
At XOXO, we are a pioneering research lab dedicated to crafting intelligent interfaces that enhance everyday life. Our stealth team, comprising passionate engineers, designers, and researchers, is committed to solving unique challenges that affect life beyond the workplace. With recent advancements in our infrastructure, architecture, and model layers, we are seeking exceptional builders to create the interface and application layers that will turn our innovative vision into reality.

About the Role
We are looking for a skilled backend engineer to design, develop, and maintain high-quality production-grade backend systems. This position is ideal for a candidate with a strong foundation in cloud infrastructure, distributed systems, observability, and data persistence, where testing and stability are central to your approach. You will be responsible for overseeing backend projects from conception to production launch, ensuring a strong emphasis on performance, reliability, infrastructure efficiency, and security throughout the entire development lifecycle.

What You'll Do
- Develop and manage backend services and distributed systems in a production environment using AWS or GCP.
- Architect data storage and retrieval systems utilizing a balanced combination of vector, OLAP, SQL, NoSQL, caching, and search technologies.
- Establish and maintain observability through monitoring, logging, tracing, and alerting to ensure system health and performance.
- Create and execute comprehensive test frameworks that address deployment and runtime failure scenarios.
- Lead backend projects from start to finish, focusing on complex feature development, CI/CD, Terraform implementation, stability, and security.
Full-time|$160K/yr - $210K/yr|On-site|New York, NY, San Francisco, CA or Los Angeles, CA
The Opportunity
At Enigma, we are at a pivotal moment of growth, receiving enthusiastic feedback from our clients regarding the substantial value our product provides. This feedback drives our urgent need to effectively present the capabilities of our small business data as we expand our sales and marketing efforts.

The Role
We are seeking a skilled Senior Software Engineer to join our dynamic API and Data Delivery Team. In this position, you will be instrumental in designing, constructing, and maintaining essential systems for processing and delivering vast datasets, collaborating with both teammates and clients to address impactful, real-world challenges.

What You'll Do
- Develop scalable, highly available, and high-throughput systems deployed in cloud environments.
- Tackle challenges involving containers, cloud infrastructure, and infrastructure as code (primarily using Docker, AWS, and Terraform).
- Exhibit a proactive attitude that embraces challenges, regardless of their size.
- Take pride in writing clean, well-tested, and maintainable code.
- Thrive when collaborating as part of a motivated and cohesive team.
- Identify and address problems that may go unnoticed by others.
- Be driven to create tangible impacts for our customers.
- Inspire your colleagues to excel while fostering a collaborative and supportive team culture.
- Manage responsibilities related to architecture, design decisions, hands-on implementation, team organization, and technical mentorship.

What Makes This Role Exciting?
- Impact: Your technical expertise and decision-making will directly influence our customers and the success of our product, affecting critical choices at multi-billion dollar firms.
- Technical Challenge: Engage with cutting-edge technologies surrounding databases, information retrieval, distributed systems, microservices, elastic scaling, data pipelines, and more.
- Ownership: The API & Data Delivery team is addressing some of the world's most complex challenges. The ideal candidate is an engineer eager to expand their responsibilities and collaborate with the team to create significant technical and business impacts.
Join Crusoe as a Principal Systems Software Engineer and play a vital role in revolutionizing the tech industry. You will lead the development of innovative software solutions that enhance our systems and platforms, contributing to the overall mission of providing efficient and sustainable computing resources. Your expertise will help shape the future of our software architecture and ensure seamless integration across various applications.
Full-time|$200K/yr - $250K/yr|On-site|San Francisco, CA
At Sift, we are revolutionizing the way sophisticated machines are constructed, tested, and managed. Our innovative platform provides engineers with instantaneous visibility into high-frequency telemetry, effectively removing bottlenecks and fostering swifter, more dependable development.

Originating from our extensive experience at SpaceX on projects such as Dragon, Falcon, Starlink, and Starship, Sift was created to address the challenges of scaling telemetry, debugging flight systems, and ensuring mission reliability, which necessitated the development of groundbreaking infrastructure. Established by a talented team from SpaceX, Google, and Palantir, Sift is tailored for mission-critical systems where accuracy and scalability are imperative.

As a key early engineer concentrating on our data infrastructure, your role will extend beyond mere coding—you will shape foundational architecture and assist in scaling a real-time telemetry platform from its inception. You will engage with intricate backend systems designed to process, store, and deliver millions of high-frequency data points each second, facilitating rapid iteration cycles for some of the world's leading engineering teams.
About the Team
At OpenAI, we are on a mission to develop safe and beneficial artificial general intelligence. Our models are integrated into innovative products such as ChatGPT and various APIs. To ensure these systems are swift, reliable, and economically viable, we require top-tier infrastructure that stands out in the industry.

The Caching Infrastructure team plays a pivotal role by creating a robust caching layer that supports numerous critical applications at OpenAI. Our goal is to deliver a high-availability, multi-tenant caching platform capable of auto-scaling with workload demands, reducing tail latency, and accommodating a wide array of use cases.

We seek an experienced engineer who can design and scale this essential infrastructure. The ideal candidate will possess extensive experience in distributed caching systems (e.g., Redis, Memcached), a solid understanding of networking fundamentals, and expertise in Kubernetes-based service orchestration.
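Eviction policy is one of the core design choices in a caching layer like the one described. A toy sketch of least-recently-used eviction (illustrative only; a multi-tenant platform such as this would build on systems like Redis or Memcached rather than an in-process dict):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used key when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

c = LRUCache(2)
c.put("a", 1)
c.put("b", 2)
c.get("a")      # touching "a" makes "b" the least recently used entry
c.put("c", 3)   # exceeds capacity, so "b" is evicted
assert c.get("a") == 1 and c.get("b") is None and c.get("c") == 3
```

Production caching layers add the dimensions the posting mentions on top of a policy like this: multi-tenancy, auto-scaling, and tail-latency controls.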
Team and Platform Focus
The Compute Infrastructure team at OpenAI designs, builds, and maintains the systems that support AI research at scale. This work brings together accelerators, CPUs, networking, storage, data centers, orchestration software, agent infrastructure, developer tools, and observability. The aim is to create a reliable, unified experience for researchers and product teams across the company.

Projects span the full stack: capacity planning, cluster lifecycle management, bare-metal automation, and distributed systems. The team manages Kubernetes scheduling, system optimization, high-performance networking, storage, fleet health, reliability, workload profiling, benchmarking, and improvements to the developer experience. Even small improvements in communication, scheduling, hardware efficiency, or debugging can significantly accelerate research. OpenAI matches engineers to areas within Compute Infrastructure that align with their skills and interests.

Role Overview
This Software Engineer role centers on building and evolving the compute platform that supports OpenAI's research and products. Candidates may bring expertise in low-level systems, high-performance computing, distributed infrastructure, reliability, CaaS, agent infrastructure, developer platforms, tooling, or infrastructure user experience. The most important qualities are strong analytical skills, the ability to write resilient code, and a collaborative approach that helps colleagues move faster and with more confidence.
What You Will Work On
- Working close to hardware or at the user interaction layer
- Developing CaaS and agent infrastructure
- Managing control and data planes that connect the system
- Bringing new supercomputing capabilities online
- Optimizing training workloads through profiler traces and benchmarks
- Improving NCCL and collective communication
- Analyzing GPUs, NICs, topology, firmware, thermal dynamics, and failure modes
- Designing abstractions to unify diverse clusters into a single platform

Areas of Expertise
No one is expected to cover every area listed. Some engineers focus on system performance, kernel or runtime behavior, large-scale networking protocols, RDMA, NCCL, GPU hardware, benchmarking, scheduling, or hardware reliability. Others improve the platform's usability through APIs, tools, workflows, and developer experience. The team values strong engineering judgment and a drive to advance the field.
Join us at sfcompute, where we are revolutionizing the future by mitigating risks associated with the largest infrastructure development in history. As the demand for GPU clusters surges, financing these data centers and their supporting infrastructure has never been more critical. Our innovative approach ensures that financing is secured through long-term contracts, providing peace of mind to both lenders and developers.

In the fast-paced world of AI and compute resources, we are creating a liquid market for GPU offtake, allowing even small startups to access high-end computing power without the burdens of traditional financing.

About the Role
As a Systems Software Engineer at sfcompute, you will be instrumental in developing a GPU market that brings the advanced software capabilities of hyperscalers to our innovative GPU neoclouds. Your responsibilities will encompass provisioning and monitoring bare-metal servers with our virtualization orchestration software, as well as collaborating with our GPU marketplace to facilitate user configuration of VMs, networks, and storage.

Key tasks include creating and maintaining a Linux OS image tailored for our tools, ensuring consistent deployment across nodes with specific data-center adjustments, and designing the API protocols and servers for user interaction.

Our primary programming language is Rust, which enables us to write efficient code across all system layers, from web servers to kernel coordination. If you are familiar with manually memory-managed languages like C and possess experience in higher-level programming, we encourage you to apply.