Software Engineer, Distributed Systems
About Figma
Figma is a cutting-edge design platform revolutionizing how teams collaborate and create. Our mission is to democratize design, making it accessible to all. With Figma, teams can brainstorm, prototype, and iterate on designs in real time from anywhere in the world, ensuring seamless collaboration and faster workflows. Join us in shaping the future of design and collaboration in an environment where creativity thrives.
Similar jobs
Databricks

At Databricks, we are passionate about empowering data teams to tackle some of the world's most challenging problems, from security threat detection to cancer drug development. Our mission is to build and operate the leading data and AI infrastructure platform, enabling our customers to concentrate on the high-value challenges that are integral to their own objectives. Founded in 2013 by the original creators of Apache Spark™, Databricks has rapidly evolved from a small office in Berkeley, California, to a global powerhouse with over 1,000 employees. Trusted by thousands of organizations, from startups to Fortune 100 companies, we are recognized as one of the fastest-growing SaaS companies worldwide.

Our engineering teams create highly sophisticated products that address significant needs in the industry. We continuously push the limits of data and AI technology while maintaining the resilience, security, and scalability essential for our customers' success on our platform. We manage one of the largest-scale software platforms, consisting of millions of virtual machines that generate terabytes of logs and process exabytes of data daily. At this scale, we frequently encounter cloud hardware, network, and operating system faults, and our software must effectively shield our customers from these challenges. Modern data analysis leverages advanced techniques, such as machine learning, that far exceed the capabilities of traditional SQL query engines.

As a Software Engineer on the Runtime team at Databricks, you will be instrumental in developing the next generation of distributed data storage and processing systems that outperform specialized SQL query engines in relational query performance, while providing the flexibility and programming abstractions to support a variety of workloads, from ETL to data science.

Examples of projects you may work on include:
- Apache Spark™: Contributing to the de facto open-source framework for big data.
- Data Plane Storage: Developing reliable, high-performance services and client libraries for storing and accessing vast amounts of data on cloud storage backends like AWS S3 and Azure Blob Store.
- Delta Lake: A storage management system that merges the scalability and cost-effectiveness of data lakes with the performance and reliability of data warehouses, featuring low-latency streaming. Its higher-level abstractions and guarantees, including ACID transactions and time travel, significantly reduce the complexity of real-world data engineering architectures.
- Delta Pipelines: Aiming to simplify the management of data engineering pipelines.
At Databricks, we are driven by a passion for empowering data teams to tackle the world's most challenging problems, from transforming transportation to accelerating medical innovations. We achieve this by creating and maintaining the leading data and AI infrastructure platform, enabling our clients to leverage profound data insights for business enhancement. Founded by engineers with a customer-first mentality, we eagerly embrace every opportunity to tackle complex technical challenges, ranging from the design of next-generation UI/UX for data interactions to scaling our services across millions of virtual machines. Our journey has just begun.

As a member of the Runtime team at Databricks, you will be instrumental in developing the next generation of distributed data storage and processing systems. These systems will surpass specialized SQL query engines in relational query performance while offering the programming abstractions necessary to support a variety of workloads, from ETL to data science.

Example projects include:
- Apache Spark™: Contribute to the de facto open-source standard framework for big data.
- Data Plane Storage: Develop reliable and high-performance services and client libraries for managing vast amounts of data within cloud storage backends like AWS S3 and Azure Blob Store.
- Delta Lake: Design a storage management system that merges the scalability and cost-effectiveness of data lakes with the performance and reliability of data warehouses, providing features like ACID transactions and time travel.
- Delta Pipelines: Simplify the orchestration and operation of numerous data pipelines, enabling clients to deploy, test, and upgrade pipelines effortlessly.
- Performance Engineering: Create a next-generation query optimizer and execution engine that is fast, scalable, and robust.
About Our Team
Join the innovative Sora team at OpenAI, where we are at the forefront of developing multimodal capabilities for our foundation models. As a dynamic hybrid of research and product development, we focus on seamlessly integrating advanced multimodal functionalities into our AI offerings, ensuring they are not only reliable and user-friendly but also aligned with our mission to foster broad societal benefits.

About the Position
We are seeking a dedicated Software Engineer specializing in Distributed Data Systems to architect and enhance the infrastructure that supports large-scale multimodal training and evaluation at OpenAI. In this role, you will oversee distributed data pipelines and collaborate closely with our researchers to translate their requirements into robust, high-performance systems. You will play a crucial role in fortifying the pipelines that underpin Sora's rapid innovation cycles. We are looking for engineers with a keen eye for detail, substantial experience with distributed systems, and a proven track record of building reliable infrastructure in high-stakes environments.

This position is based in San Francisco, CA, and follows a hybrid work model requiring three days in the office each week. We also provide relocation assistance to new team members.

Key Responsibilities:
- Design, build, and maintain data infrastructure systems, including distributed computing, data orchestration, distributed storage, streaming infrastructure, and machine learning infrastructure, ensuring they are scalable, reliable, and secure.
- Ensure our data platform can scale dramatically while maintaining high levels of reliability and efficiency.
- Collaborate with researchers to deeply understand their needs and translate them into production-ready systems.
- Harden, optimize, and maintain vital data infrastructure systems that drive multimodal training and evaluation.

Ideal Candidates Will Have:
- Extensive experience with distributed systems and large-scale infrastructure, coupled with a strong passion for data.
- A detail-oriented mindset and a commitment to building and maintaining dependable systems.
- Solid software engineering fundamentals and exceptional organizational skills.
- Comfort with ambiguity and rapid change in a fast-paced environment.

About OpenAI
OpenAI is a pioneering AI research and deployment organization dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We strive to advance digital intelligence in a way that is safe and beneficial, pushing the boundaries of innovation and technology.
Ambience Healthcare
About Us:
At Ambience Healthcare, we are not just another scribe; we are pioneering an AI intelligence platform that reintegrates humanity into healthcare, delivering significant ROI for health systems nationwide. Our technology empowers providers to concentrate on delivering exceptional care by alleviating the administrative burdens that distract them from their patients and essential duties. Ambience offers real-time, coding-aware documentation and clinical workflow support across various healthcare settings at the leading health systems in North America.

Our teams operate with unwavering dedication and extreme ownership to develop optimal solutions for our healthcare partners. We cherish transparency, positivity, and deep contemplation, holding each other to high standards because we recognize that the challenges we tackle are of utmost importance.

We have been recognized as the leader in enhancing clinician experience by KLAS Research in their Emerging Solutions Top 20 Report, honored by Fast Company as one of the Next Big Things in Tech, acknowledged by Inc. as one of the best AI companies in healthcare, and selected as a LinkedIn Top Startup in 2024 and 2025. We're proudly supported by Oak HC/FT, Andreessen Horowitz (a16z), OpenAI Startup Fund, and Kleiner Perkins, and we're just beginning our journey.

The Role:
Ambience is responsible for processing millions of patient encounters across the largest health systems in the country. These organizations rely on us for real-time clinical workflows where latency and reliability significantly influence patient care. A delay during a patient visit is not merely a negative metric; it can lead to a physician abandoning the tool.

In this position, you will oversee the core systems that enable Ambience to scale reliably: database architecture, caching, multi-tenancy, and performance optimization that shapes the user experience for clinicians. You will design database architectures that accommodate our growth, construct caching systems that prevent EHR API latency from affecting critical processes, and develop multi-tenant infrastructure that protects customer data while enhancing performance. Your ultimate goal will be to create infrastructure that other teams rely on effortlessly.

Our engineering roles are hybrid, requiring presence in our San Francisco office three times a week.
Join Cloudflare as a Distributed Systems Engineer within our dynamic Data Platform team, focusing on Analytics and Alerts. In this position, you will play a pivotal role in building and optimizing distributed systems that power our data analytics capabilities, providing real-time insights and alerts to enhance our customer experience.
Why Join Achira?
- Become part of an exceptional team of scientists, ML researchers, and engineers dedicated to transforming the landscape of drug discovery.
- Engage with cutting-edge machine learning infrastructure at an unprecedented scale, leveraging extensive computing resources, vast datasets, and ambitious goals.
- Take ownership of significant projects from conception through to architecture and deployment on large-scale infrastructure.
- Thrive in a culture that values thoroughness, speed, and a proactive, builder-oriented mindset.

About the Role
At Achira, we are developing state-of-the-art foundation models that address the most complex challenges in simulation for drug discovery and beyond. Our atomistic foundation simulation models (FSMs) serve as comprehensive representations of the physical microcosm, encompassing machine learning interatomic potentials (MLIPs), neural network potentials (NNPs), and various generative model classes.

We are looking for a Software Engineer who is enthusiastic about distributed computing and its applications in machine learning. You will play a pivotal role in designing and constructing the infrastructure for our ML data generation pipelines, model training, and fine-tuning workflows across large-scale distributed systems. Your expertise will be crucial in ensuring our compute clusters are efficient, observable, cost-effective, and dependable, enabling us to advance the frontiers of ML development. If you are passionate about distributed systems, performance optimization, and cloud cost efficiency, we encourage you to apply.

You will be empowered to conceptualize and manage complex workloads across multiple vendors worldwide. Achira's mission revolves around computation, and providing seamless access to our uniquely tailored workloads at the lowest possible cost is critical to our success.
About Granica
Granica is a pioneering AI research and infrastructure company dedicated to creating reliable and steerable representations of enterprise data. We build trust through Crunch, a policy-driven health layer designed to keep extensive tabular datasets efficient, reliable, and reversible. From this foundation, we are developing Large Tabular Models: systems that learn cross-column and relational structures to provide trustworthy answers and automation, complete with built-in provenance and governance.

Our Mission
The current limitations of AI are not solely due to model design but also to the inefficiencies of the data that supports it. At scale, every redundant byte, poorly organized dataset, and inefficient data path contributes to significant costs, latency, and energy waste. Granica's mission is to eliminate these inefficiencies. We leverage cutting-edge research in information theory, probabilistic modeling, and distributed systems to create self-optimizing data infrastructure that continuously enhances how information is represented and utilized by AI.

Our engineering team collaborates closely with the Granica Research group led by Prof. Andrea Montanari of Stanford University, merging advancements in information theory and learning efficiency with large-scale distributed systems. We believe that the next major breakthrough in AI will stem from innovations in efficient systems rather than simply larger models.

What You Will Create
- Global Metadata Substrate: Design and refine the global metadata and transactional substrate that enables atomic consistency and schema evolution across exabyte-scale data systems.
- Adaptive Engines: Architect systems that self-optimize, reorganizing and compressing data according to access patterns to achieve unprecedented efficiency improvements.
- Intelligent Data Layouts: Innovate new encoding and layout strategies that challenge the theoretical limits of signal per byte read.
- Autonomous Compute Pipelines: Spearhead the development of distributed compute platforms that scale predictably and maintain reliability even under extreme load and failure conditions.
- Research to Production: Partner with Granica Research to transform advances in compression and probabilistic modeling into production-ready, industry-leading systems.
- Latency as Intelligence: Optimize for low latency as a key aspect of intelligence.
Cloudflare, Inc.
Join Cloudflare as a Software Engineer specializing in Distributed Systems and Infrastructure. In this role, you will be responsible for designing, implementing, and optimizing scalable systems that enhance the performance and reliability of our services. You will collaborate closely with cross-functional teams to develop innovative solutions that support our mission to help build a better Internet.
At Exa, we are on a mission to create a cutting-edge search engine from the ground up, tailored specifically for AI applications. Our team is dedicated to developing large-scale infrastructure that efficiently crawls the internet, trains advanced embedding models for indexing, and constructs high-performance vector databases in Rust for optimized searching. We also manage a state-of-the-art $5M H200 GPU cluster that activates thousands of machines simultaneously.

As a Software Engineer specializing in Distributed Data Systems, you will design and implement the data infrastructure that drives our operations, from crawling billions of web pages to training sophisticated embedding models and delivering real-time search functionality. You will enjoy significant autonomy in creating systems capable of scaling to hundreds of petabytes. This is your opportunity to work on data pipelines at an unprecedented scale.
About Granica
Granica is an innovative AI research and infrastructure firm dedicated to creating reliable and steerable representations of enterprise data. We build trust through our product Crunch, a policy-driven health layer that ensures large tabular datasets remain efficient, reliable, and reversible. On this foundation, we are developing Large Tabular Models: systems designed to learn cross-column and relational structures in order to provide trustworthy answers and automation with inherent provenance and governance.

Our Mission
AI is currently hampered not only by the design of models but also by the inefficiencies of the data that supports them. Every redundant byte, poorly organized dataset, and inefficient data pathway contributes to significant costs, latency, and energy waste as we scale. Granica aims to eliminate these inefficiencies. We merge cutting-edge research in information theory, probabilistic modeling, and distributed systems to craft self-optimizing data infrastructure: systems that consistently enhance the representation and utilization of information by AI.

Our engineering team collaborates closely with the Granica Research group led by Prof. Andrea Montanari of Stanford University, bridging advancements in information theory and learning efficiency with large-scale distributed systems. Together, we firmly believe that the next major advancement in AI will stem from breakthroughs in efficient systems rather than merely larger models.

Your Contributions
- Global Metadata Substrate: Design a transactional and metadata substrate that facilitates time travel, schema evolution, and atomic consistency across massive petabyte-scale tabular datasets.
- Adaptive Engines: Develop systems that autonomously reorganize data, learning from access patterns and workloads to maintain peak efficiency without manual tuning.
- Intelligent Data Layouts: Optimize bit-level organization (including encoding, compression, and layout) to maximize signal extraction per byte read.
- Autonomous Compute Pipelines: Create distributed compute systems that scale predictably, adapt to dynamic loads, and ensure reliability under failure conditions.
- Research to Production: Apply new algorithms in compression, representation, and optimization that emerge from ongoing research, with opportunities to publish and open-source your work.
- Latency as Intelligence: Design systems that treat minimizing latency as a measure of intelligence.
Stripe, Inc.
Join Stripe as a Staff Software Engineer in our Stream Compute team, where you will play a pivotal role in building scalable solutions that power the financial infrastructure of the internet. As a member of our innovative engineering team, you will leverage your expertise to design and implement robust software solutions that enhance the performance and reliability of our streaming data capabilities.
At Browserbase, we revolutionize web browsing for AI agents and applications. Our headless browser infrastructure automates interactions with websites, simplifies form filling, and replicates user actions seamlessly.

Having raised a $40M Series B last year, we are on an accelerated growth trajectory. Supported by investors such as Kleiner Perkins, CRV, and Notable Capital, our team is committed to realizing our CEO's vision of empowering the best AI tools and transforming web automation.

Our Core Infrastructure team is essential to maintaining the efficiency of our operations. This group tackles significant distributed systems challenges, ensuring our platform's speed, reliability, and scalability.
Cloudflare, Inc.
Join Cloudflare as a Distributed Systems Engineer focusing on our Data Platform, where you will play a pivotal role in developing analytics and alert systems that enhance our services. You will collaborate with a talented team to design scalable and efficient systems to manage and analyze vast amounts of data. Your work will directly impact the performance and reliability of our offerings, ensuring our customers have the best possible experience.
Join Baseten as a Software Engineer focusing on GPU Networking and Distributed Systems. In this pivotal role, you'll collaborate with talented engineers and researchers to develop cutting-edge solutions that leverage GPU technology for high-performance networking operations. Your contributions will be instrumental in shaping the future of distributed systems, enhancing performance, scalability, and reliability.
About Us
Sieve is a pioneering AI research lab dedicated solely to video data. We harness exabyte-scale video infrastructure and innovative video understanding techniques, along with a multitude of data sources, to create datasets that advance the field of video modeling. Given that video constitutes 80% of internet traffic, it serves as a vital medium that fuels creativity, communication, gaming, AR/VR, and robotics. Our mission is to tackle the most significant challenge in the development of these applications: acquiring high-quality training data.

With a small yet highly skilled team of just 15 members, we have formed strategic partnerships with leading AI labs and achieved $XXM in revenue last quarter alone. Our Series A funding round last year was backed by Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.

About the Role
As a Distributed Systems Engineer at Sieve, you will design and implement systems that efficiently manage the compute, scheduling, and orchestration of complex machine learning and ETL pipelines. Your work will ensure these systems operate quickly, reliably, and cost-effectively while processing large volumes of video data.

You will thrive in this role if you are passionate about optimizing system uptime, have experience with cloud technologies, and enjoy working with high-performance distributed systems involving thousands of GPUs. You will also play a key role in developing excellent internal tools and CI/CD pipelines to facilitate rapid iteration.
About Our Team
Join the innovative Sora team at OpenAI, where we are at the forefront of developing multimodal capabilities for our foundation models. Our hybrid research and product team is dedicated to seamlessly integrating multimodal functionalities into our AI solutions, ensuring they are dependable, user-centric, and aligned with our vision of benefiting society at large.

Role Overview
As a Machine Learning Engineer specializing in Distributed Data Systems, you will be instrumental in designing and scaling the infrastructure that facilitates large-scale multimodal training and evaluation at OpenAI. Your role will involve managing complex distributed data pipelines, collaborating closely with researchers to convert their requirements into robust, production-ready systems, and enhancing pipelines that are essential for Sora's rapid iteration cycles. We are seeking detail-oriented engineers with extensive experience in distributed systems who thrive in high-stakes environments and excel at building resilient infrastructure.

This position is located in San Francisco, CA, and follows a hybrid work model requiring three days in the office each week. We also provide relocation assistance for new team members.

Key Responsibilities:
- Design, implement, and maintain data infrastructure systems, including distributed computing, data orchestration, distributed storage, streaming infrastructure, and machine learning systems, with a focus on scalability, reliability, and security.
- Ensure our data platform can scale dramatically while maintaining high reliability and efficiency.
- Collaborate with researchers to gain a deep understanding of their requirements, translating them into production-ready systems.
- Strengthen, optimize, and manage critical data infrastructure systems that support multimodal training and evaluation.

You Will Excel in This Role If You:
- Possess strong experience with distributed systems and large-scale infrastructure, coupled with a keen interest in data.
- Exhibit meticulous attention to detail and a commitment to building and maintaining reliable systems.
- Demonstrate solid software engineering fundamentals and effective organizational skills.
- Thrive in environments characterized by ambiguity and rapid change.

About OpenAI
OpenAI is a trailblazing AI research and deployment organization committed to ensuring that general-purpose artificial intelligence serves humanity. We continuously push the boundaries of AI capabilities and strive to create technology that benefits everyone.
About the Role:
Join our dynamic ML Infrastructure team as a Software Engineer, where you'll collaborate closely with the Machine Learning and Product teams to build top-tier machine learning inference platforms. These platforms drive vital services such as personalized recommendations, search, and content understanding at Tubi.

Your primary focus will be the development and maintenance of low-latency ML model serving systems for Deep Learning, LLM, and Search models. This includes building self-service infrastructure and critical components such as the inference engine, feature store, vector store, and experimentation engine.

In this role, you'll enhance our service deployment and operational processes, with opportunities to contribute to open-source projects. You'll enjoy architectural freedom to explore innovative frameworks, spearhead significant cross-functional projects, and elevate the capabilities of our ML and Product teams.

We are currently hiring for two positions:
- Staff Software Engineer
- Principal Software Engineer

Additional Details: As a Principal Engineer, you will serve as a technical leader and visionary, guiding the advancement of our machine learning platform. You'll address complex technical challenges, shape architectural decisions, and mentor senior engineers, fostering a culture of excellence and continuous improvement. Your contributions will impact millions of users.
Join Cloudflare as a Distributed Systems Engineer and help us build and maintain our innovative Data Platform. In this role, you'll work on our Analytical Database Platform, enhancing data processing and storage technologies to support our global client base. If you are passionate about distributed systems and enjoy solving complex problems, this is the perfect opportunity for you.
At Scribd, Inc., we are dedicated to enhancing human understanding through our suite of products, including Scribd®, Slideshare®, Everand™, and Fable. Our mission revolves around transforming access into deeper insights and expertise for billions globally.

Our Culture
We foster a culture where authenticity and boldness are encouraged, where constructive debates lead to commitment, and where every team member is empowered to prioritize customer needs. We believe that exceptional work emerges from harmonizing individual flexibility with a strong sense of community. Our Scribd Flex program allows employees to select their preferred work style and location, while also emphasizing intentional in-person interactions to enhance collaboration and culture. All employees are expected to participate in occasional in-person meetings, regardless of their location.

We look for team members who embody "GRIT": the intersection of passion and perseverance towards long-term goals. GRIT serves as a framework for our operations: setting and achieving Goals, delivering impactful Results, contributing Innovative ideas, and building a strong Team through collaboration.

Join us at Scribd (pronounced "scribbed") as we ignite human curiosity and create a world filled with stories and knowledge, democratizing the exchange of ideas and empowering collective expertise.

The Team
Our ML Data Engineering team powers metadata extraction, enrichment, and content understanding across our platforms.
Gimlet Labs
At Gimlet Labs, we are pioneering the first heterogeneous neocloud tailored for AI workloads. As AI technology evolves, the industry confronts critical limitations in power, capacity, and cost tied to traditional homogeneous, vertically integrated infrastructure. Gimlet addresses these challenges by decoupling AI workloads from the underlying hardware, intelligently partitioning them into components and orchestrating each onto the hardware that best meets its performance and efficiency needs. This approach enables heterogeneous systems across diverse vendors and generations of hardware, including the latest emerging accelerators, resulting in significant improvements in performance and cost efficiency at scale.

Building upon this platform, Gimlet is developing a production-grade neocloud for agentic workloads. Our customers can deploy and manage their workloads through stable, production-ready APIs without the complexities of hardware selection, placement, or low-level performance optimization. Gimlet collaborates with foundational labs, hyperscalers, and AI-native companies to enable real production workloads designed to scale to gigawatt-class AI datacenters.

We are currently seeking a Member of Technical Staff specializing in distributed systems. In this role, you will help develop the core platform responsible for scheduling, routing, and managing AI workloads reliably at production scale. You will work with systems that coordinate execution across thousands of nodes, provide stable production APIs, and guarantee predictable workload performance under real-world conditions of load and failure.

This position is ideal for engineers passionate about building foundational infrastructure, understanding systems end-to-end, and operating at scale.

