Senior Systems Software Engineer
Experience Level
Senior
Qualifications
About Lumafield
Lumafield is an innovative company at the forefront of engineering technology, committed to revolutionizing how engineers work with cutting-edge X-ray CT scanning solutions. Our mission is to provide accessible tools that enhance product visibility and drive efficiency in engineering processes.
Similar jobs
Lumafield
About Lumafield: Established in 2019, Lumafield has pioneered the development of the world's first accessible X-Ray CT scanner specifically designed for engineers. Our intuitive scanner, combined with cloud-based software, empowers engineers to gain unparalleled insights into their projects at a remarkably affordable cost. Engineers face high-stakes decisions daily, necessitating tools that provide maximum visibility into their designs. By delivering exceptional product clarity and AI-enhanced tools that identify issues and produce quantitative insights, Lumafield is set to transform the creation, manufacturing, and application of complex products across various sectors. Our company thrives on impact and is dedicated to delivering the utmost value to our customers, ensuring their needs drive our development. Our talented team consists of leading researchers, industrial designers, PhD holders, innovators, and startup founders, all working collaboratively without egos. We proudly receive backing from prestigious venture capital firms, including Kleiner Perkins, Lux Capital, DCVC, and Spark Capital.
Headquartered in Cambridge, MA, with an additional office in San Francisco, CA, we are excited to grow our team.
About the Role: As a Senior Systems Software Engineer at Lumafield, you will be instrumental in developing the software that drives our cutting-edge, in-line manufacturing CT scanning products. You will engage with state-of-the-art X-ray physics, high-speed detectors, image processing, and embedded systems. Collaborating within a small team focused on our latest hardware, you will harness your expertise to maximize system performance and achieve outstanding results for our clients. This position is perfect for those eager to take ownership of embedded systems, firmware, and software design in an early-stage product environment. This role is based in our San Francisco, CA office, with occasional travel required to our Cambridge, MA office.
At NerdWallet, we are committed to empowering individuals to make informed financial decisions. Our team comprises exceptional individuals who thrive in an inclusive, flexible, and candid environment. Whether you choose to work remotely or in the office, we prioritize your well-being, professional development, and the impact you can make. We believe that when one of us elevates our skills, the whole team benefits.
As part of NerdWallet’s Platform team, you will oversee the systems that serve as the backbone of our consumer experience. This includes management of our centralized product data platform, partner ingestion pipelines, publishing and click-tracking infrastructure, GraphQL gateway operations, and our high-traffic, headless WordPress CMS. These platforms deliver precise, compliant, and high-performance product and content experiences to millions of users on both web and mobile platforms. We are searching for a Senior Engineering Manager to lead this team in modernizing legacy services into scalable and reliable systems while advancing our vision of a decoupled, adaptable platform that facilitates quicker publishing, enhanced observability, and future growth.
In the role of Senior Engineering Manager for Platform Systems, you will guide and support a team of engineers in delivering high-quality, scalable, and secure software that aligns with NerdWallet’s product and business objectives. You will collaborate closely with Product Managers and other cross-functional partners to define the roadmap, prioritize tasks, and eliminate obstacles, while nurturing strong engineering practices and a culture of continuous improvement. Your responsibilities will include ensuring technical quality, team well-being, and daily operations, while mentoring engineers, making strategic technical decisions, and balancing immediate deliverables with long-term sustainability, compliance, and reliability.
This position reports to the Director of Engineering.
Opportunities for Impact:
Lead, mentor, and develop a high-performing engineering team responsible for NerdWallet’s platform systems, including the Content Platform, CMS, and Product Data Platform.
Collaborate with Product Managers and cross-functional teams to strategize, prioritize, and execute the product roadmap.
Champion consistent adherence to software development best practices, including code quality, testing, documentation, and operational excellence.
Influence and guide technical and architectural decisions to ensure solutions are scalable, secure, reliable, and compliant with regulatory standards.
Balance immediate project needs with long-term project vision and maintainability.
About Granica
Granica is an innovative AI research and infrastructure firm dedicated to creating reliable and steerable representations of enterprise data.
We build trust through our product Crunch, a policy-driven health layer that ensures large tabular datasets remain efficient, reliable, and reversible. On this solid foundation, we are developing Large Tabular Models: systems designed to learn cross-column and relational structures in order to provide trustworthy answers and automation with inherent provenance and governance.
Our Mission
AI is currently hampered not only by the design of models but also by the inefficiencies of the data that supports them. Every redundant byte, poorly organized dataset, and inefficient data pathway contributes to significant costs, latency, and energy waste as we scale.
Granica aims to eliminate these inefficiencies. We merge cutting-edge research in information theory, probabilistic modeling, and distributed systems to craft self-optimizing data infrastructures: systems that consistently enhance the representation and utilization of information by AI.
Our engineering team collaborates closely with the Granica Research group led by Prof. Andrea Montanari of Stanford University, bridging advancements in information theory and learning efficiency with large-scale distributed systems. Together, we firmly believe that the next major advancement in AI will stem from breakthroughs in efficient systems rather than merely larger models.
Your Contributions
Global Metadata Substrate: Design a transactional and metadata substrate that facilitates time-travel, schema evolution, and atomic consistency across massive petabyte-scale tabular datasets.
Adaptive Engines: Develop systems that autonomously reorganize data, learning from access patterns and workloads to maintain peak efficiency without the need for manual tuning.
Intelligent Data Layouts: Optimize bit-level organization (including encoding, compression, and layout) to maximize signal extraction per byte read.
Autonomous Compute Pipelines: Create distributed compute systems that scale predictably, adapt to dynamic loads, and ensure reliability under failure conditions.
Research to Production: Apply new algorithms in compression, representation, and optimization that emerge from ongoing research. We encourage opportunities to publish and open-source your work.
Latency as Intelligence: Design systems that inherently minimize latency as a measure of intelligence.
Discord Inc.
Join Discord, a vibrant platform that connects over 200 million users monthly, primarily through gaming. More than 90% of our users engage in gaming, contributing to an astounding 1.5 billion hours spent playing diverse titles each month. At Discord, we are committed to enhancing the gaming experience by fostering fun and engaging conversations among players before, during, and after gameplay.
The Design Systems team is a dynamic, cross-functional unit dedicated to ensuring top-notch design quality across every aspect of user interface, content, and graphics. We create fundamental design elements and standards that empower engineers and designers in the development of Discord’s products. Our collaborative efforts accelerate the delivery of new features that are not only accessible and consistent but also of the highest quality. Additionally, we set the accessibility benchmarks for Discord and assist other engineering teams in meeting these standards. This pivotal role directly supports Discord's mission of creating an inclusive community where everyone can find their place.
Foundry Robotics
About Us
At Foundry Robotics, we are pioneering an AI-driven, assembly-centric contract manufacturing model. Our approach integrates software-defined manufacturing, where both robots and humans collaborate to achieve tangible production outcomes.
Our unique value proposition transcends simple models; we deliver a powerful software infrastructure that converts:
Orders into Schedules
Schedules into Work Instructions
Work into Telemetry
Telemetry into Continuous Improvement of Throughput, Quality, and Cost
This position is focused on senior software engineering, specifically in creating the cloud and on-premises systems that operate the factory, alongside the data architecture that ensures these systems are observable, reliable, and scalable.
Please note, we are also recruiting Robotics Research Engineers separately; this role is distinct from that.
The Role
We are in search of a Senior Software Engineer (Individual Contributor) to develop the backend systems and infrastructure that will drive the factory of the future.
Your responsibilities will include designing and implementing:
Cloud-based backend systems
On-premises factory services
Embedded-to-cloud data flows
Petabyte-scale visual data pipelines
Machine learning infrastructure across diverse computational environments
You will be tasked with creating production systems that remain operational in challenging conditions, such as noisy factories, unreliable connectivity, and the real costs associated with downtime.
This is a practical engineering role where you will be expected to deliver tangible results.
Key Responsibilities
Cloud & On-Prem Architecture
Develop and sustain a cloud-based backend for planning, scheduling, execution, inventory management, traceability, and telemetry
Design architectures that span embedded, on-premises, and cloud systems across multiple compute environments
Implement infrastructure-as-code principles using Terraform
Ensure secure and reliable data transfer between the factory floor and the cloud
ML Infrastructure & Data Systems
Establish petabyte-scale visual data pipelines for the ingestion, curation, indexing, and preparation of manufacturing data
At Databricks, we are driven by a passion for empowering data teams to tackle the world’s most challenging problems, from transforming transportation to accelerating medical innovations. We achieve this by creating and maintaining the leading data and AI infrastructure platform, enabling our clients to leverage profound data insights for business enhancement. Founded by engineers with a customer-first mentality, we eagerly embrace every opportunity to tackle complex technical challenges, ranging from the design of next-generation UI/UX for data interactions to scaling our services across millions of virtual machines. Our journey has just begun.
As a member of the Runtime team at Databricks, you will be instrumental in developing the next generation of distributed data storage and processing systems. These systems will surpass specialized SQL query engines in relational query performance while offering the programming abstractions necessary to support a variety of workloads, from ETL to data science.
Example projects include:
Apache Spark™: Contribute to the de facto open-source standard framework for big data.
Data Plane Storage: Develop reliable and high-performance services and client libraries for managing vast amounts of data within cloud storage backends like AWS S3 and Azure Blob Store.
Delta Lake: Design a storage management system that merges the scalability and cost-effectiveness of data lakes with the performance and reliability of data warehouses, providing features like ACID transactions and time travel.
Delta Pipelines: Simplify the orchestration and operation of numerous data pipelines, enabling clients to deploy, test, and upgrade pipelines effortlessly.
Performance Engineering: Create the next-generation query optimizer and execution engine that is fast, scalable, and robust.
At Scribd, Inc., we are dedicated to enhancing human understanding through our suite of innovative products, including Scribd®, Slideshare®, Everand™, and Fable. Our mission revolves around transforming access into deeper insights and expertise for billions globally.
Our Culture
We foster a culture where authenticity and boldness are encouraged; where constructive debates lead to commitment, and where every team member is empowered to prioritize customer needs.
We believe that exceptional work emerges from harmonizing individual flexibility with a strong sense of community. Our Scribd Flex program allows employees to select their preferred work style and location, while also emphasizing the importance of intentional in-person interactions to enhance collaboration and culture. All employees are expected to participate in occasional in-person meetings, regardless of their location.
We look for team members who embody “GRIT”: the intersection of passion and perseverance towards long-term goals. GRIT serves as a framework for our operations: setting and achieving Goals, delivering impactful Results, contributing Innovative ideas, and building a strong Team through collaboration.
Join us at Scribd (pronounced “scribbed”) as we ignite human curiosity and create a world filled with stories and knowledge, democratizing the exchange of ideas and empowering collective expertise.
The Team
Our ML Data Engineering team is responsible for powering metadata extraction, enrichment, and content understanding across our platforms.
Join Crusoe as a Principal Systems Software Engineer and play a vital role in revolutionizing the tech industry. You will lead the development of innovative software solutions that enhance our systems and platforms, contributing to the overall mission of providing efficient and sustainable computing resources. Your expertise will help shape the future of our software architecture and ensure seamless integration across various applications.
About Our Team:
Join the innovative Database Systems team at OpenAI, where we specialize in high-performance distributed databases. We are the architects behind Rockset, a cutting-edge real-time search, analytics, and vector database that powers all vector search and retrieval augmented generation (RAG) at OpenAI. Rockset underpins core functionalities across all OpenAI product lines and supports various critical internal applications.
About the Role:
We are in search of engineers who are passionate about distributed systems, performance optimization at a low level (with our core engine developed in C++), and constructing scalable database infrastructures from scratch. As a member of the Database Systems team, you will play a key role in enhancing the core database engine, making significant contributions to ingestion, query execution, indexing, and storage improvements. You will collaborate with multiple teams across OpenAI to unlock new product capabilities and ensure the reliability and scalability of our online database as usage expands exponentially.
Your Responsibilities Will Include:
Design, develop, and maintain high-performance distributed systems.
Identify and address performance bottlenecks to elevate infrastructure capabilities.
Define and guide the long-term technical vision and evolution of the system.
Collaborate with product, engineering, and research teams to deliver robust and scalable infrastructure.
Investigate complex production issues across the entire technology stack.
Contribute to incident response, retrospective analyses, and establishing best practices for system reliability.
You Will Excel In This Role If You:
Possess substantial experience in building, scaling, and optimizing distributed systems.
Exhibit a keen interest in database internals, storage engines, or low-latency query systems.
Enjoy tackling complex performance challenges in high-throughput systems.
Have experience managing and operating production clusters at scale (e.g., Kubernetes or similar orchestration tools).
Approach scalability, correctness, and reliability with a rigorous mindset.
Thrive in a fast-paced environment where you can make a significant impact.
Qualifications:
4+ years of relevant industry experience with a focus on distributed systems.
Proficiency in C++ or similar low-level programming languages.
Strong problem-solving skills and attention to detail.
Experience with performance monitoring and optimization tools.
Excellent collaboration and communication skills.
The Role
Are you a skilled software engineer with a proven track record in building and refining production systems? Are you eager to apply your expertise at the forefront of AI technology? If so, this opportunity may be perfect for you. As a Senior Software Engineer on our Natural Language Understanding team within the “agent lab,” you will be pivotal in our mission to enhance the capabilities of AI agents for reliable, scalable performance. You will have the chance to influence the evolution of the Moveworks AI Assistant platform in several key areas: agent orchestration, sandboxed file systems, latency optimization, and multimodal I/O, among others. You will leverage the best tools in enterprise AI, including cutting-edge LLMs from top providers like OpenAI. Our team prioritizes rapid innovation on scalable infrastructure while tackling challenging product and engineering obstacles to deliver exceptional value to our clients. If you are looking to achieve the pinnacle of your career alongside a passionate, dedicated team focused on making an impact, we invite you to connect with us.
About Ditto
Ditto builds technology for resilient, real-time data flow at the edge. The company’s peer-to-peer synchronization engine keeps devices connected and data consistent, even when internet access is unreliable or unavailable. Organizations like Chick-fil-A, Delta Airlines, and the U.S. military use Ditto to power mission-critical experiences in aviation, retail, travel, hospitality, and defense. With over $145 million in funding, Ditto is a fast-growing, globally distributed startup committed to building a diverse and inclusive team, essential for solving tough connectivity problems in challenging environments.
Role Overview: Senior Software Engineer - Autonomy (Remote)
This Senior Software Engineer role focuses on autonomy and field deployment. As a Forward Deployed Engineer, work directly with key users to integrate Ditto’s platform into operational environments, especially where robotics and real-time data are essential. The position calls for adaptable engineers who can quickly solve complex technical challenges and reduce the time it takes for customers to realize value from Ditto’s software. Expect to collaborate closely with both users and Ditto’s core product engineering team, relaying technical feedback and feature requests. The work often involves ambiguity, rapid troubleshooting, and direct involvement in field testing.
Key Responsibilities
Integrate with Robotic Platforms: Lead on-site software integration with unmanned ground, aerial, and maritime systems. Establish reliable data connections between Ditto’s synchronization layer and various robotic autonomy stacks.
Develop on ROS2 and DDS Middleware: Design, build, and debug software nodes within ROS2 frameworks. Use DDS (Data Distribution Service) for real-time, publish-subscribe communication between robotic subsystems and Ditto’s platform.
Implement MAVLink Integrations: Create and refine MAVLink-based communication channels for telemetry, command, and control of unmanned aerial systems. Ensure dependable data transfer between Ditto’s platform and autopilot firmware.
Solve Problems in Real-Time: Act as the first line of technical support during field testing. Diagnose and resolve software, sensor, and communication issues on robotic platforms as they arise.
Location: Remote. Candidates in Atlanta, Austin, San Francisco, or Seattle are encouraged to apply.
sfcompute
Join us at sfcompute, where we are revolutionizing the future by mitigating risks associated with the largest infrastructure development in history.
As the demand for GPU clusters surges, financing these data centers and their supporting infrastructure has never been more critical. Our innovative approach ensures that financing is secured through long-term contracts, providing peace of mind to both lenders and developers.
In the fast-paced world of AI and compute resources, we are creating a liquid market for GPU offtake, allowing even small startups to access high-end computing power without the burdens of traditional financing.
About the Role
As a Systems Software Engineer at sfcompute, you will be instrumental in developing a GPU market that brings the advanced software capabilities of hyperscalers to our innovative GPU neoclouds. Your responsibilities will encompass provisioning and monitoring bare metal servers with our virtualization orchestration software, as well as collaborating with our GPU marketplace to facilitate user configurations of VMs, networks, and storage.
Key tasks include creating and maintaining a Linux OS image tailored for our tools, ensuring consistent deployment across nodes with specific data-center adjustments, and designing the API protocols and servers for user interaction.
Our primary programming language is Rust, which enables us to write efficient code across all system layers, from web servers to kernel coordination. If you are familiar with manually memory-managed languages like C and possess experience in higher-level programming, we encourage you to apply.
About Scribd:
At Scribd Inc. (pronounced “scribbed”), we ignite human curiosity by fostering a world rich in stories and knowledge. Join our innovative team as we democratize the flow of ideas and information, empowering collective expertise through our diverse product offerings: Everand, Scribd, Slideshare, and Fable.
This job posting represents an open and approved position within our organization.
We cultivate a culture where authenticity and boldness thrive; where we engage in vibrant discussions and embrace the unexpected; and where every employee is empowered to take meaningful actions with a firm customer focus.
Our work structure emphasizes a balance between individual flexibility and community engagement. Through our Scribd Flex program, employees can collaborate with their managers to determine the most effective work style that meets their personal needs. A core principle of Scribd Flex is prioritizing intentional in-person interactions that foster collaboration, culture, and connection. Therefore, occasional in-person attendance is required for all Scribd employees, irrespective of their location.
What do we seek in new team members? We hire for “GRIT.” GRIT embodies the blend of passion and perseverance towards long-term goals. At Scribd Inc., we believe in the possibilities this can unlock and encourage our employees to adopt a GRIT-driven approach to their work. In practical terms, GRIT represents our standards: we look for individuals who can set and accomplish Goals, deliver Results in their roles, contribute Innovative ideas and solutions, and positively impact the Team through collaboration and attitude.
Aurelius Systems
About Us:
Aurelius Systems is a venture capital-backed startup at the forefront of defense technology, specializing in the development of autonomous, edge-deployed robotic systems utilizing directed energy for counter-unmanned aerial systems (UAS).
Our innovative approach involves creating laser systems designed to neutralize drones.
With a dedicated team of approximately 10 engineers, former U.S. military personnel, and industry experts, we are committed to advancing America's capabilities in directed energy technology, delivering the first cost-effective and reliable laser weapon systems.
Inspired by the philosophy of Marcus Aurelius, we emphasize consistent effort and accountability in our work, embodying a culture of high output without excuses. Following in the footsteps of pioneers like Henry Ford, we embrace innovation and action within our small but impactful team.
In addition to our San Francisco headquarters, we are proud to operate a manufacturing hub in Detroit and conduct field tests weekly on our expansive private range.
If you thrive on seeing your engineering contributions directly in action rather than being confined to a lab, we encourage you to explore this opportunity.
The Position & Your Contribution:
As a Robotics Software Systems Engineer, your primary responsibility will be to ensure that all subsystems function seamlessly and efficiently together.
Our system comprises a complex array of subsystems including sensing, computer vision, machine learning inference, control systems, power management, and mechanical actuation. Achieving minimal processing time and inter-process latency is crucial for successfully targeting our nimble and evasive UAS.
The key area we are looking to fill is real-time systems performance at the hardware interface. You should possess a deep understanding of how software execution impacts physical system behavior, how latency accumulates across CPU, GPU, memory, and I/O, and how bandwidth limitations influence sensor data processing. We need an engineer who is detail-oriented, considering microseconds, memory bandwidth, cache behavior, and system determinism.
In our tight-knit team of around 10 engineers, you will have the opportunity to take ownership of systems that are field-tested. The success of our tests is binary: it's either effective or it isn't, and your role will involve iterative improvement based on real-world outcomes.
Your Responsibilities:
Manage the latency budget for the entire platform, from data sensing to actuation.
Profile and mitigate latency across CPU, GPU, memory, and I/O interfaces.
Develop and optimize kernels for high-throughput, low-latency operations.
Adjust memory access patterns for optimal performance.
About Anthropic
At Anthropic, our mission is to develop AI systems that are reliable, interpretable, and steerable. We are committed to ensuring that AI technology is safe and beneficial for our users and society at large. Our rapidly expanding team comprises dedicated researchers, engineers, policy experts, and business leaders, all working collaboratively to create beneficial AI solutions.
About the Role
The Infrastructure organization at Anthropic plays a critical role in our mission to create reliable AI systems. The systems we develop are essential for accelerating the training of new models, conducting safety experiments effectively, and scaling our AI technology, Claude, to serve millions of users. We strive to demonstrate that robust infrastructure and cutting-edge capabilities can work together harmoniously.
The Systems engineering team is responsible for ensuring compute uptime and resilience at scale, building the clusters, automation, and observability that enable safe and effective frontier AI research and deployment.
Team Matching: After the interview process, team assignments are based on interview performance, individual interests, and business needs. Candidates may be considered for various Infrastructure teams.
About Our Team
The Platform Systems team at OpenAI is at the forefront of innovation, merging advanced AI technologies with large-scale distributed systems. We are tasked with creating the engineering and research infrastructure essential for training OpenAI's premier models on some of the most powerful, custom-built supercomputers globally.
Our team is dedicated to developing the core software for model training, delving deep into the technological stack. This encompasses collective communication, compute efficiency, parallelism strategies, fault tolerance, failure detection, and observability. The systems we design are pivotal to enhancing OpenAI's research capabilities, facilitating reliable and efficient training at the leading edge of technology.
We work in close partnership with researchers across the organization, continuously integrating insights from various OpenAI projects to advance our training platform.
About the Role
As a Software Engineer specializing in Platform Systems, you will architect and develop distributed systems that enhance visibility into large-scale training operations, ensuring their dependable operation at scale.
Your responsibilities will include designing systems for failure detection, tracing, and observability that pinpoint slow or malfunctioning nodes, identify performance bottlenecks, and assist engineers in optimizing extensive distributed training tasks. This infrastructure is integral to the functionality of OpenAI's training stack and is continuously evolving to accommodate new use cases and increasingly intricate workloads.
This position is central to our training infrastructure, merging systems engineering, performance analysis, and large-scale debugging.
Key Responsibilities
Design and develop distributed failure detection, tracing, and profiling systems tailored for large-scale AI training jobs.
Create tools to identify slow, faulty, or errant nodes and deliver actionable insights into system behavior.
Enhance observability, reliability, and performance across OpenAI's training platform.
Troubleshoot and resolve issues within complex, high-throughput distributed systems.
Collaborate effectively with systems, infrastructure, and research teams to advance platform capabilities.
Adapt and expand failure detection and tracing systems to support new training paradigms and workloads.
Ideal Candidate Profile
Possesses a deep passion for performance, stability, and observability in distributed systems.
Demonstrates proficiency in systems engineering and performance analysis.
Has experience in debugging high-throughput distributed systems.
Exhibits strong collaboration skills with a track record of working with cross-functional teams.
Shows adaptability and eagerness to embrace new technologies and methodologies.
Location: San Francisco, CA (Hybrid: 4 days onsite/week). Relocation assistance available.
About Our Team:
At OpenAI, we are at the forefront of technology, creating foundational platform software that ensures our consumer products are reliable, secure, and high-performing. Our team collaborates across various system layers, working closely with engineering partners to deliver exceptional capabilities from initial concept to final launch.
Role Overview:
We are looking for a passionate Systems Software Engineer to lead the design, implementation, and debugging of critical platform components and the pipelines that build and update system images. Your focus will span across operating system layers, emphasizing performance optimization, security enhancements, and in-depth system debugging to deliver production-grade systems that exceed expectations.
Key Responsibilities:
Design and develop robust system-level components and services within both kernel and user spaces.
Configure and maintain essential OS platform services (init, services, networking, security policies) and related tools.
Build and manage image and update pipelines, ensuring their reliability, reproducibility, and rollback safety.
Instrument system performance through profiling and tracing; enhance CPU, memory, I/O, and energy efficiency.
Oversee platform observability and reliability, including logging, crash capture, watchdogs, and diagnostics.
Collaborate with cross-functional teams to define interfaces and deliver comprehensive end-to-end features.
Establish and promote strong engineering practices such as code reviews, continuous integration, reproducible builds, and effective release management.
Work alongside external vendors to support builds and deployments.
You Will Excel in This Role If You:
Have successfully launched production systems software on modern operating systems.
Possess proficiency in C/C++ and a scripting language, with a strong understanding of OS internals including concurrency, memory management, filesystems, networking, and power management.
Demonstrate exceptional systems debugging skills utilizing debuggers, tracers, profilers, and logs across kernel/user-space boundaries.
Comprehend the configuration of platform services and interfaces, effectively translating requirements into stable, well-documented APIs.
Are knowledgeable about user-space foundations including service management, IPC, networking, packaging, and automation.
Have experience collaborating with external partners to deliver high-quality software solutions.
Join Carta as a Senior Software Engineer II in our Design Systems team, where you will play a pivotal role in shaping the user experience across our products. Collaborating with cross-functional teams, you will design, develop, and refine our design system to ensure consistency and efficiency. Your expertise will help drive innovation while maintaining performance and scalability.
Why Join Achira?
Become part of an exceptional team comprised of scientists, ML researchers, and engineers dedicated to transforming the landscape of drug discovery.
Engage with cutting-edge machine learning infrastructure at an unprecedented scale, leveraging extensive computing resources, vast datasets, and ambitious goals.
Take ownership of significant projects from conception through to architecture and deployment on large-scale infrastructures.
Thrive in a culture that values thoroughness, speed, and a proactive, builder-oriented mindset.
About the Role
At Achira, we are developing state-of-the-art foundation models that address the most complex challenges in simulation for drug discovery and beyond. Our atomistic foundation simulation models (FSMs) serve as comprehensive representations of the physical microcosm, encompassing machine learning interaction potentials (MLIPs), neural network potentials (NNPs), and various generative model classes.
We are looking for a Software Engineer who is enthusiastic about distributed computing and its applications in machine learning. You will play a pivotal role in designing and constructing the infrastructure for our ML data generation pipelines, model training, and fine-tuning workflows across large-scale distributed systems.
Your expertise will be crucial in ensuring our compute clusters are efficient, observable, cost-effective, and dependable, enabling us to advance the frontiers of ML development. If you are passionate about distributed systems, performance optimization, and cloud cost efficiency, we encourage you to apply.
You will be empowered to conceptualize and manage complex workloads across multiple vendors worldwide. Achira's mission revolves around computation, and providing seamless access to our uniquely tailored workloads at the lowest possible cost is critical to our success.
About Our Team
The Frontier Systems team at OpenAI is at the forefront of technology, responsible for creating, deploying, and maintaining some of the world's largest supercomputers. These supercomputers are pivotal for training our most advanced AI models, pushing the boundaries of innovation.
We transform sophisticated data center designs into operational systems and develop the software infrastructure necessary for extensive frontier model training. Our goal is to ensure these hyperscale supercomputers operate reliably and efficiently, supporting groundbreaking AI research.
About the Role
As a key member of the Frontier Systems team, you will be instrumental in designing the critical infrastructure that ensures our supercomputers function seamlessly for pioneering AI research. In this role, you'll address system-level challenges and implement automation solutions that minimize disruptions during large-scale training processes.
Your responsibilities will encompass end-to-end ownership of your projects, allowing you to make significant contributions to our mission.
This position is ideal for individuals who excel in diagnosing complex system issues and crafting automation strategies to proactively resolve problems across a vast network of machines.
Your Responsibilities Include:
Enhancing system health checks to maintain the stability of our hyperscale supercomputers during model training.
Conducting in-depth investigations into hardware failures and system-level bugs to uncover root causes.
Developing automation tools that monitor and resolve issues across thousands of systems, enabling uninterrupted research progress.
You May Be a Great Fit If You Possess:
7+ years of hands-on experience in software engineering.
Strong proficiency in Python and shell scripting.
Expertise in analyzing complex data sets using SQL, PromQL, Pandas, or other relevant tools.
Experience in creating reproducible analyses.
A solid balance of skills in both building and operationalizing systems.
Prior experience with hardware is not a prerequisite for this role.
Preferred Qualifications:
Familiarity with the intricacies of hardware components, protocols, and Linux tools (e.g., PCIe, Infiniband, networking, power management, kernel performance tuning).
Experience with system optimization and performance tuning.

