Senior Software Engineer - Fleet Health at Gridware | San Francisco
Experience Level
Senior
Qualifications
About Gridware
Gridware is a leading-edge technology company dedicated to enhancing the reliability and safety of the electrical grid. Our pioneering Active Grid Response (AGR) platform utilizes high-precision sensors and innovative management strategies to ensure efficient grid operation and proactive maintenance. We are committed to sustainability and backed by influential climate-tech investors.
Similar jobs
Join the Fleet team at OpenAI, where we empower groundbreaking research and innovative product development by maintaining a robust computing environment. Our team manages extensive systems that encompass data centers, GPUs, and networking, ensuring peak performance, high availability, and efficiency. Our mission is to facilitate the seamless operation of OpenAI's models at scale, supporting both internal research initiatives and external products such as ChatGPT, while prioritizing safety, reliability, and responsible AI deployment over unchecked expansion.
About the Position
As a Software Engineer specializing in Operating Systems & Orchestration, you will play a crucial role in developing systems that manage our hardware, configurations, vendors, and the teams utilizing our infrastructure. Your work will involve designing and implementing solutions that fuse individual nodes and servers into cohesive clusters, directly enhancing the AI research experience. This role is located in San Francisco, CA, and follows a hybrid work model, requiring three days in the office each week. We also provide relocation assistance for new hires.
Key Responsibilities:
Architect and develop systems to manage extensive cloud and bare-metal infrastructures at scale.
Create tools that correlate low-level hardware metrics with high-level job scheduling and cluster management algorithms.
Utilize Large Language Models (LLMs) to streamline vendor operations and enhance infrastructure workflows.
Automate infrastructure processes to minimize repetitive tasks and bolster system reliability.
Work collaboratively with hardware, infrastructure, and research teams to ensure smooth integration across all components.
Continuously refine tools, automation, processes, and documentation to boost operational effectiveness.
Ideal Candidate Profile:
Demonstrates strong software engineering capabilities with experience in large-scale infrastructure environments.
Possesses extensive knowledge of cluster-level systems (e.g., Kubernetes, CI/CD pipelines, Terraform, cloud platforms).
Has deep expertise in server-level systems (e.g., systems, containerization, Chef, Linux kernels, firmware management, host routing).
Is passionate about enhancing the performance and reliability of large compute fleets.
Thrives in fast-paced environments and is eager to tackle complex challenges.
Join the Fleet Infrastructure team at OpenAI, where you will play a pivotal role in managing and enhancing one of the world's largest and most efficient GPU fleets, dedicated to powering OpenAI's advanced model training and deployment initiatives. Your contributions will range from:
Developing user-friendly scheduling and quota systems to maximize GPU utilization.
Creating automated solutions for seamless Kubernetes cluster provisioning and upgrades, ensuring a robust and low-maintenance platform.
Building service frameworks and deployment systems that support diverse research workflows.
Enhancing model startup times through high-performance snapshot delivery, leveraging advanced blob storage and hardware caching techniques.
And much more!
About the Role
As a Software Engineer in Fleet Infrastructure, you will design, develop, deploy, and maintain essential infrastructure systems that facilitate model training and deployment on a massive GPU fleet. This role presents an exciting opportunity to influence a critical system that supports OpenAI's mission to responsibly advance AI capabilities, all while working in a fast-paced environment with tight deadlines.
Positioned in San Francisco, CA, we embrace a hybrid work model, encouraging three days in the office each week, along with offering relocation assistance for new hires.
In this role, you will:
Design, implement, and manage components of our compute fleet, focusing on job scheduling, cluster management, snapshot delivery, and CI/CD systems.
Collaborate closely with research and product teams to understand and meet workload requirements effectively.
Work alongside hardware, infrastructure, and business teams to deliver a service characterized by high utilization and reliability.
About Our Team
At OpenAI, the Fleet team is integral to maintaining the robust computing environment that fuels our groundbreaking research and innovative product development. We manage extensive systems encompassing data centers, GPUs, networking, and more, ensuring optimal performance, availability, and efficiency. Our efforts empower OpenAI’s models to function seamlessly at scale, supporting both internal R&D and external offerings like ChatGPT. We emphasize safety, reliability, and responsible AI deployment over unrestrained growth.
About the Role
As a Software Engineer on the Fleet Hardware team, you will play a crucial role in ensuring the reliability and uptime of OpenAI’s compute fleet. Minimizing hardware failures is essential for research training progress and service stability since even minor disruptions can lead to significant setbacks. With the increasing complexity of supercomputers, the pressure to maintain operational integrity has never been higher.
This is a unique opportunity to be at the forefront of technology, pioneering solutions for troubleshooting advanced systems on a large scale. You will work with cutting-edge technologies and innovate solutions to ensure the health and efficiency of our supercomputing infrastructure.
Our team empowers skilled engineers with a significant degree of autonomy and ownership, enabling them to drive impactful change. This role requires a keen focus on comprehensive system investigations and the development of automated solutions. We seek individuals who dive deep into problems, conduct thorough investigations, and create automation for large-scale detection and remediation.
In this Role, You Will:
Design and maintain automation systems for provisioning and managing server fleets.
Develop tools to monitor server health, performance, and lifecycle events.
Collaborate with teams across clusters, networking, and infrastructure.
Partner with external operators to uphold high-quality standards.
Identify and resolve performance bottlenecks and inefficiencies.
Continuously enhance automation to minimize manual tasks.
Waymo is a pioneering autonomous driving technology firm dedicated to becoming the world's most reliable driver. Originating from the Google Self-Driving Car Project in 2009, Waymo has concentrated on creating the Waymo Driver (The World’s Most Experienced Driver™) to enhance mobility accessibility while preventing traffic-related fatalities. The Waymo Driver powers our fully autonomous ride-hailing service and can be integrated across various vehicle platforms and use cases. With over ten million rider-only trips facilitated, our technology has autonomously traversed more than 100 million miles on public roads and billions in simulation across more than 15 U.S. states.
Key Responsibilities:
Architect, develop, test, and optimize Angular applications leveraging TypeScript and contemporary development methodologies.
Construct and enhance mission-critical tools and systems that empower Waymo's expansion into new markets.
Address complex real-world challenges associated with Waymo Fleet Monitoring and the development of reusable platform functionalities.
Collaborate closely with Product, UX, and fellow engineers to design and develop internal user-centric products.
Deliver solutions to unique challenges in a dynamic environment.
About Gridware
Gridware is an innovative technology firm based in San Francisco, committed to advancing and safeguarding the electrical grid. We have pioneered a revolutionary approach to grid management known as Active Grid Response (AGR), which emphasizes the monitoring of the electrical, physical, and environmental variables that influence grid reliability and safety. Our sophisticated AGR platform leverages high-precision sensors to identify potential problems early, facilitating proactive maintenance and fault prevention. This holistic strategy enhances safety, minimizes outages, and ensures the efficient operation of the grid. Backed by prominent climate-tech and Silicon Valley investors, Gridware is at the forefront of sustainable energy solutions. For more details, visit www.Gridware.io.
Role Description
We are on the lookout for a Senior Software Engineer to spearhead the development of systems that manage our expanding fleet of devices, which forms the backbone of Gridware's distributed sensing network. The technology you develop will enable Fleet Health operators to oversee device performance, deploy firmware updates, and ensure the reliability and security of our sensors at scale.
You will collaborate across backend services, internal tools, and observability systems, designing the infrastructure that maintains thousands of edge devices in optimal health. This role is hands-on and demands high ownership, situated at the intersection of software, hardware, and operations. It is ideal for engineers who are passionate about building platforms that create tangible real-world impacts.
Waymo is seeking a passionate and experienced Senior Backend Engineer to join our Fleet Infrastructure team. In this role, you will be responsible for designing, developing, and maintaining robust backend systems that support our fleet of autonomous vehicles. You will work collaboratively with cross-functional teams to ensure that our technology meets the highest standards of performance and reliability.
MongoDB, Inc.
Join MongoDB as a Senior Site Reliability Engineer specializing in Fleet Management. In this role, you will be pivotal in enhancing the reliability and performance of our systems, ensuring seamless operations across our platforms. You will collaborate with cross-functional teams to design, implement, and maintain infrastructure solutions that meet the needs of our growing customer base.
Your expertise will be crucial in identifying performance bottlenecks, automating processes, and orchestrating system deployments. If you are passionate about building scalable and resilient systems and thrive in a fast-paced environment, we want to hear from you!
DigitalOcean
Join our dynamic team at DigitalOcean as a Senior Engineer II, focusing on enhancing Fleet Efficiency. In this role, you will leverage your expertise to optimize our fleet operations, ensuring peak performance and sustainability. Collaborate with cross-functional teams, utilize advanced analytics, and implement innovative solutions that drive efficiency and reduce costs.
As a Lead Fleet Support Engineer on our Vehicle Development team, you will play a pivotal role in ensuring the operational efficiency and reliability of our cutting-edge robotaxi fleet. You will be the primary point of contact for resolving intricate technical challenges across all major vehicle systems, providing hands-on expertise and support in real-time. Your contributions will be essential in maximizing fleet uptime and enhancing our innovative transportation solutions.
About Our Team
As part of the Fleet Scheduling team, our Full Stack Engineers are committed to creating innovative and scalable interfaces that empower researchers to effectively manage AI workloads across some of the largest supercomputing infrastructures globally. We focus on building robust, high-performance systems that deliver real-time insights, resource tracking, and seamless interactions with complex infrastructures. Our mission is to enhance resource allocation, reduce operational overhead, and develop user-friendly tools that boost researcher productivity and system transparency.
About the Role
In this exciting position, you will design, develop, and operate web-based systems that provide an intuitive interface to OpenAI’s supercomputing clusters. You will work closely with researchers, product teams, and infrastructure teams to deliver scalable solutions that facilitate seamless monitoring, job scheduling, and resource management. This is a unique opportunity to engage at the forefront of AI infrastructure, designing tools capable of scaling to exascale workloads while ensuring optimal usability and performance.
This role is based in San Francisco, CA, with a hybrid work model requiring 3 days in the office per week. We also offer relocation assistance to new employees.
In this role, you will:
Design and develop full-stack web applications for real-time tracking and management of large-scale AI workloads.
Collaborate with researchers and infrastructure teams to translate complex operational needs into intuitive user interfaces and scalable backend systems.
Create data visualization tools (e.g., Gantt charts, dashboards) to enhance insights into job scheduling and resource allocation.
Optimize backend services for high data throughput, ensuring low-latency performance and high availability.
Implement frontend components that enable smooth interactions with scheduling, storage, and compute systems.
Guarantee system security, reliability, and scalability across globally distributed supercomputing infrastructure.
You might excel in this role if you:
Have substantial experience in full-stack development, with proficiency in modern frontend frameworks (React, Vue, or Angular) and backend technologies (Python, Go, or Node.js).
Possess a track record of building scalable, high-performance web applications for complex systems.
MongoDB, Inc.
The Team
Join our Platform Engineering department within the Site Reliability Engineering (SRE) team, where we oversee critical infrastructure and operational functions that bolster our engineering organization. Our responsibilities include managing a multi-cloud Kubernetes infrastructure, networking solutions, load balancing for both public-facing and internal needs, and developing observability and alerting systems. The Fleet Management team is pivotal in providing the core runtime environment that enables our developers to create and deliver exceptional products. We handle the complete lifecycle of our Kubernetes fleet, ensuring cluster reliability and security through components like CoreDNS, cert-manager, and Gatekeeper. As we expand our infrastructure to accommodate new products and use cases, we are leading a transition from a Terraform-based Infrastructure as Code (IaC) model to an Operator-driven lifecycle management approach. This role is available in our offices located in Austin, Boston, Los Angeles, New York City, Raleigh, or San Francisco, or you can work remotely within the United States.
Schedule: Sunday–Thursday, 10:00 PM – 6:00 AM
About the Role
Join our vibrant overnight operations team as a Fleet Operations Specialist at 8fleet Inc. In this pivotal role, you will ensure our rideshare fleet is impeccably maintained, safe, and prepared for service every morning. Your contributions will be vital in bridging the transition from our evening to morning operations, guaranteeing that each vehicle is efficiently received, maintained, and launched without delay.
This dynamic and physical position is perfect for individuals who take pride in vehicle maintenance and thrive in a fast-paced environment, tackling problems head-on.
Key Responsibilities
Evening Operations (PM Close)
Receive returning vehicles, collecting keys, tablets, and driver equipment.
Document any reported issues and perform inspections for damage or warning lights.
Conduct light cleaning (vacuuming, surface wipe-downs) and replenish basic supplies.
Preventative Maintenance & Vehicle Care
Perform basic maintenance tasks such as oil and filter changes, tire pressure checks, and fluid top-offs (coolant, wiper fluid, etc.).
Replace wipers, bulbs, and other minor components as necessary.
Log maintenance activities and report larger repair needs to the Fleet Manager or vendors.
Fleet Readiness & Cleaning
Pressure wash vehicles and prepare them for the next shift.
Ensure garage cleanliness, tool organization, and inventory management of supplies.
Morning Launch Support (AM Start)
Prepare vehicles for dispatch in anticipation of driver arrivals.
Address basic app, phone, or equipment issues to facilitate timely launches.
Administrative Duties
Maintain records of vehicle assignments, inspections, and maintenance logs.
Document incidents, driver notes, and any missing items.
Assist the operations team with various fleet or logistics tasks as needed.
Qualifications
High school diploma or equivalent.
Strong attention to detail and problem-solving skills.
Ability to work effectively in a fast-paced, physically demanding environment.
Amplitude
Are you ready to lead a talented team of software engineers in a dynamic environment? As the Software Engineering Manager at Amplitude, you will play a pivotal role in shaping our technology and guiding our team to deliver exceptional software solutions. You will be responsible for overseeing multiple projects, ensuring technical excellence, and fostering a culture of innovation and collaboration.
Waymo is at the forefront of autonomous driving technology, committed to becoming the world's most trusted driver. Born from the Google Self-Driving Car Project in 2009, Waymo has tirelessly worked on the Waymo Driver (The World’s Most Experienced Driver™) to enhance mobility accessibility and save lives lost in traffic accidents. Our technology not only powers Waymo's fully autonomous ride-hailing service but is also adaptable across various vehicle platforms and applications. With over ten million rides completed and more than 100 million miles driven autonomously on public roads, our system operates across 15+ U.S. states.
As part of Waymo's Product Management Team, you will engage in pioneering initiatives to bring our groundbreaking autonomous driving technology to market. Our team excels in crafting straightforward solutions for intricate challenges by coordinating cross-functional efforts to advance our technology and associated products. We prioritize understanding customer needs, business objectives, and technological capabilities. We approach our tasks with humility, foster collaboration in problem-solving, and are driven by a bold vision for the future.
In this hybrid role, you will report to a Director of Product Management.
Your Responsibilities Include:
Collaborating with a talented team of engineers, product managers, data scientists, and operations personnel to ensure safe fleet monitoring and passenger safety at scale.
Crafting a product roadmap that anticipates customer and business demands while responding to urgent field events.
Fostering effective partnerships with operations, systems engineering, systems safety, and product data science to create safe user experiences.
Aligning the product roadmap with long-term commercial objectives, ensuring scalability, quality maintenance, and achievement of cost targets.
Join Condor Software as a Full-Stack Platform Engineer
At Condor, we are revolutionizing the financial infrastructure that supports clinical development. With billions invested annually in discovering and developing new therapies, we strive to connect clinical operations and finance into a cohesive system. By integrating real-time financial intelligence, we empower R&D and finance leaders with the tools they need to make informed, high-stakes decisions.
We are an AI-driven, pharma-native infrastructure provider, scaling industry standards in collaboration with top-tier partners. Our platform facilitates prediction, control, and execution in the most complex R&D environments worldwide.
The Importance of Your Role
Having established ourselves as a trusted partner for enterprise teams, we are now focused on the challenging task of scaling our platform to meet increasing demands. As a rapidly growing company, backed by prominent investors like Felicis and 645 Ventures, this is a unique opportunity to contribute to the foundational infrastructure that will redefine how therapies reach patients.
Your Responsibilities
As a Full-Stack Platform Engineer, you will be pivotal in building and scaling the core platform that supports the financial intelligence infrastructure relied upon by leading biopharma companies. This role encompasses critical engineering tasks at the intersection of backend systems, cloud infrastructure, and intelligent automation, with a strong emphasis on reliability and scalability.
Your primary focus will be on backend architecture, where you'll design and implement services that drive complex financial and operational workflows. You'll be instrumental in shaping data flow, workflow orchestration, and enabling emerging AI-driven capabilities. This role goes beyond simple integration; you'll be crafting robust primitives that support other teams as our product and customer base expand.
Working as a core member of a cross-functional product team, you will closely collaborate with product managers, designers, quality engineers, and data specialists to transition features from concept to production. While backend expertise is crucial, you will also engage across the stack to ensure the platform's capabilities are effectively leveraged.
Join Waymo, a leader in autonomous vehicle technology, as a Fleet Campaigns Program Manager. In this role, you will oversee and optimize campaigns that enhance the efficiency and performance of our fleet operations. You will collaborate with cross-functional teams to ensure that our vehicles are deployed strategically while maintaining the highest standards of safety and customer satisfaction.
At Crusoe, we are on a mission to accelerate the availability of energy and intelligence. We are developing cutting-edge technology that empowers individuals to pursue ambitious AI projects without compromising on scale, speed, or sustainability.
Join us in leading the AI revolution through sustainable technology at Crusoe. Here, you will spearhead innovative initiatives, effect real change, and collaborate with a team that drives responsible and transformative cloud infrastructure.
About the Role:
As the Senior Staff Product Manager for Fleet Management, you will define and execute the product strategy for Crusoe’s fleet management solutions, ensuring the dependable and efficient operation of large-scale GPU and CPU infrastructures. You will be responsible for shaping how Crusoe provisions, monitors, maintains, and optimizes thousands of servers dedicated to AI training and inference workloads, converting operational needs into scalable product functionalities.
Your role encompasses the entire lifecycle of computing resources, from initial provisioning to continuous health monitoring, maintenance, and decommissioning. You will operate at the nexus of infrastructure operations, engineering, and customer experience, ensuring fleet systems enable high utilization, minimize downtime, and facilitate the rapid scaling of customer workloads.
As a Senior Staff Product Manager, you will lead pivotal cross-functional initiatives, shape technical and product strategies across various teams, and bear responsibility for the success of fleet management systems that have a direct influence on organizational goals. You will be recognized as a key leader and subject matter expert, driving outcomes for complex products with extensive strategic implications.
What You'll Be Working On:
Steer the vision and product strategy for fleet management, covering provisioning, lifecycle management, health monitoring, maintenance orchestration, and performance optimization.
Drive results throughout the entire product lifecycle, from discovery to launch and iterative enhancements for fleet management capabilities.
Establish product direction for systems managing large-scale GPU and CPU infrastructure, ensuring reliability, utilization, and operational effectiveness.
Lead cross-functional initiatives across engineering, infrastructure operations, networking, and customer success to deliver cohesive fleet management solutions.
Foster consensus with technical leads and senior management on architectural decisions and product roadmaps.
Figma, Inc.
Join Figma as a Software Engineering Manager specializing in Observability. In this pivotal role, you will lead a dynamic team of engineers in developing cutting-edge solutions that enhance visibility and performance across our platform. Your expertise will drive the design and implementation of observability tools that empower our engineering teams to optimize their workflows, ensuring the robustness and reliability of our applications.
At NerdWallet, we are committed to empowering individuals to make informed financial decisions. Our team comprises exceptional individuals who thrive in an inclusive, flexible, and candid environment. Whether you choose to work remotely or in the office, we prioritize your well-being, professional development, and the impact you can make. We believe that when one of us elevates our skills, the whole team benefits.
As part of NerdWallet’s Platform team, you will oversee the systems that serve as the backbone of our consumer experience. This includes management of our centralized product data platform, partner ingestion pipelines, publishing and click-tracking infrastructure, GraphQL gateway operations, and our high-traffic, headless WordPress CMS. These platforms deliver precise, compliant, and high-performance product and content experiences to millions of users on both web and mobile platforms. We are searching for a Senior Engineering Manager to lead this team in modernizing legacy services into scalable and reliable systems while advancing our vision of a decoupled, adaptable platform that facilitates quicker publishing, enhanced observability, and future growth.
In the role of Senior Engineering Manager for Platform Systems, you will guide and support a team of engineers in delivering high-quality, scalable, and secure software that aligns with NerdWallet’s product and business objectives. You will collaborate closely with Product Managers and other cross-functional partners to define the roadmap, prioritize tasks, and eliminate obstacles, while nurturing strong engineering practices and a culture of continuous improvement. Your responsibilities will include ensuring technical quality, team well-being, and daily operations, while mentoring engineers, making strategic technical decisions, and balancing immediate deliverables with long-term sustainability, compliance, and reliability.
This position reports to the Director of Engineering.
Opportunities for Impact:
Lead, mentor, and develop a high-performing engineering team responsible for NerdWallet’s platform systems, including the Content Platform, CMS, and Product Data Platform.
Collaborate with Product Managers and cross-functional teams to strategize, prioritize, and execute the product roadmap.
Champion consistent adherence to software development best practices, including code quality, testing, documentation, and operational excellence.
Influence and guide technical and architectural decisions to ensure solutions are scalable, secure, reliable, and compliant with regulatory standards.
Balance immediate project needs with long-term project vision and maintainability.
Gridware
About Gridware
Gridware is an innovative technology firm based in San Francisco, committed to safeguarding and optimizing the electrical grid. We have pioneered a revolutionary grid management approach known as Active Grid Response (AGR), which emphasizes the monitoring of electrical, physical, and environmental factors that influence grid reliability and safety. Our cutting-edge AGR platform leverages high-precision sensors to identify potential issues early, facilitating proactive maintenance and fault prevention. This holistic strategy aids in enhancing safety, minimizing outages, and ensuring the grid operates with maximum efficiency. Gridware is supported by prominent climate-tech and Silicon Valley investors. For further details, please visit www.Gridware.io.
Role Overview
We are looking for a talented Staff Software Engineer to act as a pivotal technical force within our team, enhancing the overall software engineering capabilities through architectural innovation, mentorship, and fostering a culture of excellence. In this role, you will design and develop the essential software systems that drive Gridware's platform. This encompasses everything from backend services that oversee our distributed network of devices to the front-end interfaces that visualize grid health, fleet diagnostics, and real-time field events.
Your responsibilities will span the entire technology stack, building and scaling systems that integrate hardware, firmware, and cloud infrastructure to enable dependable communication, fleet visibility, and expedited decision-making. This position offers significant ownership and impact, allowing you to influence how our technology supports and protects critical infrastructure at scale.

