Senior Staff Technical Program Manager - Reliability
Databricks
Bellevue, Washington; Seattle, Washington
On-site Full-time
Experience Level
Senior
Qualifications
Proven experience in technical program management, particularly in reliability-focused projects. Strong understanding of engineering principles and software development lifecycle. Exceptional problem-solving skills and ability to lead cross-functional teams. Excellent communication skills, both verbal and written. Experience with Agile methodologies and project management tools.
About the job
Join Databricks as a Senior Staff Technical Program Manager specializing in Reliability. In this pivotal role, you will lead initiatives that enhance system reliability, ensure seamless operations, and drive innovation within our engineering teams. Your expertise will be critical in shaping our technical roadmap and delivering high-quality solutions that meet our customer needs.
About Databricks
Databricks is a leading cloud-based data platform that empowers organizations to make data-driven decisions. We foster innovation and collaboration within our teams, providing a dynamic environment where you can thrive. Join us to be part of a forward-thinking company that values reliability and excellence.
Similar jobs
Discover Okta
Okta is recognized as The World’s Identity Company, empowering individuals to securely utilize technology, no matter the device or application. Our versatile and neutral solutions, including the Okta Platform and Auth0 Platform, ensure secure access, robust authentication, and streamlined automation, positioning identity at the heart of business security and advancement.
At Okta, we value diverse perspectives and experiences. We don’t seek someone who ticks every box; instead, we welcome lifelong learners who can enhance our team with their unique backgrounds.
Join us in creating a world where identity truly belongs to you.
The Infrastructure Platform and Shared Services Team
Okta manages the authentication, authorization, and provisioning for millions of users every day. Our services are hosted on Amazon Web Services (AWS), spanning multiple availability zones and geographically diverse regions, designed for high throughput and 99.999% availability. We are searching for a technical leader to help us scale our service with exceptional talent and reliable, cost-effective, and efficient infrastructure, processes, and tools.
As the Senior Manager of Infrastructure Platform and Shared Services, you will lead multiple teams focused on Edge networking, Kubernetes (K8s) platforms, Continuous Integration/Continuous Deployment (CI/CD), observability, automation platforms, and tooling.
Your Responsibilities
Direct the Infrastructure platform and shared services organization, driving initiatives across the SRE and Infrastructure teams.
Steer the DevOps transformation, microservices journey, and next-generation infrastructure platform capabilities in collaboration with architects and product engineering.
Create a world-class observability platform with advanced monitoring capabilities that enable self-service.
Enhance SRE and product engineering velocity by developing robust platforms, powerful tools, and user-friendly self-service capabilities.
Oversee the design and operation of scalable, self-service cloud infrastructure platforms (e.g., Kubernetes, service mesh, CI/CD pipelines, Infrastructure as Code (IaC), and Edge Infrastructure).
Lead, mentor, and nurture a high-performing team of engineers and managers across platform, infrastructure, and shared services domains.
Conduct engineering design evaluations and ensure project completion within resource, budget, and scheduling constraints.
Are you prepared to transform the advertising landscape? At Cognitiv, we are not merely another AdTech firm—we are pioneers reshaping media buying with our advanced Deep Learning Advertising Platform. Since our inception in 2015, we have been leveraging state-of-the-art deep learning technologies and data science to redefine how brands engage with their audiences. Our mission is clear: to infuse intelligence into advertising, delivering unmatched precision, relevance, and impact at scale. Our innovative platform provides advertisers with unparalleled flexibility—whether activating Dynamic Deals through their preferred DSP, utilizing our managed service DSP, or tapping into our groundbreaking ContextGPT product. Joining Cognitiv means being at the forefront of AI-driven advertising solutions, leading change, and achieving remarkable growth in a fast-paced industry. We are currently expanding!
The Role
We are seeking a Senior Site Reliability Engineer to enhance our global network of datacenters and elevate service management across Cognitiv. Your primary focus will be on rapidly expanding our hybrid cloud infrastructure. As a growing organization, we strive to adhere to industry best practices. This position requires an experienced engineer who is eager to learn our environment quickly and help shape our long-term service management strategy.
This role will be based in our Bellevue, WA office with a hybrid work schedule of 3 days in-office (Monday/Tuesday/Wednesday) and 2 days remote (Thursday/Friday).
Responsibilities
Design, implement, and maintain infrastructure across a widening footprint of co-located deployments.
Assess existing physical and network architectures to ensure long-term scalability and growth.
Collaborate with engineering and product teams to accurately scope projects based on core business requirements.
Lead company-wide initiatives to enhance service management surrounding deployments, monitoring, and disaster recovery.
Oversee and maintain shared infrastructure within our AWS environment.
Requirements
Understanding of contemporary datacenter practices with experience in configuring multi-datacenter deployments.
Extensive knowledge of AWS infrastructure, networking, and management practices.
Demonstrated experience with infrastructure as code and related tools.
Join CoreWeave as a Senior Site Reliability Engineer specializing in Data Infrastructure. In this pivotal role, you will ensure the reliability and sustainability of our data systems, working closely with our development teams to optimize performance and availability. You will be instrumental in enhancing our infrastructure to support the growing needs of our clients.
Full-time|$194K/yr - $267K/yr|On-site|Bellevue, Washington; Chicago, Illinois; New York, New York; Washington, DC
Empower Every Identity, from AI to Human
At Okta, we believe that identity is the cornerstone of unlocking the potential of AI. By building a trusted and neutral infrastructure, we enable organizations to confidently navigate this new era. This mission demands individuals who are relentless problem solvers, tackling complex issues with real-world significance. We seek builders and owners who act with urgency and execute with excellence.
This is your chance to engage in career-defining work. If you share our commitment to this mission, let’s connect.
Join the Workforce Identity Cloud Team
The Okta Workforce Identity Cloud (WIC) facilitates secure, seamless access for your workforce, allowing you to prioritize strategic initiatives like cost reduction and enhanced customer service.
If you thrive on challenges and are passionate about addressing large-scale automation, testing, and tuning issues, we would love to hear from you. The ideal candidate embodies the principle: "If you must do something more than once, automate it" and possesses a strong ability to quickly learn new tools and concepts.
Position Overview:
The Site Reliability Engineer (SRE) will be pivotal in designing and managing Kubernetes platforms that support cloud-native applications and services. This role emphasizes architecting and overseeing reliable, scalable, and secure Kubernetes-based environments on AWS, ensuring optimal performance and high availability while managing costs and automation. The perfect candidate will have hands-on experience with AWS infrastructure, Kubernetes platform development, Helm charts, Karpenter for scaling, and Istio service mesh.
Key Responsibilities:
Kubernetes Platform Development: Design, implement, and maintain Kubernetes platforms that are highly available, scalable, and fault-tolerant, ensuring they are optimized for production workloads.
AWS Infrastructure Management: Build, manage, and optimize AWS cloud infrastructure, including EKS, ECS, S3, VPCs, RDS, IAM, and more, while implementing best practices for cost management and security.
Helm Management: Use Helm to automate and streamline application and service deployment to Kubernetes clusters, creating and maintaining Helm charts for production-ready deployments.
Karpenter Implementation: Implement and manage Karpenter for dynamic scaling of Kubernetes clusters to meet workload demands.
Istio Service Mesh Management: Configure and manage Istio to facilitate service-to-service communication and security.
Full-time|$147K/yr - $202K/yr|On-site|Bellevue, Washington
About Okta
Okta stands as the leader in identity solutions, empowering individuals to securely engage with any technology, on any device, and through any application. Our versatile products, including the Okta Platform and Auth0 Platform, ensure safe access and authentication, placing identity at the forefront of security and business growth.
At Okta, we embrace diverse perspectives and experiences. We are not searching for someone who checks all the boxes; rather, we value lifelong learners who can enrich our team with their unique backgrounds.
Join us in crafting a future where identity is truly yours.
Position Overview:
We are looking for a highly skilled Senior Observability Site Reliability Engineer with a focus on Splunk to take ownership and enhance our Splunk ecosystem. In this role, you will go beyond traditional monitoring, creating a comprehensive and scalable Observability Platform that empowers our SRE teams and business stakeholders. You will treat infrastructure as code, leveraging Terraform alongside proficient coding skills in Go, Python, or Ruby to automate deployment across complex distributed systems.
Key Responsibilities
Automated Infrastructure: Design, build, and maintain scalable observability infrastructure utilizing tools like Terraform.
Splunk Engineering: Enhance the collection, processing, and storage of log data to ensure our Splunk services are highly reliable and low-latency.
Incident Response: Engage in on-call rotations and lead post-incident reviews to drive systemic improvements and promote 'observability-driven development.'
Automation: Minimize 'toil' by automating the deployment and scaling of observability agents and collectors.
Full-time|$147K/yr - $202.4K/yr|On-site|Bellevue, Washington
Discover Okta
At Okta, we are redefining the identity landscape. As the World’s Identity Company, we empower individuals to securely access any technology, from any device or application, anywhere in the world. Our versatile products, including the Okta Platform and Auth0 Platform, focus on providing secure access, authentication, and automation—making identity central to business security and growth.
We value diverse perspectives and experiences and believe that innovation comes from a team of lifelong learners. Join us in our mission to create a world where identity is truly yours.
Senior Site Reliability Engineer (SRE) - Security and Data Systems
We are on the lookout for an experienced Senior Site Reliability Engineer to join our dynamic team. As a leading SaaS company focused on securing extensive systems, this role merges software engineering with systems administration. You will be instrumental in developing and sustaining a highly reliable, scalable, and secure infrastructure. Your expertise will be vital in automating manual processes, proactively addressing complex challenges before they escalate into incidents, and responding to critical incidents, including participating in on-call shifts.
Full-time|On-site|Bellevue, Washington, USA; San Jose, California, USA
Join Zscaler as a Staff Site Reliability Engineer focused on Federal missions. In this role, you will leverage your expertise in reliability engineering to enhance our cloud-based security platform while collaborating with cross-functional teams to optimize performance and scalability. Your contributions will be crucial in ensuring seamless, secure, and high-availability services for our government clients.
Join Armada as a Senior Platform/DevOps Engineer, where you will play a pivotal role in enhancing our cloud-native infrastructure using Kubernetes and Linux. This position involves designing, implementing, and maintaining scalable systems that support our dynamic environment. If you are passionate about automation, container orchestration, and CI/CD practices, we want to hear from you!
Join Armada as a Senior Security Engineer specializing in Infrastructure. In this role, you will play a crucial part in safeguarding our systems and ensuring robust security protocols are in place across our infrastructure. Your expertise will guide our security strategies, working collaboratively with cross-functional teams to enhance our security posture and protect sensitive data.
Full-time|$276K/yr - $379.5K/yr|On-site|Bellevue, Washington; Chicago, Illinois; San Francisco, California; Washington, DC
Okta is an independent identity provider focused on building secure, trusted infrastructure for both AI and human users. The Technology, Data, and Intelligence (TDI) team supports Okta’s global workforce by providing the technology and systems employees need to succeed.
Role overview
The Senior Director of Data Platform and Engineering leads a global group of data and analytics engineers. This leader maximizes the value of Okta’s data assets and reports to the VP of Data and Insights. The position requires a balance of deep technical expertise and the ability to engage in strategic business discussions. As a player-coach, the Senior Director builds and mentors the team, guides key initiatives, and maintains a strong data foundation. This role is central to Okta’s AI strategy. As AI becomes more integrated into Okta’s products and operations, the quality and governance of data are increasingly important. The Senior Director ensures that clean, trusted, and well-managed data supports all AI projects, shaping the platform for Okta’s ongoing growth.
What you will do
Lead and develop a high-performing team: Mentor, grow, and support a diverse group of data and analytics engineers. Foster a culture of excellence, collaboration, and continuous learning.
Advance AI enablement: Work closely with AI Engineering teams to ensure data infrastructure and practices are ready for AI development and deployment. Define data governance standards, build quality training datasets, and develop scalable data pipelines for AI and machine learning models.
Location
Bellevue, Washington; Chicago, Illinois; San Francisco, California; Washington, DC
About the Company
Armada is a pioneering startup in edge computing, dedicated to delivering cutting-edge computing infrastructure to underserved areas where connectivity and cloud resources are scarce. Our mission is to bridge the digital divide by providing robust technology solutions that can be swiftly deployed anywhere, enabling real-time analytics and AI capabilities at the edge.
About the Role
We are on the lookout for a skilled and detail-oriented Lead Platform Engineer to join our dynamic Edge team. In this vital position, you will utilize your extensive expertise in cloud infrastructure and Kubernetes, while fostering a culture of mentorship, collaboration, and transparent communication.
You'll take charge of the architecture, design, automation, optimization, and operation of our Kubernetes-centric platform, supporting our Galleon mobile data centers and cloud integration. Your role will involve developing and managing resilient, secure, and scalable Kubernetes environments across diverse edge locations and cloud infrastructure, ensuring the dependability of our distributed computing platform.
Join Robinhood as a Senior Software Engineer for our Storage Platform team, where you will play a pivotal role in building scalable and efficient storage solutions. You will collaborate with cross-functional teams to design and implement systems that enhance our ability to serve our users effectively. This position offers a unique opportunity to influence the architecture and evolution of our storage technologies.
About Armada
Armada is an innovative edge computing startup focused on providing cutting-edge computing infrastructure to remote regions where cloud connectivity is limited. Our mission is to bridge the digital divide by deploying advanced technology solutions that enable real-time analytics and AI processing at the edge, making a significant impact in diverse environments.
The Mission
At Armada, we are pioneering the infrastructure for the next industrial revolution. Our Galleon data centers have successfully delivered AI-grade computing in some of the world’s most challenging environments—from remote mining sites to offshore platforms. Now, we are taking our ambitions further as we develop the Orbital Galleon: a modular, high-performance orbital data center intended for processing vast data loads in Low Earth Orbit (LEO) and beyond.
We are searching for a Senior Hardware Engineer with over ten years of experience in server-level hardware and data center components to spearhead the design and component strategy of our orbital compute nodes.
About the Role
This senior-level position centers on the physical architecture of our space-based servers. You will not merely purchase servers; you will engineer the fundamental components of our orbital assets. Your responsibilities will include selecting silicon, designing custom PCBAs for power delivery, and collaborating with global component manufacturers to adapt terrestrial data center hardware for the unique challenges of space.
You will effectively bridge the divide between high-performance Commercial Off-The-Shelf (COTS) hardware and the exceptional reliability required for an environment that cannot be serviced in orbit.
Location: Bellevue, WA
Key Responsibilities
Develop and execute component strategies for orbital computing hardware.
Design and validate custom PCBAs for enhanced power delivery systems.
Engage with global suppliers and manufacturers to ensure hardware meets space-grade standards.
Innovate solutions for integrating terrestrial technology into orbital infrastructures.
CoreWeave is looking for a Senior Manager, Data Infrastructure Services to guide the development and operation of its data systems. This position is based in Sunnyvale, CA or Bellevue, WA.
Role overview
This role centers on leading data initiatives and shaping the company’s data infrastructure. The Senior Manager will focus on building and maintaining systems that support data accessibility and reliability across the organization.
What you will do
Oversee the design, implementation, and maintenance of CoreWeave’s data infrastructure
Ensure data systems align with organizational needs and growth
Drive improvements in data accessibility and usability
Who thrives here
This position suits a strategic thinker who enjoys building and enhancing data ecosystems. Leadership skills and a drive to support innovative data projects are important for success in this role.
Full-time|$165K/yr - $242K/yr|On-site|Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA/ San Francisco, CA
CoreWeave is seeking a Security Engineering Manager to lead the Platform Security team. This position is based in Livingston, NJ, New York, NY, Sunnyvale, CA, Bellevue, WA, or San Francisco, CA. The team’s mission is to embed security into CoreWeave’s Kubernetes-based platform and public cloud environments, supporting high-performance infrastructure for AI and machine learning workloads.
Role overview
This manager will oversee and expand the Platform Security engineering team, reporting to the Senior Director of Security Foundations. The focus is on hands-on leadership and technical execution, with an emphasis on building and implementing security controls rather than policy development. The role requires close collaboration with Infrastructure, Platform Engineering, Site Reliability Engineering, and other security teams to ensure security measures keep pace with business growth and evolving needs.
What you will do
Lead and grow the Platform Security engineering team.
Integrate security into Kubernetes infrastructure and public cloud platforms such as AWS, GCP, and Azure.
Define and execute strategies for cloud security posture, workload isolation, platform guardrails, image integrity, and multi-cloud security.
Develop and implement security controls across CoreWeave’s infrastructure.
Work closely with other technical teams to align platform security with business needs.
The Platform Security team
The Platform Security team at CoreWeave engineers systems that enforce security at the infrastructure layer. Their work spans both CoreWeave’s own Kubernetes-based platform and third-party public cloud environments. The team supports GPU-accelerated infrastructure for demanding AI and machine learning workloads, ensuring that both customer and internal services remain secure as CoreWeave’s global presence expands.
As a Senior Software Engineer specializing in Data Infrastructure Services at CoreWeave, you will play a pivotal role in designing and implementing scalable data solutions that power our innovative services. You will collaborate with cross-functional teams to enhance our data architecture and improve data accessibility for various applications.
About the Company
Armada is an innovative edge computing startup dedicated to providing advanced computing infrastructure to remote locations where connectivity and cloud services are scarce. Our mission is to eliminate the digital divide by deploying cutting-edge technology infrastructure that can be rapidly established in any environment, enabling real-time data processing and AI at the edge.
About the Role
We are looking for a dynamic Lead/Staff Software Engineer to join our Edge organization, where you will be instrumental in shaping the technical vision across our Platform, Security, and Networking domains. This leadership role is perfect for a forward-thinking engineer at the intersection of software, networking, and platform infrastructure.
You will define the architectural strategy for the core services that power the Armada Edge Platform, creating the integration framework that connects a hybrid fleet of Azure Local, OpenShift, and Kubernetes clusters globally. In addition, you will oversee the automation software strategy to provision and manage our physical network fabric, ensuring seamless architecture for bootstrapping firewalls and switches alongside our compute nodes.
Location: This role is office-based at our Bellevue, Washington office.
What You'll Do (Key Responsibilities)
Technical Strategy & Leadership: Define and drive long-term technical roadmaps for the Edge organization, ensuring architectural alignment across platform engineering, observability, networking, and security capabilities. Mentor a team of highly skilled engineers, fostering a culture of rigorous engineering excellence.
Advanced Edge Orchestration: Architect the next-generation control plane and robust infrastructure services that abstract platform complexity, creating a unified API for deploying to Azure Local, OpenShift, and standard Kubernetes clusters. Direct the engineering of complex, highly scalable Zero-Touch Provisioning (ZTP) workflows for bare-metal compute nodes.
Network Systems Architecture: Lead the design of software services in Golang and Python to automate the provisioning and lifecycle management of Juniper SRX firewalls and switches. Champion intent-based networking paradigms, building automated workflows for ZTP of network gear programmatically via Netconf/YANG or XML APIs.
Zero-Trust Security Vision: Pioneer the organization's approach to implementing zero-trust security frameworks.
Full-time|$109K/yr - $160K/yr|On-site|Livingston, NJ / New York, NY / Sunnyvale, CA / San Francisco, CA / Bellevue, WA
CoreWeave is The Essential Cloud for AI™, designed and built by pioneers for pioneers. We empower innovators to confidently build and scale AI through our advanced technology, tools, and expert teams. Trusted by top AI labs, startups, and global enterprises, CoreWeave combines exceptional infrastructure performance with profound technical expertise to drive innovation. Founded in 2017, we became a publicly traded company (Nasdaq: CRWV) in March 2025. Discover more at www.coreweave.com.
What You’ll Do
About the Team
The Enterprise Systems team at CoreWeave is tasked with constructing, maintaining, and scaling the internal platforms that facilitate collaboration and productivity across the organization. This encompasses tools such as Atlassian (Jira, Confluence) and Asana, supporting our engineering, product, and business teams, along with external partners. Our focus is on ensuring reliability, scalability, and the ongoing enhancement of internal tools to empower teams to operate efficiently and effectively.
About the Role
In the role of a Productivity Platforms Engineer, you will be instrumental in the daily administration and enhancement of CoreWeave’s collaboration and work management tools. Collaborating closely with seasoned engineers, you will maintain system reliability, troubleshoot issues, and implement improvements to optimize team workflows. This position involves hands-on configuration, user support, and gaining exposure to automation and integrations. Over time, you will assume responsibility for specific tools and workflows as you develop your technical expertise.
Full-time|On-site|Livingston, NJ / New York, NY / Sunnyvale, CA / San Francisco, CA / Bellevue, WA / Richmond, VA
The Site Selection Manager at CoreWeave identifies and secures locations for new data centers. This position shapes the company’s expansion across the United States by evaluating sites that align with operational requirements and growth plans.
Role overview
This role focuses on analyzing potential sites, factoring in market trends and strategic priorities. The Site Selection Manager works with cross-functional teams to gather insights and builds recommendations for senior leadership.
Collaboration
Success in this position depends on strong teamwork. The Site Selection Manager partners with colleagues from various departments to ensure each location supports CoreWeave’s business goals and operational standards.
Locations
This position is based in one of several locations: Livingston, NJ; New York, NY; Sunnyvale, CA; San Francisco, CA; Bellevue, WA; or Richmond, VA.
Apr 28, 2026