Experience Level
Senior
Qualifications
The ideal candidate will possess the following qualifications:
• Proven experience in Site Reliability Engineering or related fields.
• Strong understanding of cloud infrastructure, containerization, and orchestration tools.
• Excellent problem-solving skills and the ability to work under pressure.
• Ability to collaborate effectively with cross-functional teams.
About the job
Join our dynamic team at fal as a Senior/Staff Site Reliability Engineer. In this key role, you will leverage your expertise to enhance our systems' reliability and performance. If you are passionate about building scalable systems and enjoy working in a collaborative environment, we want to hear from you!
About fal
fal is at the forefront of innovation, providing cutting-edge solutions that empower businesses to thrive in the digital world. Our commitment to excellence and teamwork drives our success, and we pride ourselves on fostering a culture of creativity and professional growth.
Similar jobs
Site Reliability Engineer – OpenStack / Private Cloud Operations
dyneits is hiring a Site Reliability Engineer focused on OpenStack and private cloud operations. This remote role supports EST and North America time zones and is available as a full-time or long-term contract position.
Role overview
This position centers on maintaining production support, troubleshooting, and ensuring platform reliability for OpenStack-based private clouds. The engineer will work hands-on with Linux, networking, and storage systems. Collaboration with internal engineering teams and direct interaction with customers are key aspects of the job.
What you will do
• Diagnose and resolve complex issues in OpenStack and Linux environments.
• Support and manage OpenStack services, including Nova, Neutron, Cinder, and Keystone.
• Perform root cause analysis to implement long-term solutions.
• Participate in incident management and on-call rotations.
• Monitor system performance, availability, and reliability.
• Work with engineering teams to implement fixes and improvements.
• Communicate with customers through various channels.
• Carry out system optimization and performance tuning tasks.
Requirements
• Deep understanding of Linux internals and system performance.
• Experience with kernel tuning, troubleshooting, file systems, and disk management.
• Familiarity with partitions, LVM, SCSI multipath, and basic Ceph knowledge.
• Ability to troubleshoot IO and performance issues.
• Understanding of DHCP, DNS, VLANs, network bonding, and routing concepts.
• Hands-on experience with OpenStack services (Nova, Neutron, Cinder, Keystone).
• Strong troubleshooting and debugging skills, including root cause analysis.
• Experience supporting production environments and handling customer-facing technical issues.
Nice to have
• Basic knowledge of Kubernetes concepts.
• Familiarity with monitoring tools like Prometheus and Grafana.
• Understanding of metrics, logging, and alerting systems.
• Basic scripting skills in Python or Go.
• Experience with automation and observability practices.
Soft skills
• Strong problem-solving and analytical thinking.
• Ability to perform in high-pressure production settings.
• Clear and effective communicator.
• Proactive approach to preventing issues.
• Comfortable working in remote, distributed teams.
As a Cloud Site Reliability Engineer, you will be responsible for deploying innovative solutions within the public cloud environment, specifically utilizing AWS services. You will create and manage configuration templates designed for scalable infrastructure, including AWS components like EFS, EC2, and RDS. Collaborating closely with the Scrum Master, you will ensure the project requirements are met within an agile development setting.
Key Responsibilities:
• Contribute to architectural design to enhance system consistency, security, maintainability, and flexibility.
• Assist architects in creating highly scalable and automated deployments for diverse applications.
• Develop configuration templates using established architectural blueprints.
• Ensure the development of robust and scalable services across public cloud platforms, including AWS and GCP.
• Monitor and assess system performance to ensure optimal operation.
Join our innovative team at StemXpert1 as an OpenStack Cloud Automation Engineer. In this role, you will be instrumental in enhancing our cloud infrastructure by automating processes and ensuring optimal performance. Your expertise will help drive our cloud initiatives and improve efficiency across our operations.
As an OpenStack Cloud Automation Engineer, you will collaborate with cross-functional teams to design and implement automation solutions, streamline workflows, and manage cloud resources effectively.
Job Summary
Join our innovative team at Megazone as a Senior Cloud Engineer specializing in OpenStack! We are searching for a passionate and highly skilled individual with extensive expertise in architecting, deploying, and managing expansive private cloud infrastructures. The successful candidate will possess in-depth knowledge of all OpenStack modules, exceptional networking and security capabilities, and hands-on experience in migrating virtual machines from VMware ESXi to OpenStack. As a vital contributor to our cloud strategies, you will champion innovation while ensuring the reliability, scalability, and security of our cloud infrastructure.
Key Responsibilities
• Design, implement, and oversee a highly scalable OpenStack private cloud infrastructure.
• Lead the migration process of existing VMware ESXi virtual machines to the OpenStack platform, ensuring a seamless transition with minimal downtime.
• Administer and manage all facets of the OpenStack ecosystem, including Nova, Neutron, Cinder, Keystone, Glance, and others.
• Implement and manage software-defined networking (SDN) and network function virtualization (NFV) solutions within the OpenStack environment.
• Ensure security and compliance of the cloud infrastructure by establishing and enforcing security best practices, policies, and procedures.
• Automate cloud infrastructure provisioning, configuration, and management using tools such as Ansible, Puppet, or Chef.
• Monitor the health, performance, and capacity of the OpenStack cloud, proactively addressing any issues.
• Collaborate with development and operations teams to facilitate the deployment of applications and services on the OpenStack cloud.
• Provide technical leadership and mentorship to junior team members.
• Stay updated with the latest OpenStack advancements and industry best practices, making recommendations for continuous improvement.
Role Overview
Redis is looking for a Site Reliability Engineer based in the United States. This role focuses on keeping cloud infrastructure reliable, available, and high-performing. The position involves close collaboration with teams across the company to design scalable systems and address operational challenges.
What You Will Do
• Work with engineers and other stakeholders to build and maintain scalable cloud systems
• Troubleshoot infrastructure issues to minimize downtime and service interruptions
• Develop and implement automation that improves operational efficiency
As a Site Reliability Engineer at dev2, you will play a crucial role in ensuring the reliability and performance of our services. You will work closely with development and operations teams to build and maintain scalable systems, troubleshoot issues, and implement best practices in reliability engineering. Your expertise will help us deliver exceptional service and maintain our commitment to quality.
About Us
At Vultr, we are dedicated to revolutionizing cloud infrastructure by making it accessible, efficient, and cost-effective for enterprises and AI innovators globally. With 32 strategically located data centers worldwide, we proudly serve hundreds of thousands of customers across 185 countries, offering dynamic solutions such as Cloud Compute, Cloud GPU, Bare Metal, and Cloud Storage. As of December 2024, Vultr achieved a remarkable $3.5 billion valuation through equity financing, solidifying our position as the largest privately-held cloud infrastructure provider.
Our Commitment to Employees
• Comprehensive medical benefits with 100% company-paid premiums for employee-only plans, including dental and vision coverage.
• A robust 401(k) plan with 100% matching up to 4%, featuring immediate vesting.
• Annual professional development reimbursement of $2,500.
• Generous leave policy including 11 holidays, paid time off accrual, and rollover options.
• Increased PTO at 3-year and 10-year anniversaries, a one-month paid sabbatical every five years, and annual anniversary bonuses.
• $500 for remote office setup in the first year and $400 each subsequent year for new equipment.
• Internet reimbursement of up to $75 per month.
• Gym membership reimbursement up to $50 per month.
• Company-paid subscription to Wellable for wellness initiatives.
Join Our Team
Vultr is seeking a Senior Site Reliability Engineer in our Core Cloud Engineering team, reporting directly to the Director of Core Cloud Engineering. This position requires extensive knowledge in large-scale distributed systems, infrastructure automation, and hypervisor platform operations. The ideal candidate will excel in systems engineering with an emphasis on reliability, scalability, and observability to ensure our cloud services deliver optimal performance and resilience for our 1.5 million users.
Key Responsibilities
• Production Control Plane Operations: Manage and scale Vultr’s control plane, ensuring consistent availability, accuracy, and performance across our global data centers.
• Hypervisor & Infrastructure Reliability: Develop, implement, and sustain automation processes for managing hypervisor fleets (KVM, QEMU, libvirt) and their supporting infrastructure.
Full-time|$134.3K/yr - $214.8K/yr|Hybrid|Boston, Massachusetts, United States
Become a Force for Good at Axon.
At Axon, we are dedicated to our mission of protecting life. We tackle society's most pressing safety and justice challenges through our innovative ecosystem of devices and cloud software. Collaboration is at the heart of what we do; we connect with transparency and empathy, valuing diverse perspectives from our customers, communities, and team members.
Life at Axon is dynamic, challenging, and impactful. Here, you will take initiative and make a real difference. Continuously evolve as you contribute to a mission that matters at a company where your contributions are valued.
Your Impact
As a Senior Site Reliability Engineer within the APX SRE CloudOps team, you will architect and build the cloud infrastructure and automation platforms critical to Axon's product engineering teams. You will design solutions for multi-cloud environments (Azure, AWS), ensure FedRAMP compliance, and oversee large-scale Kubernetes platforms managing production workloads across various regions. A significant aspect of your role will involve coding: developing services, APIs, and internal tools using languages like Go and Python. Additionally, you will participate in on-call rotations and incident response, leveraging operational insights to enhance reliability and guide platform investments. This position merges software engineering expertise with cloud architecture at scale and production ownership.
Location: This role is based in our Atlanta, Seattle, or Boston office and operates on a hybrid schedule. We prioritize in-person collaboration, requiring team members to work on-site from Tuesday to Friday, with the option to work remotely on Mondays, unless a workplace accommodation is approved. We believe that connection fosters innovation, and our in-office culture is designed to promote meaningful teamwork, mentorship, and shared success.
Full-time|$172K/yr - $215K/yr|On-site|North America
Corelight’s mission is to help organizations strengthen cybersecurity by turning network data into actionable intelligence. The company’s products are built on open-source technologies like Zeek, Suricata, and YARA, supporting faster incident response and proactive threat detection for customers with demanding security needs.
Role overview
This Lead Cloud Infrastructure Engineer / Site Reliability Engineer (SRE) position focuses on ensuring the reliability, performance, and security of Corelight’s cloud platform for the Federal region. The role covers day-to-day infrastructure operations, with a strong emphasis on system availability, latency management, performance tuning, monitoring, incident response, and capacity planning. Maintaining a FedRAMP-compliant environment is a core responsibility, along with collaborating across teams to meet strict security and compliance requirements.
What you will do
• Operate and manage cloud infrastructure, prioritizing high availability and low latency
• Monitor and optimize system performance for reliability
• Handle incident response and plan for future capacity
• Maintain adherence to FedRAMP compliance standards
• Collaborate with cross-functional teams to uphold security and compliance
• Implement automation and 'everything as code' practices to build scalable infrastructure
• Support and maintain core services that securely process large data volumes
Key technologies
• Zeek
• Suricata
• YARA
Requirements
• U.S. citizenship required
• Some work may need to be performed by U.S. citizens on U.S. soil
Location
This is a remote role based in North America.
At HomeVision, we are pioneering innovations in real estate valuation to foster a more efficient, transparent, and equitable housing market. By harnessing advanced technologies such as Natural Language Processing (NLP), computer vision, and large language models (LLMs), we are transforming the appraisal process, enabling appraisers to enhance their productivity. Backed by Initialized Capital, we are experiencing rapid growth and are on the lookout for a dynamic Site Reliability Engineer (SRE) to aid in our scaling efforts.
Key Responsibilities
• Design and manage the infrastructure supporting our SaaS offerings, predominantly utilizing AWS.
• Develop tools and oversee platform components to assist our development teams.
• Engage in software development initiatives, focusing on areas such as authentication, reliability, and observability.
• Support daily operations, including setting up testing environments and overseeing deployments.
• Address IT-related tasks like user onboarding and account management.
• Maintain a flexible work schedule while ensuring availability until 6 PM Pacific Time for internal support and monitoring.
Qualifications
• Minimum of 2 years of experience in Site Reliability Engineering or cloud operations, with AWS experience preferred.
• At least 1 year of software development experience.
• A data-driven mindset.
• A readiness to work across cloud infrastructure and IT as required.
• Meticulous attention to detail and a commitment to creating high-quality systems.
Eligibility Criteria
• Candidates must reside in the US or Puerto Rico.
• Currently, we are unable to sponsor work visas; thus, candidates must be authorized to work in the US without sponsorship.
Preferred Qualifications
• Familiarity with Terraform or other Infrastructure as Code (IaC) tools.
• Interest and experience in database administration.
• Candidates located in Seattle or San Francisco will receive additional consideration.
Our Offerings
• Competitive salary, equity, and comprehensive health benefits.
• Significant ownership and autonomy in your role.
• Support for your professional development and growth.
• A fully remote and flexible work environment.
We request that no recruiters or automated submissions apply.
Greetings, Associates!
We hope this message finds you well! We have an exciting opportunity for a highly skilled OpenStack Engineer to join our Cloud Services Team in Philadelphia. This role is essential for collaborating with vendors and the engineering team to design, implement, and deploy cutting-edge systems and software.
As part of the Site Reliability Engineering (SRE) team at Comcast, you will tackle operational challenges as software problems to ensure maximum availability, performance, and capacity utilization of our robust OpenStack-based cloud. Our mission is to enhance operational awareness for our clients while utilizing automation to deliver business value more rapidly.
In this role, you will be responsible for providing L3 production support, troubleshooting issues that L1/L2 teams cannot resolve, and driving continuous improvements within our cloud infrastructure. The position involves supporting data centers across the U.S. that serve Comcast customers, ensuring reliability for platforms such as X1 and Xfinity Home.
If you are passionate about creating highly available, secure, and scalable cloud platforms, we want to hear from you!
Join Zilliz, a pioneering startup at the forefront of developing cutting-edge vector database solutions designed for enterprise-grade AI applications. Founded by the visionary engineers behind Milvus, the leading open-source vector database, we are on a mission to revolutionize data management for AI applications, making vector databases accessible to every organization. At Zilliz, you will play a crucial role in shaping the future of AI.
Position: Senior Site Reliability Engineer
Location: Seattle, WA
Duration: 12 months
Interview: In-person for local candidates or via Phone + Skype
As a Senior Site Reliability Engineer, you will play a pivotal role in the ongoing maintenance and administration of enterprise-level internet systems. Your primary responsibility will be to diagnose and resolve operational issues, ensuring the seamless functioning of our infrastructure. You will also be tasked with developing tools and scripts to enhance these processes.
Collaboration with various teams will be essential to document our enterprise infrastructure and monitoring systems effectively. Additionally, you'll oversee the planning and execution of projects ranging from small to large scale within our Technology teams, reporting directly to your manager. This role demands a high level of technical expertise in both traditional enterprise systems and cutting-edge cloud-native applications.
If you share our belief that a simple cup of coffee can transform lives and enhance experiences, we invite you to join us in delivering exceptional services to customers worldwide.
Full-time|$196K/yr - $245K/yr|Hybrid|Oakland, California, United States, AMER
Fivetran helps organizations move data smoothly into their data warehouses, making information ready for analysis without extra engineering or ongoing upkeep. Every day, more companies use Fivetran to support a data-driven approach to their work.
About the Senior Site Reliability Engineer Role
Fivetran builds and maintains data pipelines that support the modern data stack for thousands of businesses. The Site Reliability Engineering (SRE) team ensures that this infrastructure remains stable, scalable, and dependable as our platform grows.
What You Will Do
• Work closely with engineering, product management, support, and sales engineering teams to improve the reliability of the Fivetran Data Platform
• Take responsibility for the performance and reliability of Fivetran’s infrastructure
• Strengthen deployment pipelines to support continuous delivery and safe releases
• Lead and participate in incident response, ensuring issues are resolved quickly and lessons are applied
• Help maintain the stability and growth of our infrastructure as usage expands
Work Location and Schedule
This is a full-time, hybrid role based in Oakland, California. The team meets in the office two days each week, with the rest of the time offering remote flexibility. In-person days focus on collaboration and team development.
About the Role
Join Hopper's dynamic Cloud FinOps team as a Senior Site Reliability Engineer. We oversee an extensive infrastructure within Google Cloud, empowering hundreds of engineers to deliver exceptional experiences to millions of users globally.
If you are enthusiastic about automation and optimizing systems for performance and reliability, we want to hear from you.
You will focus on building scalable, secure, and optimized infrastructure while solving practical problems with straightforward, cost-effective solutions.
Daily Responsibilities
• Engage in projects that enhance cost efficiency, such as:
  • Minimizing network egress costs by eliminating unnecessary headers.
  • Optimizing data storage solutions based on usage patterns, such as implementing cold storage for infrequently accessed data.
  • Ensuring optimal autoscaling configurations for databases and compute resources.
• Enhance current cost attribution processes to provide transparency for all teams regarding their expenditures.
• Participate in incident support, including on-call rotation for platform incidents, collaborating with teams across the Americas and Europe to ensure continuous support.
• Contribute to a small but highly efficient team of SREs.
Are you an innovative Site Reliability Engineer eager to join a collaborative team focused on customer satisfaction and excellence? At Ivanti, we are passionately committed to working together and making a significant impact for our customers and each other. Elevate your career by helping us deliver cutting-edge solutions in a dynamic and empowering environment.
Why This Role Matters:
The Site Reliability Engineer position merges infrastructure, networking, operating systems, automation, development, and application administration. This hands-on technical role thrives in a fast-paced atmosphere. The ideal candidate will have substantial experience managing cloud-based SaaS applications, with a focus on resolving traditional operational challenges through automation and software. A high standard of excellence, customer-centric mindset, and the ability to conduct in-depth technical analysis of code, app servers, databases, load balancers, operating systems, and networks are essential.
Our Site Reliability Engineering (SRE) team is expanding and collaborates closely with Product Engineering, Security, and Support. We are responsible for the reliability, deployment, and ongoing operation of Ivanti Cloud services. Your contributions will help elevate our existing platform through observability, release automation, chaos engineering, and more.
About the Role
Hopper is seeking a skilled Senior Site Reliability Engineer to join our innovative Cloud FinOps team. Our team manages extensive infrastructure within Google Cloud that supports hundreds of engineers, delivering exceptional experiences to millions of users globally.
Do you have a passion for automation and a commitment to optimizing systems? We want someone who strives to ensure our infrastructure is scalable, reliable, secure, and efficient.
You will tackle practical challenges, creating straightforward and dependable solutions that are cost-effective and user-friendly.
Daily Responsibilities
• Lead projects aimed at enhancing cost efficiencies, such as:
  • Minimizing network egress costs by eliminating unnecessary headers.
  • Optimizing warehouse data utilization and selecting the most suitable storage solutions, like cold storage for seldom-accessed buckets.
  • Ensuring efficient autoscaling for databases and computational resources.
• Enhance cost attribution methods to provide transparent visibility of expenses across all teams.
• Participate in incident support and be a part of the on-call rotation for platform incidents, collaborating with teams across North America and Europe (ensuring you'll get your rest!). You will address challenges engineers face with our infrastructure and review pull requests that require platform oversight.
• Join a compact, high-performing team of Site Reliability Engineers.
Qualifications
• Extensive experience in Site Reliability Engineering, DevOps, Software Engineering, or Systems Engineering.
• Strong troubleshooting abilities.
• Proficient in system design with robust analytical skills.
• Excellent communication skills.
• Familiarity with major cloud providers, particularly Google Cloud.
• Proficient in SQL.
• Experience with containers, Kubernetes, and tools like Kustomize and Helm.
• Knowledge of Service Mesh, preferably Istio.
• Understanding of networking principles, including DNS, TLS, certificates, and ingress management.
About Backblaze
Backblaze stands at the forefront of the open cloud movement, revolutionizing customer success with cloud storage designed to optimize budgets, ease administrative burdens, and empower innovators. With our partners, we are liberating customers from rigid, expensive legacy systems and enabling them to harness the full potential of the open cloud. Founded in 2007, we grew our business with less than $3 million in external funding until our traditional IPO on the Nasdaq in 2021. Today, Backblaze boasts over $100 million in revenue, serving over 500,000 customers across more than 175 countries, including businesses, developers, IT professionals, and individuals. While we celebrate our achievements, we are equally excited about the future. We are on the lookout for a Director of SRE to join our team!
About the Role:
We seek a dynamic and seasoned Director of SRE to lead our Cloud Operations leadership team. In this pivotal role, you will oversee the front-line teams responsible for delivering essential SRE production services. Your mission will be to spearhead initiatives that identify, prioritize, and implement opportunities to enhance our core operational capabilities, affecting a wide range of organizational aspects. As an advocate for engineering excellence, your focus will encompass performance measurement, incident/change management, problem resolution, and process discipline. This role offers a unique chance to significantly influence our company's growth trajectory and shape the future of our global operational footprint. This position may be remote; however, our team thrives on face-to-face collaboration. Given that our leadership is spread across the country, we encourage visits during our organized workshops held throughout the US. You will be responsible for managing a global workforce across various time zones.
What You'll Do:
As the Director of SRE, you will hold direct accountability for Backblaze’s production infrastructure and performance against key SLOs. You will guide and mentor our global Senior SRE and SRE Level 1 teams, ensuring operational excellence. Additionally, you will share the responsibility of managing demand forecasts and making strategic decisions regarding infrastructure expansion. You will also oversee the budget for all operational tooling and observability. At Backblaze, we cultivate a positive and supportive culture, valuing exceptional talent while investing in the growth and development of our team members.
Join Axon and be a Force for Good.
At Axon, we are driven by a mission to protect life. Our team tackles society's most pressing issues of safety and justice through a powerful ecosystem of devices and cloud software. We believe in collaboration, embracing diverse perspectives from our customers, communities, and each other.
Working at Axon is dynamic, challenging, and impactful. You will be empowered to take ownership and effect real change while growing in a mission-driven environment that values your contributions.
Your Impact
As a Senior Site Reliability Engineer on the APX SRE CloudOps team, you will be responsible for designing and constructing the cloud infrastructure and automation platforms that support Axon's product engineering teams. You will create solutions for multi-cloud environments (Azure, AWS), ensure compliance with FedRAMP standards, and manage large-scale Kubernetes platforms that handle production workloads across various regions. This role involves extensive coding to build services, APIs, and internal tools utilizing languages such as Go and Python. Additionally, you will take part in on-call rotations and incident response, leveraging your operational experience to enhance reliability and guide platform investments. This position merges software engineering expertise with cloud architecture and production accountability.
Location: This position is based in our Atlanta (Peachtree Corners), Seattle, or Boston office and operates on a hybrid schedule. We encourage in-person collaboration and require team members to work onsite from Tuesday to Friday, with the flexibility to work remotely on Mondays unless an approved workplace accommodation is in place. We believe that connection fosters innovation, and our in-office culture is designed to promote meaningful teamwork, mentorship, and collective success.
Apr 9, 2026