Site Reliability Engineer

Nebius

Remote - United States

Remote Full-time

About the job

Why Join Nebius?
At Nebius, we are pioneering a transformative approach to cloud computing tailored for the global AI economy. Our mission is to equip our clients with innovative tools and resources that address real-world challenges, all while minimizing infrastructure costs and eliminating the need for extensive in-house AI/ML teams. Here, you will work at the forefront of AI cloud infrastructure alongside some of the industry's most talented leaders and engineers.

About Us
Based in Amsterdam and publicly traded on Nasdaq, Nebius has R&D centers across Europe, North America, and Israel. Our team comprises over 1,400 professionals, including more than 400 highly skilled engineers with deep expertise in hardware and software engineering, complemented by a dedicated in-house AI R&D team.

The Role

Nebius is currently seeking a Site Reliability Engineer to join our Hardware Infrastructure team. While there is an opportunity to work from our Amsterdam office, this position is also available remotely within the United States.

The Hardware Infrastructure team is responsible for designing, developing, and supporting systems integral to the data center lifecycle, including:

Functional and load testing systems.
Monitoring engineering equipment in our data centers (power supply, air and water cooling, etc.).
Monitoring IT equipment: racks, servers, JBODs, JBOGs, power shelves, and network devices.
Asset tracking.
Managing hardware repair tasks.
Server production oversight.

Key Responsibilities:

Ensure fault tolerance, scalability, and continuous operation of our services.
Employ cutting-edge technologies to resolve a variety of infrastructure challenges.
Implement and enhance CI/CD processes.

Qualifications:

Strong proficiency in Linux systems, with expertise in Python and Bash scripting for automation purposes.
Proven ability to troubleshoot complex system issues, covering hardware, software, and networking problems.
Excellent analytical and problem-solving abilities, with a focus on optimizing system performance.
Fluent working proficiency in English.

Preferred Qualifications:

A keen interest in backend development.
Experience in designing, developing, and maintaining hardware infrastructure.

About Nebius

Nebius is redefining cloud computing for the AI economy, providing tools that help customers tackle real challenges without incurring hefty infrastructure costs. Our culture fosters innovation, collaboration, and professional growth.
