Experience Level
Senior
Qualifications
To succeed in this role, you should have:
Proven experience in backend development using modern programming languages such as Java, Python, or Go.
Strong understanding of database management systems and cloud services.
Experience with microservices architecture and RESTful API development.
Excellent problem-solving skills and the ability to work independently as well as part of a team.
A degree in Computer Science, Engineering, or a related field.
About the job
Join our dynamic team at Unify as a Senior Backend Software Engineer, where you will leverage your expertise to design and implement robust backend systems that enhance our product's performance and scalability. You will collaborate closely with cross-functional teams to develop innovative solutions that meet our customers' needs.
About Unify
Unify is a leading technology company focused on delivering cutting-edge solutions that drive digital transformation across industries. Our talented team thrives on innovation, collaboration, and a commitment to excellence. Join us to be part of a company that values creativity and technical prowess.
Similar jobs
Search for Senior Site Reliability Engineer At Unify San Francisco
About Unify
At Unify, we're pioneering the first AI-driven system of action for revenue teams. Our innovative approach empowers companies to transform their outbound strategies into a leading growth engine, ensuring that go-to-market execution is observable, repeatable, and scalable. Established in 2023 by visionaries from Ramp and Scale AI, our diverse team boasts experience from industry giants such as Airbnb, Meta, Waymo, and Perplexity.
Having achieved an impressive 8x revenue growth in 2024, we proudly serve esteemed clients including Perplexity, Cursor, SoFi, and Justworks. With a dynamic team that has successfully raised $58M from prominent investors like Thrive, Emergence, and OpenAI, we are at the forefront of revolutionizing the future of GTM. Come and be a part of this exciting journey!
About the Role
As a Senior Site Reliability Engineer (SRE) at Unify, you will play a pivotal role in addressing the challenges of scaling and maintaining reliability as we handle immense data volumes and support enterprise clients with stringent uptime standards. Your expertise will span the entire tech stack: optimizing databases, fortifying services, and crafting automation and observability tools to ensure Unify remains fast and dependable at scale.
Join Unify as a Senior Staff Site Reliability Engineer and take the lead in transforming our technology landscape. In this pivotal role, you will spearhead initiatives to enhance our system reliability and performance, ensuring seamless operations across our platforms. Your expertise will guide a dynamic team, driving innovation and implementing best practices in site reliability engineering.
Join our dynamic team at Unify as a Senior Backend Software Engineer, where you will leverage your expertise to design and implement robust backend systems that enhance our product's performance and scalability. You will collaborate closely with cross-functional teams to develop innovative solutions that meet our customers' needs.
Full-time|$200K/yr - $280K/yr|On-site|San Francisco Office
About Unify
Unify is revolutionizing the way revenue teams operate by developing the first AI-powered system of action. Our mission is to transform outbound strategies into a robust growth engine, ensuring that go-to-market execution is observable, repeatable, and scalable. Established in 2023 by industry leaders from Ramp and Scale AI, our talented team boasts backgrounds from renowned companies like Airbnb, Meta, Waymo, and Perplexity.
In 2024, Unify achieved an impressive 8x revenue growth and serves a diverse clientele including Perplexity, Cursor, SoFi, and Justworks. We are a dynamic team fueled by high energy and intensity, having secured $58M in funding from leading investors such as Thrive, Emergence, and OpenAI. Join us in shaping the future of GTM!
About the Role
As a Senior Software Engineer specializing in AI at Unify, you will be at the forefront of innovation, developing new AI products and enhancing our AI platform. Your work will encompass agents, retrieval systems, classification, fine-tuning, reinforcement learning, and LLM inference infrastructure. If you are passionate about creating production-ready AI systems and wish to contribute to cutting-edge applications used by premier GTM teams, this role is for you.
About Unify
At Unify, we are pioneering the first AI-driven system of action for revenue teams, enabling businesses to transform their outbound strategies into high-performing growth engines. Our focus is on making go-to-market execution measurable, repeatable, and scalable. Founded in 2023 by industry veterans from Ramp and Scale AI, our talented team has diverse experience from leading organizations such as Airbnb, Meta, Waymo, and Perplexity.
In 2024, Unify achieved an impressive 8x revenue growth and serves notable clients including Perplexity, Cursor, SoFi, and Justworks. We are a dynamic, high-energy team backed by $58M in funding from Thrive, Emergence, OpenAI, and others. Join us as we shape the future of GTM!
About the Role
As the Staff SRE Tech Lead at Unify, you will be instrumental in enhancing the reliability and scalability of our platform as we handle increasing volumes of data and accommodate customers with stringent uptime requirements. You will define the technical roadmap for reliability engineering, lead a dedicated team of SREs, and collaborate closely with engineering leaders to establish systems and practices that ensure Unify remains both swift and dependable at scale.
Full-time|$166.9K/yr - $225.9K/yr|Hybrid|Hybrid - San Francisco
Drata helps organizations demonstrate their commitment to security and integrity. The platform supports companies as they build and maintain trust with users, customers, partners, and prospects.
Values
Built on Trust: Consistency shapes decisions and actions.
Integrity: Choosing to do what is right, every time.
Customer-Obsessed: Prioritizing customer needs above all else.
Competitive Fire: Striving for higher standards and greater achievements.
Diversity: Welcoming different perspectives to encourage creative solutions.
Automation First: Pursuing efficiency by saving time and resources wherever possible.
How the Team Works
Drata blends high standards with a supportive environment focused on growth. Team members are encouraged to own their work, improve continuously, and deliver meaningful results. The company values quick, informed decisions that drive immediate impact, while always keeping the mission and customer needs at the center. The San Francisco-based team uses a hybrid work model. Colleagues collaborate in the office Tuesday through Thursday, focusing on alignment and innovation. Mondays and Fridays offer flexibility for deep work or personal needs.
Growth and Culture
Drata has expanded to over 600 professionals worldwide, recognized for a culture that values trust, speed, and continuous learning. The environment supports both personal and professional development.
See the Speed: CEO Adam Markowitz discusses Drata’s rapid journey to $100M ARR in four years.
Hear the Voice of the Team: Employee stories highlight collaboration and growth at Drata.
Full-time|$225K/yr - $295K/yr|On-site|San Francisco Office
About Unify
At Unify, we're pioneering the first AI-driven system of action for revenue teams. Our innovative platform empowers companies to revolutionize their outbound strategies, transforming them into high-performing growth engines by enhancing go-to-market execution to be observable, repeatable, and scalable. Founded in 2023 by industry veterans from Ramp and Scale AI, our team boasts experience from leading firms such as Airbnb, Meta, Waymo, and Perplexity.
In 2024, Unify achieved an impressive 8x revenue growth and proudly serves esteemed clients including Perplexity, Cursor, SoFi, and Justworks. With a dynamic and energetic team, we’ve successfully raised $58M from top investors like Thrive, Emergence, and OpenAI. Join us in building the future of GTM!
About the Role
As an early Engineering Manager at Unify GTM, you will have the unique chance to shape the foundation of a fast-paced, high-impact engineering organization at the forefront of AI technology.
You will lead one of our core product teams, taking charge of delivery, quality, and team dynamics, while closely collaborating with product and design to turn visionary ideas into reality. You'll also play a pivotal role in establishing the scaffolding for our scaling efforts: defining our culture, systems, and standards of technical and operational excellence.
This is a significant, trusted role during a period of rapid expansion, ideal for someone eager to build, lead, and evolve with an ambitious team.
Who We Are
At Hyperbolic Labs, we are committed to democratizing AI by removing barriers to computing power with our Open-Access AI Cloud. By aggregating global computing resources, we provide an innovative GPU marketplace and AI inference service that ensures both affordability and accessibility. As trailblazers at the convergence of AI and open-source technology, we envision a future where AI innovation is only limited by creativity, not by resource availability. We invite forward-thinking individuals who share our dedication to making AI universally accessible, secure, and affordable. Join us in crafting a platform that empowers innovators worldwide to realize their visionary AI projects.
In anticipation of our growth following our Series A funding, our team, guided by co-founders with advanced degrees in AI, Mathematics, and Computer Science, is set to transform the computing landscape.
About the Role
We are in search of a skilled Site Reliability Engineer to guarantee that Hyperbolic's GPU marketplace and AI infrastructure function with outstanding reliability, performance, and security. As an aggregator of computational resources from numerous global providers, our service level objectives (SLOs), trust, and economic efficiency are critical to our product. Your key responsibilities will include defining and maintaining service level objectives, developing resilient incident response protocols, managing capacity across our extensive GPU network, and implementing secure rollout and rollback mechanisms to ensure uninterrupted platform operation around the clock.
In this influential role, you'll set the reliability benchmarks that foster customer trust in our platform, design comprehensive monitoring and alerting systems for enhanced infrastructure visibility, automate capacity management and resource allocation processes, lead incident response and post-mortem evaluations, and collaborate closely with engineering teams to bolster system resilience. Security and infrastructure hardening will be paramount, necessitating strong isolation protocols between tenants and suppliers, the implementation of effective key management systems, and the establishment of compliance frameworks. This high-impact position will directly affect our ability to deliver on our commitment to providing affordable, accessible AI compute at scale.
About Unify
At Unify, we are pioneering an innovative AI-driven system of action for revenue teams. Our mission is to revolutionize outbound strategies into a high-performing growth engine, making go-to-market execution transparent, repeatable, and scalable. Founded in 2023 by industry leaders from Ramp and Scale AI, our talented team boasts experience from notable companies like Airbnb, Meta, Waymo, Perplexity, and Monday.com.
In 2024, Unify achieved an impressive revenue growth of 8x, serving esteemed clients such as Perplexity, Cursor, SoFi, and Justworks. We are a dynamic and high-energy team backed by $58M in funding from Thrive, Emergence, OpenAI, and more. Join us in shaping the future of GTM!
About the Role
As the first Backend Staff Software Engineer at Unify, you will play a crucial role in scaling our core platform to accommodate our rapid growth. Collaborating directly with the founders, you will bring innovative product ideas to life, establish a clear technical vision, and set engineering standards that will guide our evolving company.
Full-time|On-site|San Francisco, California; Santa Clara, California; Seattle, WA
Join Carta as a Senior Site Reliability Engineer, where you will play a pivotal role in enhancing our infrastructure and ensuring the reliability of our platforms. You will work collaboratively with cross-functional teams to implement innovative solutions that drive operational excellence and scalability.
Role overview
The Senior Site Reliability Engineer at Prosper plays a key role in maintaining and improving the reliability and performance of the company’s core systems. Collaboration with teams across the organization is essential to ensure services remain stable and efficient.
What you will do
Design and set up monitoring tools to track the health and performance of systems
Automate routine operational tasks to minimize manual intervention and boost efficiency
Diagnose and resolve complex technical problems that impact infrastructure or services
Support projects aimed at strengthening infrastructure stability and preparing for future growth
Location
This role is located in San Francisco, CA.
About Plaud Inc.
Plaud is revolutionizing the way professionals enhance productivity and performance with our trusted AI work companion. Our innovative note-taking solutions have gained the admiration of over 1,500,000 users globally since our inception in 2023. We are on a mission to amplify human intelligence by developing next-generation intelligence infrastructure and interfaces that seamlessly capture, extract, and leverage what you say, hear, see, and think.
Based in San Francisco, Plaud Inc. is a Delaware-incorporated company that is redefining the boundaries of human-AI collaboration through a unique combination of hardware and software solutions. We adhere to the highest standards of data security and privacy protection, with certifications including ISO 27001, ISO 27701, GDPR, SOC 2, HIPAA, and EN 18031 compliance.
Discover more about our innovative solutions by visiting https://www.plaud.ai and follow us on Instagram, X, Facebook, LinkedIn, and YouTube.
Why You Should Join Us
At Plaud, you will play a pivotal role in shaping the future of human-AI interaction. Here’s what we offer:
A thriving, bootstrapped company with a remarkable $250M revenue run rate achieved in just three years.
An opportunity to define the next-generation paradigm for human-AI interaction.
Direct exposure to cutting-edge AI tools for professionals and a chance to contribute to our global expansion.
Collaborate with a passionate team that values innovation, teamwork, and customer success.
Advance your career in a culture that promotes continuous learning and rapid career growth.
Join the Mercor Team
At Mercor, we stand at the dynamic intersection of labor markets and AI research. Collaborating with premier AI labs and enterprises, we empower the human intelligence that is crucial for AI's evolution.
Our expansive talent network plays a vital role in training cutting-edge AI models, akin to the way educators impart knowledge to their students, by sharing insights, experiences, and contextual understanding that code alone cannot convey. Currently, our network of over 30,000 experts generates more than $2 million daily.
We are pioneering a novel category of work where expertise fuels AI progress. Achieving this vision necessitates an ambitious, fast-paced, and deeply dedicated team. You will collaborate with researchers, operators, and AI firms that are at the forefront of transforming societal structures.
Mercor is a thriving Series C company with a valuation of $10 billion. We operate five days a week in-person at our new headquarters in San Francisco.
About the Role
As a Site Reliability Engineer (SRE) at Mercor, you will take ownership of production reliability for our critical systems, working closely with our infrastructure leadership. You will play a pivotal role in establishing our SRE function and defining how Mercor manages large-scale, high-availability systems.
Your Responsibilities
Ensure the reliability and safety of production for key shared services and customer-facing systems.
Collaborate directly with infrastructure leadership to outline SRE priorities, reliability benchmarks, and the production safety roadmap.
Enhance the structure of our production systems to ensure stability, resource efficiency, isolation, and observability.
Advocate for and implement modern SRE methodologies (e.g., incident management, postmortems, SLIs/SLOs) across engineering teams.
Work alongside engineering and applied AI teams to facilitate sustainable growth.
Promote SRE best practices internally, supporting teams in a safe, scalable, and consistent production onboarding process.
Who We Seek
The ideal candidate will have:
Extensive experience in genuine SRE roles (not merely operations) across various positions or organizations.
A deep understanding of SRE methodologies popularized by Google (e.g., error budgets, reliability vs. risk trade-offs, large-scale distributed systems).
5+ years of SRE experience; ideally, 15+ years in total experience for this inaugural SRE position.
A proven track record of managing systems at scale, with a strong grasp of the complexities involved.
Full-time|$214K/yr - $260K/yr|Hybrid|Hub - San Francisco
At Superhuman, we embrace a vibrant hybrid work model that offers our team members the ideal blend of focused individual work and collaborative in-person interactions, fostering trust, innovation, and a robust team culture.
About Superhuman
Superhuman, the AI productivity platform, is on a transformative mission to unlock the superhuman potential within everyone. With the integration of Grammarly's writing assistance and innovative tools like Coda’s collaborative workspaces and Go, our proactive AI assistant, we empower over 40 million individuals and 50,000 organizations globally. Founded in 2009, we strive to eliminate busywork and enhance productivity. Discover more at superhuman.com and explore our values here.
The Opportunity
To meet our ambitious goals, we are seeking a Site Reliability Engineer (SRE) to join our infrastructure team. This pivotal role focuses on developing software solutions to maintain the reliability of our back-end systems while collaborating with engineering teams to strategize our future growth. You will also engage with our production engineering teams in Europe as we transition from a “you build it, you own it” approach.
At Superhuman, our engineers and researchers enjoy the autonomy to innovate and drive breakthroughs, directly impacting our product roadmap. As we rapidly scale our interfaces, algorithms, and infrastructure, the complexity of our technical challenges is growing. Learn more about our technical endeavors on our technical blog.
As an SRE, your responsibilities will include:
Scaling our Kubernetes-based control plane that processes billions of events each day.
Enhancing our automation mechanisms to efficiently respond to workload demands.
Deploying machine learning systems across various departments.
Full-time|$144K/yr - $258K/yr|On-site|San Francisco
At Braze, we pride ourselves on cultivating a team that is genuinely approachable, exceptionally kind, and intensely passionate about what we do.
We aim to fuel this passion by establishing high standards, promoting teamwork, and fostering a harmonious work-life balance as we collectively navigate rapid global growth, all while striving for greater equity and opportunity both within and outside our organization.
To thrive in our environment, you should be prepared to hold yourself and those around you to high standards. There are always opportunities for contribution: acting with autonomy, taking accountability, and being open to new perspectives are fundamental to our ongoing success.
Our deep curiosity and eagerness to share diverse passions with one another enrich our culture with a unique vibrancy.
If you are motivated to tackle exciting challenges and have a proactive mindset amid change, you will be empowered to make a significant impact here, backed by a sharp and passionate team. If Braze sounds like the right fit for you, we look forward to meeting you!
WHAT YOU'LL DO
As a Site Reliability Engineer (SRE), you will be responsible for ensuring the smooth operation of all internal-facing services and platforms, ultimately guaranteeing site uptime. SREs integrate the roles of system administrators and software engineers, applying sound engineering principles, operational discipline, and mature automation techniques to the infrastructure services we deliver. Our expertise spans systems such as networking, the Linux kernel, and specialized interests in scaling algorithms or distributed systems.
Our team plays a crucial role in enhancing automation, infrastructure reliability, and empowering Braze’s engineering teams to leverage the infrastructure products and platforms we develop with ease. Braze operates at a massive scale, supporting over 3.3 billion monthly active users across our customers, processing hundreds of billions of data points each month, and delivering billions of messages to end-users daily. Our diverse technology stack includes Ruby on Rails, MongoDB, Redis, Kafka, Kubernetes, and more. As a Senior Site Reliability Engineer at Braze, you will collaborate with your team and consumer engineering groups to continually enhance the infrastructure, automation, and tooling that power our internal products built on these technologies.
Main responsibilities:
Collaborate with Braze’s engineering teams to:
Design products that effectively utilize infrastructure platforms in a scalable and reliable manner
Troubleshoot reliability and scalability issues across all layers of the stack, including products built on our infrastructure platforms
Implement monitoring solutions and improve overall system performance...
Full-time|$214K/yr - $260K/yr|Hybrid|San Francisco, CA
At Superhuman, we embrace a flexible hybrid working model that combines focused work time with in-person collaboration, fostering trust, innovation, and a vibrant team culture.
About Superhuman
Superhuman, now part of Grammarly, is an AI productivity platform dedicated to unlocking the superhuman potential in everyone. Our suite of applications integrates AI with over 1 million tools and websites, offering innovative solutions such as Grammarly's writing assistance, Coda's collaborative workspaces, Mail's inbox management, and Go, our proactive AI assistant. Since our inception in 2009, we have empowered over 40 million individuals and 50,000 organizations worldwide, enabling them to eliminate busywork and focus on what truly matters. Discover more at superhuman.com and explore our values here.
The Opportunity
In pursuit of our ambitious goals, we are seeking a Site Reliability Engineer to enhance our infrastructure team. This pivotal role involves building software that ensures the reliability of our back-end systems while collaborating closely with our engineering teams. You will also help plan for our future growth as we shift from a “you build it, you own it” model.
Our engineers and researchers enjoy the freedom to innovate and influence our product roadmap, tackling increasingly complex technical challenges as we scale our systems. Learn more about our technical endeavors on our technical blog.
As a Site Reliability Engineer, your responsibilities will include:
Scaling our Kubernetes-based control plane, processing billions of events daily.
Enhancing our automation mechanisms in response to workload demands.
Deploying machine learning systems across the organization.
Join Our Team as a Site Reliability Engineer
Blaxel is seeking a highly skilled Site Reliability Engineer to enhance the reliability, performance, and scalability of our cutting-edge AI infrastructure platform.
In this role, you will develop and manage the essential systems that support scalable agentic AI. Your primary goal: maintain our ultra-low-latency, stateful, serverless compute engine, ensuring it remains robust as we handle billions of agent requests from the world's most advanced AI teams.
This position is deeply technical and execution-oriented. You will take charge of our reliability framework, encompassing observability, performance optimization, incident management, infrastructure health, and the automation processes that ensure seamless operations. We are looking for innovators who can design new reliability systems, advance automation capabilities, and continuously adapt the platform to accommodate next-generation AI workloads. If you are a builder who excels in managing critical infrastructure at scale, we want to hear from you.
Your Responsibilities
Working closely with our founders, infrastructure team, and development team, and leveraging AI for maximum efficiency, you will architect and manage the systems that keep Blaxel fast, resilient, and secure.
Design, operate, and iteratively enhance the core infrastructure that drives our 25ms cold-start compute engine.
Develop and refine our observability stack (metrics, traces, logs), ensuring proactive issue detection.
Establish, monitor, and drive SLOs/SLIs across vital system components to ensure world-class reliability.
Lead incident response with precision: conduct root cause analyses, post-mortems, and implement systemic solutions.
Design and deploy self-healing, automated operational systems to minimize manual work and scale operations.
Collaborate across compute, networking, storage, and sandboxed execution layers to optimize performance under intense workloads.
Create automation tools, often utilizing AI agents, to enhance operations, debugging, capacity planning, and failure predictions.
Test and stress our systems to their limits: engage in load testing, chaos engineering, and performance benchmarking.
Champion security best practices at the infrastructure level, from sandboxed compute to network isolation.
Collaborate with platform engineers to ensure reliability is an integral part of new features from inception.
Who You Are
Extensive technical expertise in site reliability engineering, with a passion for building scalable systems.
Full-time|$129.6K/yr - $258K/yr|On-site|San Francisco
At Braze, we pride ourselves on our exceptional team – approachable, kind, and deeply passionate about our work.
We aim to channel that passion through high standards, strong teamwork, and a commitment to work-life balance as we navigate rapid global growth while promoting equity and opportunity both within and outside our organization.
To thrive in this environment, you should be ready to set high expectations for yourself and your colleagues. There are endless opportunities to contribute: exercising autonomy, embracing accountability, and welcoming diverse perspectives are crucial to our ongoing success.
Our deep curiosity and eagerness to share our varied passions foster a unique vibrancy within our culture.
If you're motivated to tackle exciting challenges and possess a proactive mindset in the face of change, you'll be empowered to make a significant impact here, supported by a talented and passionate team. If Braze feels like the right fit for you, we look forward to meeting you!
WHAT YOU'LL DO
We are seeking a Senior Site Reliability Engineer to join our Currents team, responsible for the development, maintenance, and evolution of Currents, our scalable data export system. This system is a robust Kafka-based event pipeline that processes tens of billions of messages daily, enabling our clients to analyze user behavior in near real-time.
As a key member of our highly collaborative and skilled team, you will take projects from concept to production while enhancing our existing high-scale systems. Utilizing your experience and teamwork skills, you will address significant engineering challenges associated with managing a critical data streaming system. In your role as a Senior Site Reliability Engineer, you will focus specifically on the aspects of observability, scalability, and reliability strategy across all projects.
Specific responsibilities will include:
Resolving live performance and reliability issues while preventing future occurrences
Writing and reviewing code, educating engineers, and fostering a culture of reliability
Implementing sustainable incident response practices and conducting blameless postmortems
Establishing and promoting standards for monitoring, reliability, and performance
Facilitating collaboration between infrastructure and platform engineering teams
Supporting and enhancing services with a focus on scalability and reliability
Mentoring junior engineers in SRE best practices, software development, and agile project leadership
About Unify
Unify is pioneering the first AI-powered system of action for revenue teams. Our innovative technology is transforming outbound strategies into a powerful growth engine by ensuring that go-to-market execution is observable, repeatable, and scalable. Founded in 2023 by visionaries from Ramp and Scale AI, our team is comprised of experienced professionals from leading companies like Airbnb, Meta, Waymo, and Perplexity.
In 2024, Unify achieved an impressive revenue increase of 8x and now serves a diverse clientele, including Perplexity, Cursor, SoFi, and Justworks. We are a dynamic and high-energy team that has successfully raised $58M from esteemed investors such as Thrive, Emergence, and OpenAI. Join us as we shape the future of GTM!
About the Role
As a Staff Backend Engineer at Unify, you will play a crucial role in enhancing the reliability and scalability of our platform, which processes terabytes of data monthly while maintaining stringent uptime requirements for our clients. You will lead a dedicated team of Site Reliability Engineers (SREs) and collaborate closely with engineering leadership to establish systems and practices that ensure Unify remains fast and dependable as we scale.
About Us
At Heidi, we believe healthcare should have a more harmonious flow, one that prioritizes continuous and compassionate care. Our mission is to develop an AI Care Partner that collaborates with healthcare professionals to achieve this vision.
We are a diverse team of medical practitioners, engineers, designers, researchers, and visionaries dedicated to creating tools that allow clinicians to concentrate on what truly matters: their patients.
In just 18 months, Heidi has enabled healthcare professionals to reclaim over 18 million hours, facilitating 73 million patient visits across 116 countries. We currently support more than two million patient visits globally each week.
With nearly $100 million in funding, we are expanding our reach across the US, UK, Canada, and Europe, collaborating with top-tier health systems such as the NHS, Beth Israel Lahey Health, and Monash Health.
Your Role
Incident Response and On-Call Duties: Take part in incident management, addressing production issues, aiding in service restoration, and ensuring effective communication throughout. As you gain experience, you'll lead incidents from start to finish.
Enhancing Operational Reliability: Identify and address recurring issues and reliability threats, implementing improvements through enhanced alerting, automation, system modifications, or process enhancements.
Ownership of Production Environment: Manage and enhance Kubernetes clusters, cloud infrastructure, and core platform services, gradually increasing your ownership as you become more familiar with our systems.
Observability Improvement: Refine dashboards, alerts, logs, and traces to enable quicker issue detection and resolution, focusing on actionable insights.
Minimizing Operational Toil: Automate routine tasks, streamline runbooks, and enhance tools to simplify on-call responsibilities and daily operations.
Facilitating Safe Changes: Enhance deployment methods, rollback strategies, and operational readiness to mitigate the risks of incidents due to changes.
Contribution to Operational Practices: Document and maintain runbooks, engage in blameless post-mortems, and assist in refining incident response protocols over time.
Collaboration with Engineering Teams: Work closely with product and feature teams to ensure seamless integration and functionality.
Feb 26, 2026