Experience Level
Senior
Qualifications
What You Need to Succeed
Educational Background: Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent hands-on experience.
Professional Experience: At least 5 years in a DevOps, Infrastructure, or Site Reliability Engineering role within a fast-paced tech environment or startup.
Technical Skills: Proficiency in CI/CD tools (e.g., GitHub Actions, ArgoCD), Docker, and Kubernetes.
Infrastructure Expertise: In-depth knowledge of cloud services (AWS/GCP), distributed systems, Infrastructure as Code (IaC) tools like Terraform or Pulumi, and secrets management solutions (e.g., Vault, SSM).
Observability Acumen: Strong grasp of logging, metrics, and monitoring practices in large-scale distributed systems.
About the job
Join CodeRabbit as a Senior DevOps Engineer!
At CodeRabbit, we are at the forefront of research and development, crafting cutting-edge systems for human-machine collaboration. Our mission is to revolutionize software development by creating the next generation of AI-driven code review tools. These advancements represent a powerful synergy between human creativity and advanced algorithms, allowing us to maximize productivity and elevate code quality to unprecedented heights.
As a Senior DevOps Engineer, you will be instrumental in scaling, securing, and optimizing the infrastructure that fuels our AI-powered developer tools. Collaborating closely with our platform engineers, backend developers, and applied AI teams, you will ensure our systems are robust, efficient, and easy to deploy, all while maintaining high standards of observability and resilience.
This position is ideal for a proactive individual who thrives in dynamic environments, takes initiative with critical infrastructure, and is passionate about developing tools that empower an ambitious engineering team.
About CodeRabbit
CodeRabbit is a pioneering research and development firm dedicated to enhancing human-machine collaboration through innovative AI-driven solutions. Our aim is to build tools that not only improve the efficiency of software development but also foster a collaborative environment between human engineers and advanced algorithms.
Similar jobs
About CodeRabbit
CodeRabbit is a pioneering research and development firm dedicated to creating highly efficient human-machine collaboration systems. Our mission is to develop the next generation of AI-driven code review tools, fostering a harmonious partnership between human creativity and advanced algorithms that far exceed the capabilities of individual engineers. By merging language models with human innovation, we aim to elevate the standards of efficiency and quality in software development.
The Role
We are in search of a talented Site Reliability Engineer (SRE) to become a vital part of our Platform Engineering team located in the Bay Area. In this role, you will play a crucial part in maintaining the high availability, performance, and scalability of CodeRabbit's AI-enhanced code review platform. This position lies at the nexus of software engineering and systems operations, where you will construct the foundational platforms and automation that empower our engineering teams to deploy, monitor, and scale our services with reliability.
As a Site Reliability Engineer at CodeRabbit, your responsibilities will include improving the reliability of our essential services that handle millions of code reviews, developing sophisticated automation platforms, and managing the infrastructure that drives our AI analysis engine. You will engage with cutting-edge technologies such as large language models, real-time processing systems, and distributed architectures that function at scale.
Key Responsibilities
Infrastructure & Platform Ownership
Design, implement, and maintain scalable infrastructure on Google Cloud Platform to accommodate CodeRabbit's expanding user base and processing needs.
Take ownership of and operate essential platform services.
Develop and manage Infrastructure as Code using Terraform to guarantee consistent, reproducible, and version-controlled infrastructure deployments.
Reliability & Performance Engineering
Establish and uphold SLI/SLO frameworks for all critical services, ensuring we fulfill our reliability commitments to users.
Implement comprehensive monitoring, alerting, and observability solutions utilizing Datadog and custom instrumentation.
Conduct in-depth incident response, root cause analysis, and post-mortem processes to continually enhance system reliability.
Optimize application and infrastructure performance to manage millions of pull request analyses with minimal latency.
About CodeRabbit
CodeRabbit is a pioneering research and development firm that specializes in crafting highly efficient human-machine collaboration systems. Our mission is to develop the next wave of AI-driven code review solutions: a collaborative synergy between human creativity and advanced algorithms that surpasses the capabilities of individual engineers. By integrating state-of-the-art language models with human insight, we aim to redefine the standards of software development efficiency and quality.
Role Overview
As a DevOps Engineer at CodeRabbit, you will be instrumental in scaling, securing, and fortifying the infrastructure that supports our AI-powered developer tools. Collaborating with our platform engineers, backend team, and applied AI specialists, you will ensure our systems are robust, observable, high-performing, and easy to deploy.
This position is hands-on and tailored for an individual who excels in a dynamic environment, takes initiative in managing critical infrastructure, and is eager to develop tools that empower an ambitious engineering team.
Responsibilities
Design, implement, and manage scalable CI/CD pipelines.
Develop and oversee infrastructure as code (e.g., Terraform, Pulumi).
Enhance system reliability through effective monitoring, alerting, logging, and failover strategies.
Collaborate with platform and backend teams to identify and mitigate performance bottlenecks.
Contribute to deployment workflows, environment automation, and developer tooling enhancements.
Ensure that infrastructure security and compliance measures are rigorously enforced.
Join CodeRabbit as Our Lead Security Engineer
At CodeRabbit, we are at the forefront of innovation in research and development, dedicated to creating groundbreaking human-machine collaboration systems. Our vision is to revolutionize the future of software development through the integration of Gen AI-driven code reviewers, facilitating an unparalleled partnership between human creativity and advanced algorithms. By harnessing the power of language models and human intellect, we aim to redefine efficiency and quality in software development.
Position Summary:
We are seeking a seasoned Lead Security Engineer to join our mission of empowering developers with high-performance tools in a rapidly evolving threat landscape. In this pivotal role, you will be responsible for architecting, fortifying, and safeguarding our infrastructure and ecosystem.
As the Lead Security Engineer, you will infuse security into all aspects of our product and infrastructure, serving as the guardian of resilience, incident response, and proactive defense at scale.
Key Responsibilities:
Define the Security Roadmap: Develop and implement a strategic security engineering plan that aligns with CodeRabbit’s agile engineering processes.
Enhance Resilience: Advocate for defense-in-depth strategies, including threat modeling, secure design reviews, hardening, and CI/CD integration.
Lead Incident Response: Take charge of security incident response and recovery, ensuring effective triage, resolution, and root cause analysis to bolster system integrity.
Security Tools & Automation: Develop or integrate security tools (SAST, DAST, SIEM, EDR, monitoring) seamlessly into the developer workflow to maintain high delivery velocity.
Integrate Security Practices: Collaborate with engineering and product teams to ensure secure practices are incorporated early in project planning and daily operations.
Cultivate Talent & Culture: Contribute to hiring, coaching, and mentoring a resilient security engineering team while promoting security awareness throughout the organization.
Establish Compliance & Policy: Develop security standards, frameworks, and processes that evolve with our growth while remaining streamlined and developer-friendly.
Qualifications:
Proven Experience: 8+ years in security engineering, incident response, or related fields. Leadership experience during critical situations is a plus.
Technical Proficiency: In-depth knowledge of security best practices, threat modeling, and incident management.
Collaborative Mindset: Strong interpersonal skills and the ability to work effectively across multidisciplinary teams.
Adaptability: Eagerness to learn and adapt in a dynamic and fast-paced environment.
Join CodeRabbit as a Full Stack Engineer and be at the forefront of technological innovation. As a key member of our development team, you will design, develop, and maintain web applications that enhance user experience and drive business success.
About CodeRabbit
At CodeRabbit, we are at the forefront of research and development, dedicated to innovating human-machine collaboration systems. Our mission focuses on developing the next generation of AI-driven code reviewers, fostering a harmonious partnership between human creativity and advanced algorithms that far exceed the capabilities of any single engineer. By merging sophisticated language models with human insight, we aim to revolutionize the efficiency and quality of software development.
About The Role
We are seeking a dynamic Director of Demand Generation who will establish a consistent pipeline and a self-sustaining revenue stream for our developer-centric product. This role encompasses ownership of the entire demand generation system, from customer acquisition and conversion to lifecycle management and revenue generation.
In this position, you will spearhead strategic initiatives and execution, while scaling operations through the recruitment and mentorship of a small, high-performance team. A hands-on approach is essential as you will engage with tools and workflows related to list building, enrichment, and orchestration, utilizing platforms like Clay.
You will collaborate closely with teams across Product Marketing, Product Development, Sales, RevOps, and Design to translate technical value into user activation, growth, and revenue generation.
Responsibilities
Demand Strategy and Targets: Lead marketing-sourced outcomes, focusing on Product Qualified Leads (PQLs), self-service revenue, and sales-assisted pipeline, while meeting quarterly targets and managing budget allocations.
Campaign Leadership: Design and execute integrated campaigns across key use cases, segments, and personas, ensuring alignment of messaging, offers, and conversion paths with Product Marketing Managers and content teams.
Channel Ownership: Oversee performance across various channels including paid search, paid social, retargeting, SEO, and collaborative marketing efforts with partners, optimizing successful strategies for scale.
Funnel Management and Conversion Rate Optimization (CRO): Enhance conversion rates from initial clicks to signups or installations, leading landing page strategy and experimentation to improve conversion metrics.
Lifecycle Management and Product-Led Growth (PLG): Work in tandem with Product and Growth teams to define activation events and PQL thresholds, developing lifecycle programs that boost time-to-value, retention, and expansion.
Modern Outbound Workflows Using Clay: Create and manage enrichment and orchestration workflows through Clay, including account list building, scoring, personalization parameters, and CRM routing with safeguards.
Measurement and Reporting: Establish clear stage definitions and dashboards to monitor success metrics, partnering with analytics teams for data-driven insights.
Full-time|$175K/yr - $190K/yr|On-site|San Francisco
About CodeRabbit
At CodeRabbit, we are at the forefront of innovation, dedicated to developing cutting-edge systems that enhance human-machine collaboration. Our mission is to redefine the future of software development through the creation of advanced AI-driven code reviewing tools. By merging sophisticated algorithms with human creativity, we aim to elevate the efficiency and quality of software engineering practices.
Role Overview
As an Enterprise Solutions Engineer, you will act as a key technical consultant and advocate for our clients. Collaborating closely with Sales, Product, Engineering, and Customer Success teams, you will guide customers through their pre-sales journey, helping them recognize the transformative value of CodeRabbit's solutions. This role is ideal for individuals who are technically inquisitive and enthusiastic about developing AI-integrated solutions throughout the Software Development Life Cycle (SDLC). Join us in a dynamic, collaborative work environment where your contributions will make a significant impact on our customers' success.
Key Responsibilities:
Partner with Account Executives to comprehend customer needs and craft tailored technical solutions.
Convert business goals into technical specifications and solution architectures.
Lead engaging demonstrations and impactful Proof of Value (PoV) sessions to effectively communicate the benefits of the CodeRabbit platform.
Assist in addressing security-related inquiries from clients.
Continuously provide constructive feedback and work cross-functionally with product and engineering teams to resolve customer challenges and enhance the user experience.
Stay informed about industry trends and emerging technologies in AI-powered development tools.
Contribute to the enhancement of internal resources, including playbooks, guides, and best practices.
Facilitate the transition from pre-sales to post-sales, ensuring a seamless onboarding experience for customers.
About CodeRabbit
CodeRabbit is a groundbreaking research and development firm dedicated to creating highly efficient human-machine collaboration systems. Our mission is to develop the next generation of AI-driven code review tools that foster a collaborative relationship between human developers and advanced algorithms, yielding superior outcomes compared to individual efforts. We leverage state-of-the-art language models combined with human creativity to redefine the standards of software development efficiency and quality.
Role Overview
As a Backend Software Engineer at CodeRabbit, you will play a pivotal role in crafting advanced AI applications that transform the code review landscape. Your work will be situated at the convergence of intelligent systems and software engineering, enabling developers to rapidly iterate within intricate environments. Your skills will be invaluable in designing solutions that improve code quality, scalability, and developer productivity.
In this position, you will architect and develop essential backend systems that drive our AI agent workflows, context-aware code reviews, repository planning tools, and dashboard interfaces. Collaborating with AI researchers, infrastructure experts, frontend developers, and product strategists, you will contribute to building robust, intelligent backend systems that scale effectively to support our tools for developers.
Full-time|$166.9K/yr - $225.9K/yr|Hybrid|Hybrid - San Francisco
Drata helps organizations demonstrate their commitment to security and integrity. The platform supports companies as they build and maintain trust with users, customers, partners, and prospects.
Values
Built on Trust: Consistency shapes decisions and actions.
Integrity: Choosing to do what is right, every time.
Customer-Obsessed: Prioritizing customer needs above all else.
Competitive Fire: Striving for higher standards and greater achievements.
Diversity: Welcoming different perspectives to encourage creative solutions.
Automation First: Pursuing efficiency by saving time and resources wherever possible.
How the Team Works
Drata blends high standards with a supportive environment focused on growth. Team members are encouraged to own their work, improve continuously, and deliver meaningful results. The company values quick, informed decisions that drive immediate impact, while always keeping the mission and customer needs at the center.
The San Francisco-based team uses a hybrid work model. Colleagues collaborate in the office Tuesday through Thursday, focusing on alignment and innovation. Mondays and Fridays offer flexibility for deep work or personal needs.
Growth and Culture
Drata has expanded to over 600 professionals worldwide, recognized for a culture that values trust, speed, and continuous learning. The environment supports both personal and professional development.
See the Speed: CEO Adam Markowitz discusses Drata’s rapid journey to $100M ARR in four years.
Hear the Voice of the Team: Employee stories highlight collaboration and growth at Drata.
About CodeRabbit
At CodeRabbit, we are pioneering research and development efforts aimed at revolutionizing how humans and machines collaborate. Our mission is to engineer the next wave of AI-driven code reviewers that harness the strengths of both human ingenuity and sophisticated algorithms, resulting in software development processes that are not only efficient but also of superior quality.
The Role
As the Mid-Market Sales Manager, you will be at the forefront of driving and scaling revenue within our mid-market division. You will lead a dedicated team of Account Executives, enhance forecast accuracy, and implement effective sales strategies that yield consistent results.
This position balances hands-on execution with strategic optimization: mentoring your team to achieve immediate success while refining the systems that will fuel future growth.
What You’ll Do
Revenue Ownership
Take ownership of mid-market revenue targets and drive sustainable growth.
Ensure a robust pipeline and effective conversion rates at all stages.
Guide deals through late-stage negotiations, pricing, and closure.
Provide accurate forecasts on a weekly, monthly, and quarterly basis.
Develop new sales processes and programs to enhance pipeline and bookings.
Team Management
Lead and mentor a team of Mid-Market Account Executives.
Implement best practices in discovery, MEDDICC, and value-based selling techniques.
Conduct one-on-one meetings, pipeline assessments, deal evaluations, and forecasting discussions.
Collaborate with Enablement to recruit and onboard new Account Executives.
Process & Scale
Standardize and enhance the mid-market sales processes.
Work with RevOps to improve CRM management, dashboards, and key performance indicators.
Partner with the Marketing team to ensure high-quality leads and effective campaigns.
Collaborate with Customer Success to ensure smooth handoffs, identify expansion opportunities, and manage renewals.
Cross-Functional Impact
Provide actionable feedback to Product regarding roadmap gaps and buyer concerns.
Share competitive insights and positioning with go-to-market leadership.
Contribute to strategies for pricing, packaging, and territory management.
Join the Mercor Team
At Mercor, we stand at the dynamic intersection of labor markets and AI research. Collaborating with premier AI labs and enterprises, we empower the human intelligence that is crucial for AI's evolution.
Our expansive talent network plays a vital role in training cutting-edge AI models, akin to the way educators impart knowledge to their students, by sharing insights, experiences, and contextual understanding that code alone cannot convey. Currently, our network of over 30,000 experts generates more than $2 million daily.
We are pioneering a novel category of work where expertise fuels AI progress. Achieving this vision necessitates an ambitious, fast-paced, and deeply dedicated team. You will collaborate with researchers, operators, and AI firms that are at the forefront of transforming societal structures.
Mercor is a thriving Series C company with a valuation of $10 billion. We operate five days a week in-person at our new headquarters in San Francisco.
About the Role
As a Site Reliability Engineer (SRE) at Mercor, you will take ownership of production reliability for our critical systems, working closely with our infrastructure leadership. You will play a pivotal role in establishing our SRE function and defining how Mercor manages large-scale, high-availability systems.
Your Responsibilities
Ensure the reliability and safety of production for key shared services and customer-facing systems.
Collaborate directly with infrastructure leadership to outline SRE priorities, reliability benchmarks, and the production safety roadmap.
Enhance the structure of our production systems to ensure stability, resource efficiency, isolation, and observability.
Advocate for and implement modern SRE methodologies (e.g., incident management, postmortems, SLIs/SLOs) across engineering teams.
Work alongside engineering and applied AI teams to facilitate sustainable growth.
Promote SRE best practices internally, supporting teams in a safe, scalable, and consistent production onboarding process.
Who We Seek
The ideal candidate will have:
Extensive experience in genuine SRE roles (not merely operations) across various positions or organizations.
A deep understanding of SRE methodologies popularized by Google (e.g., error budgets, reliability vs. risk trade-offs, large-scale distributed systems).
5+ years of SRE experience; ideally, 15+ years in total experience for this inaugural SRE position.
A proven track record of managing systems at scale, with a strong grasp of the complexities involved.
Full-time|$214K/yr - $260K/yr|Hybrid|Hub - San Francisco
At Superhuman, we embrace a vibrant hybrid work model that offers our team members the ideal blend of focused individual work and collaborative in-person interactions, fostering trust, innovation, and a robust team culture.
About Superhuman
Superhuman, the AI productivity platform, is on a transformative mission to unlock the superhuman potential within everyone. With the integration of Grammarly's writing assistance and innovative tools like Coda’s collaborative workspaces and Go, our proactive AI assistant, we empower over 40 million individuals and 50,000 organizations globally. Founded in 2009, we strive to eliminate busywork and enhance productivity. Discover more at superhuman.com and explore our values here.
The Opportunity
To meet our ambitious goals, we are seeking a Site Reliability Engineer (SRE) to join our infrastructure team. This pivotal role focuses on developing software solutions to maintain the reliability of our back-end systems while collaborating with engineering teams to strategize our future growth. You will also engage with our production engineering teams in Europe as we transition from a “you build it, you own it” approach.
At Superhuman, our engineers and researchers enjoy the autonomy to innovate and drive breakthroughs, directly impacting our product roadmap. As we rapidly scale our interfaces, algorithms, and infrastructure, the complexity of our technical challenges is growing. Learn more about our technical endeavors on our technical blog.
As an SRE, your responsibilities will include:
Scaling our Kubernetes-based control plane that processes billions of events each day.
Enhancing our automation mechanisms to efficiently respond to workload demands.
Deploying machine learning systems across various departments.
About the Role
Join our dynamic team at allinbits as a Platform Engineer, where your expertise will be vital in designing and maintaining the robust infrastructure that supports our cutting-edge projects. Your role will combine technical acumen with strategic insight, ensuring our development and operational environments are finely tuned for optimal performance, reliability, and scalability.
We prioritize experience in our team, especially if you have transitioned from a developer role into DevOps or Site Reliability Engineering (SRE). Your capacity to innovate and construct resilient systems will prove invaluable.
In this position, you will utilize tools such as Ansible, Docker, and HashiCorp Nomad to enhance our operations.
Who We Are
At Hyperbolic Labs, we are committed to democratizing AI by removing barriers to computing power with our Open-Access AI Cloud. By aggregating global computing resources, we provide an innovative GPU marketplace and AI inference service that ensures both affordability and accessibility. As trailblazers at the convergence of AI and open-source technology, we envision a future where AI innovation is only limited by creativity, not by resource availability. We invite forward-thinking individuals who share our dedication to making AI universally accessible, secure, and affordable. Join us in crafting a platform that empowers innovators worldwide to realize their visionary AI projects.
In anticipation of our growth following our Series A funding, our team, guided by co-founders with advanced degrees in AI, Mathematics, and Computer Science, is set to transform the computing landscape.
About the Role
We are in search of a skilled Site Reliability Engineer to guarantee that Hyperbolic's GPU marketplace and AI infrastructure function with outstanding reliability, performance, and security. As an aggregator of computational resources from numerous global providers, our service level objectives (SLOs), trust, and economic efficiency are critical to our product. Your key responsibilities will include defining and maintaining service level objectives, developing resilient incident response protocols, managing capacity across our extensive GPU network, and implementing secure rollout and rollback mechanisms to ensure uninterrupted platform operation around the clock.
In this influential role, you'll set the reliability benchmarks that foster customer trust in our platform, design comprehensive monitoring and alerting systems for enhanced infrastructure visibility, automate capacity management and resource allocation processes, lead incident response and post-mortem evaluations, and collaborate closely with engineering teams to bolster system resilience. Security and infrastructure hardening will be paramount, necessitating strong isolation protocols between tenants and suppliers, the implementation of effective key management systems, and the establishment of compliance frameworks. This high-impact position will directly affect our ability to deliver on our commitment to providing affordable, accessible AI compute at scale.
Full-time|$214K/yr - $260K/yr|Hybrid|San Francisco, CA
At Superhuman, we embrace a flexible hybrid working model that combines focused work time with in-person collaboration, fostering trust, innovation, and a vibrant team culture.
About Superhuman
Superhuman, now part of Grammarly, is an AI productivity platform dedicated to unlocking the superhuman potential in everyone. Our suite of applications integrates AI with over 1 million tools and websites, offering innovative solutions such as Grammarly's writing assistance, Coda's collaborative workspaces, Mail's inbox management, and Go, our proactive AI assistant. Since our inception in 2009, we have empowered over 40 million individuals and 50,000 organizations worldwide, enabling them to eliminate busywork and focus on what truly matters. Discover more at superhuman.com and explore our values here.
The Opportunity
In pursuit of our ambitious goals, we are seeking a Site Reliability Engineer to enhance our infrastructure team. This pivotal role involves building software that ensures the reliability of our back-end systems while collaborating closely with our engineering teams. You will also help plan for our future growth as we shift from a “you build it, you own it” model.
Our engineers and researchers enjoy the freedom to innovate and influence our product roadmap, tackling increasingly complex technical challenges as we scale our systems. Learn more about our technical endeavors on our technical blog.
As a Site Reliability Engineer, your responsibilities will include:
Scaling our Kubernetes-based control plane, processing billions of events daily.
Enhancing our automation mechanisms in response to workload demands.
Deploying machine learning systems across the organization.
The Scaling team at OpenAI builds and maintains the core infrastructure that supports research efforts. This group focuses on enabling rapid progress toward Artificial General Intelligence by providing the systems and tools researchers rely on every day. Their work covers everything from foundational infrastructure to specialized applications, all designed to handle increasing complexity and scale without sacrificing reliability or ease of use.
Role overview
OpenAI is seeking a Site Reliability Engineer to manage and improve the infrastructure behind its analytics platform. This position centers on supporting production systems that handle data-intensive, low-latency workloads. Key technologies include large-scale ClickHouse clusters, high-throughput Kafka pipelines, and stable integrations with Snowflake. The engineer in this role will turn ambiguous operational challenges into concrete solutions, deliver improvements quickly, and iterate based on real-world feedback. Success in this role means independently setting and raising operational standards, working closely with production systems, and collaborating across teams to ensure reliability at scale.
Key responsibilities
Manage the full lifecycle of infrastructure: provisioning, upgrades, scaling, and decommissioning using Infrastructure as Code (IaC).
Operate and scale ClickHouse clusters, including sharding, replication, capacity planning, tuning, and maintenance.
Run Kafka as the primary data ingestion layer, improving throughput, managing lag and backpressure, and ensuring robust failure recovery.
Improve latency and reliability for workloads involving heavy data serving and querying.
Develop and maintain monitoring and alerting systems, including SLIs/SLOs, dashboards, alert policies, and actionable runbooks.
Create and refine incident response protocols, on-call procedures, and postmortem practices.
Oversee backup, restore, and disaster recovery strategies, including regular drills.
Plan and execute safe rollouts across development, staging, and production environments, using canary deployments and rollback plans.
Work daily with software engineers to embed reliability into system design, implementation, and release cycles.
Set and promote standards for operational readiness and runbooks, encouraging adoption across teams.
Enhance CI/CD pipelines and improve the developer experience for greater speed and safety.
Role overview
The Senior Site Reliability Engineer at Prosper plays a key role in maintaining and improving the reliability and performance of the company’s core systems. Collaboration with teams across the organization is essential to ensure services remain stable and efficient.
What you will do
Design and set up monitoring tools to track the health and performance of systems.
Automate routine operational tasks to minimize manual intervention and boost efficiency.
Diagnose and resolve complex technical problems that impact infrastructure or services.
Support projects aimed at strengthening infrastructure stability and preparing for future growth.
Location
This role is located in San Francisco, CA.
Join Our Team as a Site Reliability Engineer
Blaxel is seeking a highly skilled Site Reliability Engineer to enhance the reliability, performance, and scalability of our cutting-edge AI infrastructure platform.
In this role, you will develop and manage the essential systems that support scalable agentic AI. Your primary goal: maintain our ultra-low-latency, stateful, serverless compute engine, ensuring it remains robust as we handle billions of agent requests from the world's most advanced AI teams.
This position is deeply technical and execution-oriented. You will take charge of our reliability framework, encompassing observability, performance optimization, incident management, infrastructure health, and the automation processes that ensure seamless operations. We are looking for innovators who can design new reliability systems, advance automation capabilities, and continuously adapt the platform to accommodate next-generation AI workloads. If you are a builder who excels in managing critical infrastructure at scale, we want to hear from you.
Your Responsibilities
Working closely with our founders, infrastructure team, and development team, and leveraging AI for maximum efficiency, you will architect and manage the systems that keep Blaxel fast, resilient, and secure.
Design, operate, and iteratively enhance the core infrastructure that drives our 25ms cold-start compute engine.
Develop and refine our observability stack (metrics, traces, logs), ensuring proactive issue detection.
Establish, monitor, and drive SLOs/SLIs across vital system components to ensure world-class reliability.
Lead incident response with precision: conduct root cause analyses, post-mortems, and implement systemic solutions.
Design and deploy self-healing, automated operational systems to minimize manual work and scale operations.
Collaborate across compute, networking, storage, and sandboxed execution layers to optimize performance under intense workloads.
Create automation tools, often utilizing AI agents, to enhance operations, debugging, capacity planning, and failure predictions.
Test and stress our systems to their limits: engage in load testing, chaos engineering, and performance benchmarking.
Champion security best practices at the infrastructure level, from sandboxed compute to network isolation.
Collaborate with platform engineers to ensure reliability is an integral part of new features from inception.
Who You Are
Extensive technical expertise in site reliability engineering, with a passion for building scalable systems.
Full-time|On-site|San Francisco, California; Santa Clara, California; Seattle, WA
Join Carta as a Senior Site Reliability Engineer, where you will play a pivotal role in enhancing our infrastructure and ensuring the reliability of our platforms. You will work collaboratively with cross-functional teams to implement innovative solutions that drive operational excellence and scalability.
Join Our Team at EngFlow
EngFlow is revolutionizing the software development process by enabling developers to save valuable time in their build and test cycles. Our innovative cloud-based distributed service optimizes workflows through advanced remote execution and caching, significantly enhancing efficiency, productivity, and product quality.
Supported by esteemed investors, EngFlow is at the forefront of transforming how organizations develop software and deliver thoroughly tested products. Our solutions can accelerate builds by tenfold or more, and our observability platform provides crucial insights for ongoing optimization. Founded by leading contributors to Bazel, we create tools that empower engineering teams, from startups to Fortune 500 companies, to boost developer velocity and build performance.
Discover more about our mission, culture, and team: EngFlow | Watch Our Video
We are seeking a talented and experienced Site Reliability Engineer to join our dynamic engineering team. In this pivotal role, you will bridge the gap between software engineering and systems operations, ensuring our distributed infrastructure is highly available, performant, and scalable, thereby allowing our engineers to work swiftly and with confidence.
Jan 27, 2026