Experience Level
Senior
Qualifications
We are looking for candidates who possess the following qualifications:
Proven experience in Site Reliability Engineering or similar roles
Strong knowledge of cloud platforms and infrastructure management
Proficiency in programming and scripting languages
Experience with monitoring tools and CI/CD pipelines
Excellent problem-solving skills and a collaborative mindset
About the job
The Senior Site Reliability Engineer at prosper plays a key role in maintaining and improving the reliability and performance of the company’s core systems. Collaboration with teams across the organization is essential to ensure services remain stable and efficient.
What you will do
Design and set up monitoring tools to track the health and performance of systems
Automate routine operational tasks to minimize manual intervention and boost efficiency
Diagnose and resolve complex technical problems that impact infrastructure or services
Support projects aimed at strengthening infrastructure stability and preparing for future growth
Location
This role is located in San Francisco, CA.
About prosper
prosper is a forward-thinking company focused on delivering innovative solutions in the financial technology sector. Our team is dedicated to creating a dynamic and inclusive work environment where every employee can thrive and contribute to our mission of empowering individuals through technology.
Full-time|$204K/yr - $240K/yr|Hybrid|San Francisco, CA, United States
Who We Are
Samsara (NYSE: IOT) is a trailblazer in the Connected Operations™ Cloud, a platform that empowers organizations reliant on physical operations to leverage Internet of Things (IoT) data for actionable insights and operational improvements. Our mission at Samsara is to enhance the safety, efficiency, and sustainability of the physical operations that…
Full-time|$204K/yr - $240K/yr|Hybrid|San Francisco - SF9
About Us
Samsara (NYSE: IOT) is at the forefront of the Connected Operations™ Cloud, a groundbreaking platform that empowers organizations reliant on physical operations to leverage Internet of Things (IoT) data for actionable insights and operational improvements. We are committed to enhancing the safety, efficiency, and sustainability of the vital physical operations that drive the global economy. Covering over 40% of global GDP, our focus spans critical sectors such as agriculture, construction, field services, transportation, and manufacturing. We are eager to facilitate the digital transformation of these industries on a large scale.
Joining Samsara means you will play a pivotal role in shaping the future of physical operations, contributing to a diverse range of product solutions including Video-Based Safety, Vehicle Telematics, Apps and Driver Workflows, and Equipment Monitoring. As part of a recently public company, you will enjoy the freedom and support necessary to make a significant impact as we lay the groundwork for long-term success.
Position Overview:
We are seeking a Senior Product Design Engineer II to join our dynamic team. In this role, you will oversee the industrial design, architecture, and engineering of one or more Samsara products from their initial concept to mass production. Your design process will be driven by data-derived insights from our telemetry data cloud, combined with a profound, hands-on understanding of our customers, gained through direct engagement at their sites and with their equipment.
Throughout the product development lifecycle, you will collaborate closely with Product Management, Electrical Engineering, Firmware, Engineering Project Management, and Hardware leadership to make informed decisions that balance functionality, cost, installation, usability, and aesthetics. Samsara’s Hardware Product Design teams work hand-in-hand with Operations and Supply Chain teams, external laboratories, JDM design resources, and an expanding global supply network. Together, you will deliver outstanding products on schedule and within budget, ensuring that Samsara continues to lead the industry in innovative product design.
This position is available to candidates residing in the US. It is a hybrid role requiring 3 days per week in our San Francisco office and 2 days of remote work.
About Gridware
Gridware is an innovative technology firm headquartered in San Francisco, committed to safeguarding and enhancing the reliability of the electrical grid. We have pioneered a revolutionary approach to grid management known as Active Grid Response (AGR), which meticulously monitors the electrical, physical, and environmental factors influencing grid safety and reliability. Our state-of-the-art AGR platform leverages high-precision sensors to identify potential issues at an early stage, facilitating proactive maintenance and fault resolution. This holistic strategy is designed to bolster safety, minimize outages, and ensure optimal grid performance. We are proud to be supported by prominent climate-tech and Silicon Valley investors. To learn more, visit www.Gridware.io.
About the Role
We are seeking a skilled Senior Hardware Reliability Engineer to lead reliability testing, analysis, and lifetime modeling of various outdoor electronic assemblies. This pivotal role will concentrate on the electronic components of our products, collaborating closely with our mechanical-focused Reliability Engineer and engaging with the broader hardware and cross-functional teams.
Internship|$76.6K/yr - $128.8K/yr|On-site|San Francisco - SF9
About Us
Samsara (NYSE: IOT) is revolutionizing the Connected Operations™ Cloud, a transformative platform that empowers organizations reliant on physical operations to leverage Internet of Things (IoT) data for actionable insights and operational enhancements. Our mission is to enhance the safety, efficiency, and sustainability of the physical operations that drive our global economy. Covering over 40% of global GDP, these essential industries include agriculture, construction, field services, transportation, and manufacturing. We are thrilled to digitally transform their operations at scale.
Joining Samsara means you will play a pivotal role in shaping the future of physical operations, contributing to a diverse range of innovative product solutions such as Video-Based Safety, Vehicle Telematics, Driver Workflow Applications, and Equipment Monitoring. As part of a publicly traded company, you will enjoy the autonomy and support needed to make a significant impact while we build for the long term.
About Multiply Labs
Multiply Labs is an innovative startup located in San Francisco, California, backed by renowned investors in technology and life sciences such as Casdin Capital, Lux Capital, and Y Combinator. Our goal is to develop the world's leading robotic systems and utilize them to make groundbreaking life-saving therapies accessible to everyone.
We are transforming the manufacturing process of cell therapies through the creation of advanced robotic systems that automate and scale the production of these crucial treatments. Our cutting-edge robots enable biopharma companies to produce cell therapies efficiently without overhauling their existing processes, thus minimizing regulatory hurdles and risks. Unlike traditional methods that are labor-intensive and costly (often exceeding $1M per patient), our robotic solutions aim to make these vital treatments more affordable and reachable for those who need them.
To discover more and view our robots in action, please visit www.multiplylabs.com and follow us on LinkedIn.
Position Overview
We are looking for a dedicated Hardware Reliability Engineer to become an essential part of Multiply Labs’ Reliability Engineering team. As a founding member, you will collaborate closely with the Hardware Product and Systems Integration teams to enhance our designs throughout the entire development lifecycle, from initial prototypes to fully deployed GMP production systems. Your contributions will directly support the delivery of life-saving therapies by ensuring our robots operate seamlessly within the high-stakes biotech environment.
Full-time|$124.1K/yr - $208.5K/yr|Hybrid|San Francisco - SF9
Who We Are
Samsara (NYSE: IOT) is at the forefront of the Connected Operations™ Cloud, a transformative platform that empowers businesses reliant on physical operations to tap into Internet of Things (IoT) data. Our aim is to provide actionable insights that enhance safety, efficiency, and sustainability across vital industries such as agriculture, construction, transportation, and manufacturing. By digitally transforming these sectors, which represent over 40% of global GDP, we are contributing to a more efficient and sustainable economy.
Joining Samsara means being part of a team that is defining the future of physical operations. You will engage in cutting-edge solutions, including Video-Based Safety, Vehicle Telematics, and Equipment Monitoring, within a supportive environment that fosters innovation and long-term impact.
About the Role:
We are seeking a Senior Hardware Systems Engineer to enhance our rapidly expanding product line. Your primary responsibility will involve leading the electrical engineering components of product architecture and design, grounded in comprehensive feasibility, design, and cost analyses. This encompasses critical aspects such as component selection, thermal management, and antenna design. You will leverage extensive telemetry and direct customer insights to inform and refine our product designs. Collaborating closely with Product Management, Firmware, and Hardware leadership, you will influence key engineering decisions while mentoring fellow engineers. The role will also require interaction with our US and Taiwan EE teams, as well as our Supply Chain and laboratory resources, to achieve our project goals effectively.
This role is hybrid, requiring you to be in our San Francisco, CA office three days a week, with the flexibility to work remotely for two days. Travel may be necessary up to 25% of the time, and proximity to an international airport is essential.
We offer relocation assistance for this position and welcome candidates from across the U.S. who are willing to relocate to the Bay Area.
Full-time|On-site|San Francisco, California; Santa Clara, California; Seattle, WA
Join Carta as a Senior Site Reliability Engineer, where you will play a pivotal role in enhancing our infrastructure and ensuring the reliability of our platforms. You will work collaboratively with cross-functional teams to implement innovative solutions that drive operational excellence and scalability.
Full-time|$166.9K/yr - $225.9K/yr|Hybrid|Hybrid - San Francisco
Drata helps organizations demonstrate their commitment to security and integrity. The platform supports companies as they build and maintain trust with users, customers, partners, and prospects.
Values
Built on Trust: Consistency shapes decisions and actions.
Integrity: Choosing to do what is right, every time.
Customer-Obsessed: Prioritizing customer needs above all else.
Competitive Fire: Striving for higher standards and greater achievements.
Diversity: Welcoming different perspectives to encourage creative solutions.
Automation First: Pursuing efficiency by saving time and resources wherever possible.
How the Team Works
Drata blends high standards with a supportive environment focused on growth. Team members are encouraged to own their work, improve continuously, and deliver meaningful results. The company values quick, informed decisions that drive immediate impact, while always keeping the mission and customer needs at the center.
The San Francisco-based team uses a hybrid work model. Colleagues collaborate in the office Tuesday through Thursday, focusing on alignment and innovation. Mondays and Fridays offer flexibility for deep work or personal needs.
Growth and Culture
Drata has expanded to over 600 professionals worldwide, recognized for a culture that values trust, speed, and continuous learning. The environment supports both personal and professional development.
See the Speed: CEO Adam Markowitz discusses Drata’s rapid journey to $100M ARR in four years.
Hear the Voice of the Team: Employee stories highlight collaboration and growth at Drata.
Who We Are
At Hyperbolic Labs, we are committed to democratizing AI by removing barriers to computing power with our Open-Access AI Cloud. By aggregating global computing resources, we provide an innovative GPU marketplace and AI inference service that ensures both affordability and accessibility. As trailblazers at the convergence of AI and open-source technology, we envision a future where AI innovation is only limited by creativity, not by resource availability. We invite forward-thinking individuals who share our dedication to making AI universally accessible, secure, and affordable. Join us in crafting a platform that empowers innovators worldwide to realize their visionary AI projects.
In anticipation of our growth following our Series A funding, our team, guided by co-founders with advanced degrees in AI, Mathematics, and Computer Science, is set to transform the computing landscape.
About the Role
We are in search of a skilled Site Reliability Engineer to guarantee that Hyperbolic's GPU marketplace and AI infrastructure function with outstanding reliability, performance, and security. As an aggregator of computational resources from numerous global providers, our service level objectives (SLOs), trust, and economic efficiency are critical to our product. Your key responsibilities will include defining and maintaining service level objectives, developing resilient incident response protocols, managing capacity across our extensive GPU network, and implementing secure rollout and rollback mechanisms to ensure uninterrupted platform operation around the clock.
In this influential role, you'll set the reliability benchmarks that foster customer trust in our platform, design comprehensive monitoring and alerting systems for enhanced infrastructure visibility, automate capacity management and resource allocation processes, lead incident response and post-mortem evaluations, and collaborate closely with engineering teams to bolster system resilience. Security and infrastructure hardening will be paramount, necessitating strong isolation protocols between tenants and suppliers, the implementation of effective key management systems, and the establishment of compliance frameworks. This high-impact position will directly affect our ability to deliver on our commitment to providing affordable, accessible AI compute at scale.
Full-time|$172K/yr - $209K/yr|On-site|San Francisco, CA - US
At Crusoe, our mission is to propel the availability of energy and intelligence. We are designing the engine that fuels a future where individuals can ambitiously innovate with AI, all while upholding standards of scale, speed, and sustainability.
Join us in the AI revolution powered by sustainable technology at Crusoe. Here, you will spearhead significant innovations, make a lasting impact, and collaborate with a team that is leading the charge in responsible, transformative cloud infrastructure.
About This Role:
We are on the lookout for a Hardware Production / Sustaining Engineer to enhance Crusoe’s Hardware Systems Engineering team and address critical skill gaps in debugging, validation, and production support of high-performance computing systems. In this role, you will oversee the entire hardware lifecycle, from prototype initiation to mass production, while driving automation, resolving intricate issues, and ensuring reliability across Crusoe Cloud’s GPU- and CPU-based infrastructure.
You will collaborate closely with cross-functional teams to support, debug, and optimize hardware platforms at scale, with a specific focus on PCIe, InfiniBand, and NVMe/storage, which are recognized as vital areas for enhanced expertise. Your contributions will significantly influence Crusoe’s capability to deploy and manage sustainable, AI-first computing systems that deliver world-class performance and reliability.
What You’ll Be Working On:
Lead the entire hardware development and sustaining lifecycle, encompassing feasibility, bring-up, validation, deployment, and ongoing production support.
Create and maintain scripting and automation frameworks for hardware testing, diagnostics, and continuous reliability enhancements.
Guide deep troubleshooting and debugging across:
PCIe (link training, topology, performance issues)
InfiniBand (fabric debugging, throughput, connectivity issues)
NVMe/storage (performance bottlenecks, firmware interactions, failure analysis)
Perform thorough system validation and characterization for GPU, CPU, and high-performance computing platforms.
Assist in end-to-end integration and solution testing to guarantee that Crusoe Cloud products fulfill performance, reliability, and scalability standards.
Work in tandem with mechanical, thermal, firmware, software, and manufacturing teams to resolve system-level challenges.
Full-time|Remote|San Francisco, CA or Remote (USA)
Join fieldguide as a Senior Site Reliability Engineer, where you will play a pivotal role in ensuring the reliability and performance of our systems. You will collaborate with a talented team to design, implement, and maintain infrastructure solutions that are robust and scalable.
Your expertise in both software development and systems engineering will be essential to enhancing our operational frameworks. This position allows for both on-site work in San Francisco and remote working opportunities across the United States.
Full-time|$208K/yr - $253K/yr|On-site|San Francisco, CA - US
At Crusoe, our mission is to drive the evolution of energy and intelligence. We are developing the technology that fuels a future where individuals can ambitiously harness AI capabilities without compromising on scale, speed, or sustainability.
Join us in revolutionizing AI with sustainable solutions at Crusoe. In this role, you will be at the forefront of innovation, making a significant impact while collaborating with a team that is shaping the future of responsible and transformative cloud infrastructure.
About This Role:
We are looking for a dedicated Hardware Production/Sustaining Engineer to enhance Crusoe's Hardware Systems Engineering team. This position is critical for bridging essential skill gaps in debugging, validation, and production support for high-performance computing systems. You will manage the entire hardware lifecycle, from prototype initiation to large-scale production, focusing on automation, deep troubleshooting, and reliability within Crusoe Cloud’s GPU- and CPU-oriented infrastructure.
Your collaboration with cross-functional teams will be vital in supporting, debugging, and enhancing hardware platforms on a large scale, specifically targeting PCIe, InfiniBand, and NVMe/storage, which have been highlighted as key areas for expanded expertise. Your contributions will directly influence Crusoe’s capability to deploy and maintain sustainable, AI-driven computing systems that deliver exceptional performance and reliability.
Your Responsibilities Will Include:
Leading the complete hardware development and sustaining lifecycle, encompassing feasibility studies, bring-up, validation, deployment, and ongoing production support.
Creating and sustaining automation frameworks and scripts for hardware testing, diagnostics, and continual reliability enhancements.
Executing in-depth troubleshooting and debugging across:
PCIe (including link training, topology, and performance issues)
InfiniBand (focusing on fabric debugging, throughput, and connectivity challenges)
NVMe/storage (addressing performance bottlenecks, firmware interactions, and failure analyses)
Performing extensive system validation and characterization for GPU, CPU, and high-performance computing platforms.
Assisting in end-to-end integration and solution testing to guarantee that Crusoe Cloud products fulfill performance, reliability, and scalability standards.
Collaborating with teams across mechanical, thermal, firmware, software, and manufacturing domains to troubleshoot and enhance system performance.
Company Overview
Echo Neurotechnologies is an innovative startup at the forefront of Brain-Computer Interface (BCI) technology. We are committed to creating advanced hardware solutions powered by AI, aimed at restoring autonomy for individuals with disabilities and enhancing their quality of life.
Team Culture
Become a part of our dynamic team of passionate and skilled professionals. We thrive in a collaborative environment, where you will have the opportunity to take charge of pivotal decisions that shape our future. We prioritize continuous learning and development, encouraging contributions that drive our collective success.
Position Summary
We are looking for a Senior Hardware Engineer with expertise in Mechanical Engineering to validate our cutting-edge Echo hardware systems. You will evaluate custom hardware devices and subsystems, while also spearheading the development and execution of specialized test systems for design verification.
Key Responsibilities
Design and prototype mechanical components and assemblies, including rapid prototyping, machining, and injection molding.
Develop electromechanical test systems to characterize and assess hardware devices.
Create test protocols, implement design verification testing, and manage vendor testing processes.
Analyze test data, produce technical reports, and supervise vendor test reports.
Generate component and assembly drawings, including tolerance stack-ups and analyses.
Plan and conduct design verification activities.
Qualifications
Bachelor's degree in Mechanical Engineering or a related field.
A minimum of 7 years of professional experience in engineering electro-mechanical hardware devices.
Proficient in hands-on machining and rapid prototyping techniques.
Experience in data analysis from physical systems.
Familiarity with quality systems and standards.
Preferred Qualifications
Master’s degree in Mechanical Engineering or a related field.
Strong analytical skills and attention to detail.
Ability to work collaboratively in a fast-paced environment.
Full-time|$200K/yr - $250K/yr|On-site|San Francisco
Agency Notice: We are not currently collaborating with recruiting agencies for this role. We kindly ask that you refrain from contacting Vizcom employees regarding this position. Any resumes submitted without prior agreement will be considered unsolicited.
About Vizcom
Vizcom is a cutting-edge visual creation platform that merges advanced web tooling with AI-driven workflows. Our technology stack incorporates React/TypeScript for the front end, Node/Koa + PostGraphile for API services, PostgreSQL, Redis, BullMQ for queuing, and a Kubernetes-based production infrastructure.
We are seeking a seasoned expert to oversee platform stability and infrastructure, ensuring our system remains reliable, efficient, and resilient as we scale.
Role Mission
Take full ownership of service reliability: proactively prevent incidents, minimize impact during failures, and guide swift, high-quality recovery during production downtimes.
This role involves hands-on technical leadership, granting you the authority to establish reliability standards and enforce production protocols.
Compensation
Base salary between $200,000 and $250,000, plus significant equity.
Your Responsibilities
Reliability Standards: Define and uphold SLIs/SLOs/error budgets for key user interactions.
Resilience of Production Architecture: Implement failure isolation across APIs, workers, queues, and interdependencies to ensure one subsystem's failure does not disrupt core access.
Kubernetes Runtime Reliability: Establish probe contracts, deployment standards, graceful shutdown protocols, scaling/resource policies, and startup safety measures.
Queue & Job Safety (BullMQ/Redis): Manage poison pill containment and workload segregation.
Incident Command Quality: Lead Sev1/Sev2 incident responses from containment to corrective actions.
Reliability Operating System: Oversee observability quality (prioritizing signal over noise), on-call efficiency, runbook maintenance, and postmortem discipline.
Deployment Safety Authority: Gate risky deployments and enforce reliability protocols whenever production health is compromised.
Join Us at Humble Robotics
At Humble Robotics, we are pioneers in revolutionizing ground transportation with our innovative autonomous, zero-emissions hauler. Our cutting-edge vision-based AI technology is designed to optimize global logistics networks and significantly reduce freight costs.
As part of our dynamic and passionate team, you will work alongside industry veterans and creative thinkers. We believe that while culture cannot be engineered, when it aligns, it creates an extraordinary journey.
Experience the thrill of progress like never before.
Role Overview
As a Senior Hardware Validation Engineer, you will lead the design verification (DV) process for our autonomous hauler from start to finish. Your responsibilities will include defining vehicle-level DV requirements and translating them into specific requirements for each module, ensuring compliance with thermal, shock, vibration, EMC, environmental, and electrical stress standards. You will determine whether to conduct tests at external laboratories or develop in-house hardware-in-the-loop (HIL) and environmental test setups, guaranteeing that every component we deliver meets the rigorous DV standards derived from the vehicle.
About Plaud Inc.
Plaud is revolutionizing the way professionals enhance productivity and performance with our trusted AI work companion. Our innovative note-taking solutions have gained the admiration of over 1,500,000 users globally since our inception in 2023. We are on a mission to amplify human intelligence by developing next-generation intelligence infrastructure and interfaces that seamlessly capture, extract, and leverage what you say, hear, see, and think.
Based in San Francisco, Plaud Inc. is a Delaware-incorporated company that is redefining the boundaries of human-AI collaboration through a unique combination of hardware and software solutions. We adhere to the highest standards of data security and privacy protection, with certifications including ISO 27001, ISO 27701, GDPR, SOC 2, HIPAA, and EN 18031 compliance.
Discover more about our innovative solutions by visiting https://www.plaud.ai and follow us on Instagram, X, Facebook, LinkedIn, and YouTube.
Why You Should Join Us
At Plaud, you will play a pivotal role in shaping the future of human-AI interaction. Here’s what we offer:
A thriving, bootstrapped company with a remarkable $250M revenue run rate achieved in just three years.
An opportunity to define the next-generation paradigm for human-AI interaction.
Direct exposure to cutting-edge AI tools for professionals and a chance to contribute to our global expansion.
Collaborate with a passionate team that values innovation, teamwork, and customer success.
Advance your career in a culture that promotes continuous learning and rapid career growth.
About Sieve
Sieve stands as a pioneering AI research lab dedicated solely to video data. Our innovative approach integrates exabyte-scale video infrastructure with state-of-the-art video understanding techniques and a myriad of data sources, creating unparalleled datasets that redefine video modeling. With video accounting for 80% of global internet traffic, it has become the vital digital medium fueling creativity, communication, gaming, AR/VR, and robotics. At Sieve, we aim to eliminate the most significant bottleneck hindering the expansion of these applications: access to high-quality training data.
With strategic partnerships with leading AI labs, our team of just 12 has achieved remarkable financial success, generating $XXM last quarter alone. Earlier this year, we secured Series A funding from elite firms including Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.
About the Role
As we process petabytes of video across numerous nodes and cloud environments, ensuring reliability, observability, and security is essential to our growth.
We are seeking our inaugural Reliability Engineer, who will focus entirely on fortifying the infrastructure that underpins Sieve. This role demands high ownership and a deep understanding of:
System throughput and stability
Monitoring and incident management
Security principles, including least-privilege design
Minimizing operational burdens for the entire engineering team
You will collaborate closely with our CTO and founding engineers to develop the foundational tools that empower our engineering efforts.
This position is ideal for an engineer who is passionate about reliability, throughput, observability, and security. You are proactive in anticipating potential failure modes, reducing operational risks, and designing resilient systems.
If a system failure occurs, you take it personally, thriving under the weight of responsibility.
What You'll Be Doing
Collaborate with engineering to design and validate infrastructure supporting PB-scale workloads
Develop and manage Terraform-based multi-cloud deployments
Enhance cloud and data security (SSO, IAM, least privilege access, auditability)
Lead incident response efforts and strengthen systems against failures
Create CI/CD systems to minimize user errors and maximize safety
Establish monitoring and alerting frameworks (Prometheus, OpenTelemetry, VictoriaMetrics)
Full-time|$144K/yr - $258K/yr|On-site|San Francisco
At Braze, we pride ourselves on cultivating a team that is genuinely approachable, exceptionally kind, and intensely passionate about what we do.
We aim to fuel this passion by establishing high standards, promoting teamwork, and fostering a harmonious work-life balance as we collectively navigate rapid global growth, all while striving for greater equity and opportunity both within and outside our organization.
To thrive in our environment, you should be prepared to hold yourself and those around you to high standards. There are always opportunities for contribution: acting with autonomy, taking accountability, and being open to new perspectives are fundamental to our ongoing success.
Our deep curiosity and eagerness to share diverse passions with one another enrich our culture with a unique vibrancy.
If you are motivated to tackle exciting challenges and have a proactive mindset amid change, you will be empowered to make a significant impact here, backed by a sharp and passionate team. If Braze sounds like the right fit for you, we look forward to meeting you!
WHAT YOU'LL DO
As a Site Reliability Engineer (SRE), you will be responsible for ensuring the smooth operation of all internal-facing services and platforms, ultimately guaranteeing site uptime. SREs integrate the roles of system administrators and software engineers, applying sound engineering principles, operational discipline, and mature automation techniques to the infrastructure services we deliver. Our expertise spans systems such as networking, the Linux kernel, and specialized interests in scaling algorithms or distributed systems.
Our team plays a crucial role in enhancing automation, infrastructure reliability, and empowering Braze’s engineering teams to leverage the infrastructure products and platforms we develop with ease. Braze operates at a massive scale, supporting over 3.3 billion monthly active users across our customers, processing hundreds of billions of data points each month, and delivering billions of messages to end-users daily. Our diverse technology stack includes Ruby on Rails, MongoDB, Redis, Kafka, Kubernetes, and more. As a Senior Site Reliability Engineer at Braze, you will collaborate with your team and consumer engineering groups to continually enhance the infrastructure, automation, and tooling that power our internal products built on these technologies.
Main responsibilities:
Collaborate with Braze’s engineering teams to:
Design products that effectively utilize infrastructure platforms in a scalable and reliable manner
Troubleshoot reliability and scalability issues across all layers of the stack, including products built on our infrastructure platforms
Implement monitoring solutions and improve overall system performance...
Why Choose Flux?
At Flux, we are transforming the hardware landscape by creating the world's first AI Hardware Engineer. Our mission is to democratize access to cutting-edge hardware development and revolutionize global electronics design and manufacturing.
About the Opportunity
As a DevOps Engineer at Flux, you will be integral in ensuring the smooth operation of our innovative platform. Your work will encompass a wide range of full-stack systems, impacting various aspects of our service, including billing, authentication, onboarding, and seamless integrations.
Your contributions will directly influence user experience, and your role will be crucial in maintaining operational efficiency as Flux continues to scale.
Key Responsibilities
Enhance the reliability, availability, and operational health of our production systems.
Establish observability standards across services, including metrics, logs, and error tracking.
Define Service Level Objectives (SLOs) and Service Level Indicators (SLIs) while implementing effective alerting strategies.
Collaborate with engineering teams to design robust systems and proactively mitigate operational risks.
Develop internal tools to enhance system safety, debugging capabilities, and developer productivity.
Manage infrastructure using Pulumi across GCP, AWS, and Firebase.