Qualifications
We are looking for candidates who:
Possess expertise in Java and a strong understanding of SRE practices.
Have a passion for messaging platforms, including their setup, monitoring, and maintenance.
Demonstrate effective collaboration with infrastructure teams.
Are eager to take on complex challenges and learn from experiences.
Value diversity and contribute positively to our inclusive culture.
About the job
Join our dynamic team at PIMCO, a premier global asset management firm with a commitment to helping millions of investors achieve their financial aspirations. With over 3,000 employees across 20 offices in 15 countries, we seek innovative thinkers who thrive in a collaborative environment. At PIMCO, we value diversity, hard work, and a continuous learning ethos.
As a Java Site Reliability Engineer (SRE) specializing in Messaging Platforms, you will play a critical role in shaping our technology strategies to enhance operational efficiency. Your responsibilities will include supporting various messaging platforms such as MQ, AMPS, and Kafka, ensuring optimal tool selection and sustainable messaging strategies. You will also focus on improving operational efficiency through advanced tools and monitoring systems.
This position requires a passion for messaging systems, collaborative problem-solving skills, and a strong foundation in software development. You will have the opportunity to contribute to critical business solutions that align with our strategic vision for trading applications.
About PIMCO
PIMCO is a globally recognized leader in asset management, dedicated to helping investors achieve their financial goals through innovative solutions and a robust collaborative culture. We believe in the power of diverse perspectives and continuous improvement in technology and processes.
Full-time|On-site|Austin, TX/Akron, Ohio/Irvine, CA
Join Restaurant365 as a Site Reliability Engineer II, where you'll play a vital role in ensuring the availability, performance, and reliability of our systems. You will collaborate with cross-functional teams to design, implement, and maintain robust infrastructure solutions that enhance our operational efficiency.
For over 25 years, Realtor.com® has stood as the premier online platform trusted by real estate professionals, seamlessly connecting buyers, sellers, and renters with invaluable insights and expert advice to discover their ideal home. Our comprehensive suite of tools not only transforms the real estate landscape, but also aids consumers in navigating one of life's most significant decisions, making it simple, intuitive, and empowering.
Join us in our mission to enable more individuals to find their way home by dismantling barriers, fostering meaningful connections, and instilling confidence with expert guidance.
About the Role
We are on the lookout for a Staff Site Reliability Engineer to become a vital member of our newly established Operations Excellence organization, reporting directly to the Director of Operations Excellence. This pivotal position will define the reliability, observability, and operational excellence of our platform infrastructure that serves millions of users. As a Staff SRE, you will take on a technical leadership role, mentoring others and establishing best practices, while influencing architectural decisions to empower our team of 600+ engineers in delivering outstanding customer experiences.
You will engage with crucial platform systems, including EKS infrastructure, Skyway (CI/CD), Frontdoor (Tyk API Gateway), Pantheon (Apollo GraphQL Federation), and our observability stack, all while implementing chaos engineering practices and spearheading cost optimization initiatives that yield measurable ROI.
We are committed to employing the best tools to expedite problem-solving. You will be expected to adeptly utilize AI coding assistants and LLMs to enhance development speed, generate boilerplate code, and troubleshoot intricate debugging scenarios. In addition to basic usage, this role demands the critical judgment to assess AI-generated outputs for security, performance, and accuracy. You should be comfortable incorporating AI tools into your daily tasks to minimize repetitive work, allowing you to concentrate on high-impact architectural and strategic engineering challenges.
What You'll Do
Platform Reliability & Infrastructure
Design and maintain highly available AWS infrastructure, including EKS clusters, Fargate (ECS), and multi-region architectures.
Take ownership of the reliability of essential services: Skyway (CI/CD), Frontdoor (Tyk), Pantheon (Apollo GraphQL), and associated infrastructure.
Establish SLIs, SLOs, and error budgets for Tier 1/2/3 systems; lead architectural reviews focused on reliability and cost-efficiency.
Drive...
In 2024, cybercrime rates are anticipated to escalate, as evidenced by the FBI's IC3 report, which highlighted a staggering loss of over $16 billion. The real estate sector, unfortunately, remains a prime target for cybercriminals, particularly through investment fraud and BEC scams. At CertifID, we are committed to combating this threat by offering a secure platform that authentically verifies the identities of transaction participants, validates wire transfer instructions, and identifies potential fraud attempts. Our innovative technology is engineered to reduce risks, ensuring that every transaction is executed with utmost confidence and security.
Our success hinges on our exceptional team. Recognized as one of the Best Startups to Work in Austin, we proudly made the Inc. 5000 list and received the award for Best Culture by Purpose Jobs for three consecutive years. Our core values and vision for a world without wire fraud guide us as we strive to create a dynamic work environment where every team member can make a significant impact in enhancing security and combating fraud.
Position Overview:
We are on the lookout for a Senior Site Reliability Engineer (Senior SRE) to spearhead reliability enhancements within our production SaaS environment. You will play an essential role in developing scalable infrastructure models, advancing our observability efforts, optimizing incident response, and collaborating with engineering teams to integrate reliability into system design and deployment.
This position is tailor-made for a seasoned Senior SRE who thrives on tackling intricate operational challenges, building automation solutions, and mentoring fellow engineers.
Full-time|$110K/yr - $128K/yr|Hybrid|Austin, Texas, United States
Join Striveworks as a Site Reliability Engineer
At Striveworks, we empower organizations to leverage artificial intelligence to tackle real-world challenges in national security and business. Our mission is to serve as the command center where data, models, and business outcomes converge.
Founded by a team of passionate data scientists and engineers, Striveworks simplifies the journey from deployment to ongoing optimization. We ensure that our clients aren’t just deploying AI; they’re establishing robust systems that are reliable, adaptable, and poised to scale in an ever-changing landscape.
As a Site Reliability Engineer, you will play a pivotal role in implementing and managing corporate systems from day one. You will work with an array of systems and infrastructure automation tools while having the opportunity to innovate and enhance our toolset. Your focus will be on developing sustainable solutions that prevent future issues, thereby minimizing operational toil.
Your daily responsibilities will include:
Developing and maintaining infrastructure as code across private (OpenStack) and commercial (AWS, Azure, GCP) cloud environments.
Creating configuration management automation for Windows laptops and Linux servers.
Providing comprehensive user support for all corporate systems.
This role is based in a hybrid/on-site setting at our northwest Austin office.
Site Reliability Engineer Overview: Join Weedmaps as a Site Reliability Engineer and collaborate with diverse teams across application development, infrastructure, and quality assurance to elevate the performance, reliability, and scalability of our web services at Weedmaps.com. As a fully cloud-native organization, we operate all our services within Docker containers on Kubernetes, hosted on AWS. Our culture promotes observability, proactive monitoring, and CI/CD automation, enabling us to release multiple production updates daily. In this role, you will utilize your engineering expertise to improve system monitoring, streamline CI workflows, and refine our deployment pipelines. You will serve as a knowledge resource for development teams, guiding them in utilizing standardized tools for metrics, logging, and deployment processes. Collaborate closely with both development and infrastructure teams to identify key service metrics that go beyond the basics, working with application teams to develop libraries that facilitate easy instrumentation of their services. Your Impact: Collaborate with stakeholders to establish best practices in monitoring and CI/CD pipelines. Troubleshoot issues within our deployment CI pipeline. Promote and support a strong DevOps culture within Weedmaps. Identify automation opportunities and advocate for codification across all processes. Share best practices regarding collaboration, reliability, security, and performance with all partner teams. Take responsibility for the configuration and scaling of applications, ensuring adherence to organizational practices. Develop and enhance synthetic monitoring workflows.
As the leading online platform for real estate professionals for over 25 years, Realtor.com® connects buyers, sellers, and renters with trusted insights and expert guidance to find their ideal home. Our comprehensive suite of tools significantly impacts the real estate industry and enhances the consumer experience, making it simple, understandable, and empowering for individuals navigating one of life's biggest purchases.
Join us in our mission to help people find their way home by dismantling barriers to entry, establishing the right connections, and fostering confidence through expert guidance.
About the Role
We are looking for a Senior Site Reliability Engineer to become a crucial member of our newly established Operations Excellence organization, reporting directly to the Director of Operations Excellence. In this pivotal role, you will enhance the reliability, observability, and operational excellence of our platform infrastructure that serves millions of users. As a Senior SRE, you will be a key technical contributor, implementing best practices, addressing complex challenges, and empowering our team of over 600 engineers to deliver outstanding customer experiences.
Your responsibilities will include working on critical platform systems such as EKS infrastructure, Skyway (CI/CD), Frontdoor (Tyk API Gateway), Pantheon (Apollo GraphQL Federation), and our observability stack. You will also play a part in chaos engineering practices and cost optimization initiatives, ensuring measurable ROI.
We believe in employing the best tools to solve problems efficiently. You will be expected to adeptly use AI coding assistants and LLMs to accelerate development speed, generate boilerplate code, and resolve complex debugging issues. Beyond mere usage, this role demands the critical judgment to evaluate AI-generated outputs for security, performance, and accuracy. You should be comfortable incorporating AI tools into your daily routines to reduce repetitive tasks, allowing you to concentrate on high-impact architectural and strategic engineering challenges.
What You'll Do
Platform Reliability & Infrastructure
Design, implement, and maintain highly available AWS infrastructure, including EKS clusters, Fargate (ECS), and multi-region architectures.
Ensure the reliability of essential services: Skyway (CI/CD), Frontdoor (Tyk), Pantheon (Apollo GraphQL), and their supporting infrastructure.
Monitor SLIs, SLOs, and error budgets for Tier 1/2/3 systems; participate in architectural reviews focused on reliability and cost-efficiency.
Implement reliability patterns such as circuit breakers, graceful degradation, and automatic failover strategies.
At BetterUp, we believe in the power of human transformation, and our approach to the employer-employee relationship reflects that belief.
From the moment you engage with us, you will notice a distinct experience. It's not just about filling a position; it's about joining a mission-driven team.
Upon accepting an offer, you gain more than just a paycheck: you will receive a dedicated BetterUp Coach, a personalized development plan, and a supportive manager. You'll also be part of an extraordinary team, each member accompanied by their own BetterUp Coach, working on projects that make a real impact.
This unique environment fosters a focused and fulfilling work experience. While it may not be for everyone, for those who are passionate and driven, this role represents a transformative career opportunity.
Join us for an intense and rewarding journey, where you'll engage in meaningful work within a vibrant and creative culture.
If this resonates with you and the job description aligns with your skills, let’s start a conversation.
As a hybrid company, we emphasize in-person collaboration when necessary. Employees must be available to work from one of our office hubs a minimum of two days per week, totaling eight days per month. Our US hubs include: Austin, TX; Chicago, IL; New York City, NY; San Francisco, CA; and the Washington, DC metro area. For roles based in Europe, our hubs are located in London, UK, and Amsterdam, NL. Please ensure you can commit to this structure before applying.
Key Responsibilities:
Utilize AI-driven tools and automation to enhance monitoring, troubleshooting, and maintenance of production systems.
Develop and manage cloud infrastructure on AWS, employing Terraform for codifying and version-controlling our environments.
Oversee and scale Kubernetes clusters that support BetterUp's platform, ensuring optimal availability and performance.
Create intelligent alerting and observability frameworks.
Collaborate with engineering teams to integrate reliability into the development lifecycle, proactively addressing operational concerns.
Automate incident response processes and establish self-healing infrastructure.
Explore and implement cutting-edge AI tools for log analysis, anomaly detection, and predictive maintenance.
qodeworld is seeking a Senior Site Reliability Architect to join the team in Austin, Texas. This position focuses on unified observability, proactive detection, AIOps, and GenAI-driven operations for distributed financial services platforms. The role requires deep technical expertise in designing and maintaining reliable, high-performance systems across complex architectures.
Role overview
The Senior Site Reliability Architect will drive enhancements in platform reliability and performance. This includes building SLI/SLO-driven monitoring, implementing dynamic thresholds, and developing intelligent alerting and AI/ML-based anomaly detection. The position is central to evolving operational practices from reactive alerting to proactive, insight-driven approaches.
Key responsibilities
Design and deploy unified observability dashboards that integrate metrics, logs, traces, events, and system topology.
Establish and manage SLIs, SLOs, and error budgets aligned with business goals.
Create actionable dashboards for operational, engineering, and leadership teams.
Implement advanced alerting strategies using both static and dynamic thresholds.
Apply AI/ML/AIOps technologies to detect anomalies, forecast incidents, and reduce MTTR.
Shift monitoring practices from reactive alerting to proactive insights.
Incorporate noise reduction, alert correlation, and root cause analysis.
Use baseline modeling, seasonality detection, and anomaly scoring.
Oversee and resolve issues in multi-service architectures, including microservices, APIs, Kafka/streaming platforms, and cloud infrastructure (Terraform, Infrastructure as Code).
Analyze and trace issues across upstream/downstream dependencies, streaming platforms, infrastructure, and application code.
Work extensively with Dynatrace (mandatory requirement).
Utilize tools such as OpenTelemetry, Prometheus/Grafana, ELK/EFK, and cloud-native monitoring solutions (AWS, Azure, GCP).
Manipulate and enrich telemetry using JSON.
Apply GenAI/LLMs for incident summarization, root cause explanations, runbook recommendations, and auto-remediation suggestions.
Collaborate with platform teams to operationalize GenAI technologies safely.
Requirements
15+ years of experience in Site Reliability Engineering or Production Engineering.
Strong background in unified observability, AIOps, and related fields.
Proven experience with AI/ML technologies and cloud-native environments.
About Future Secure AI
Future Secure AI develops solutions in artificial intelligence for real-world business challenges. The company values courage, precision, and curiosity, and supports an entrepreneurial culture where every team member is recognized. Leadership is experienced and approachable, with a focus on supporting individual growth. Team members work alongside colleagues from diverse backgrounds and contribute to projects that have impact across industries.
Role Overview: Site Reliability Engineer
The Site Reliability Engineer will design, build, and maintain the platforms that power Future Secure AI's AI Co-Workers. This is a hands-on position with responsibility for reliability throughout the product lifecycle. The role involves close collaboration with product, AI, and engineering teams to ensure platform stability and performance.
Full-time|On-site|Austin, TX, Reston, VA, Boston, MA
Join our dynamic App Platform team as a Senior Platform Engineer, where you'll wear multiple hats including Architect, Developer, Consultant, and Leader. You'll design robust systems, write code for scalable applications, and collaborate with server teams to deliver high-quality infrastructure solutions. Your ability to articulate your experiences through blog posts and presentations will contribute to our open-source initiatives, enhancing our presence in the tech community.
About Telnyx
Telnyx is a trailblazer in the realm of global connectivity, actively constructing the future rather than just envisioning it. Our innovative solutions, from designing a private, global, multi-cloud IP network to delivering hyperlocal edge technology via user-friendly APIs, are revolutionizing seamless interconnections among people, devices, and applications.
We are motivated by a commitment to revamping outdated processes, automating manual tasks, and addressing genuine challenges through advanced connectivity solutions. Our financial stability and profitability empower us to invest in cutting-edge technologies and cultivate a culture of continuous learning and advancement for our team.
Our vision is a world where unrestricted connectivity drives boundless innovation. By joining us, you will play a pivotal role in laying the groundwork for this interconnected future. We are currently on the lookout for enthusiastic individuals eager to contribute to an industry-defining company while enhancing their own skills and career trajectories.
The Role
As a Messaging Compliance Specialist, you will be our key expert on Application-to-Person (A2P) messaging standards. This role involves bridging the gap between regulatory requirements and technical implementation while ensuring our platform and clients stay compliant with the rapidly changing messaging landscape. You'll also assist in developing tools that streamline the compliance process.
Join ICON as a Reliability Engineer II on the innovative Titan Team, where we create cutting-edge print systems. Your expertise will be crucial in guiding the Titan machine into Serial Production. In this role, you will evaluate system performance, pinpoint vulnerabilities, and develop strategies to enhance the overall reliability and consistency of our products. This position is based at our Austin, TX office.
Join Saronic as a Civil/Site Engineer specializing in Infrastructure, where you will play a pivotal role in designing and implementing innovative engineering solutions. You will collaborate with a diverse team of professionals to ensure the successful execution of infrastructure projects, enhancing the quality and sustainability of civil engineering.
Full-time|Remote|Remote (Atlanta, Austin, San Francisco, Seattle)
Role overview
ditto is seeking a Senior Platform Engineer, Operator for a fully remote role. This position is open to candidates located in Atlanta, Austin, San Francisco, or Seattle. The focus is on designing, building, and maintaining systems that keep company operations running smoothly and efficiently at scale.
What you will do
Design and implement systems that improve platform scalability, performance, and reliability.
Maintain and enhance the existing infrastructure to support ongoing business operations.
Work closely with cross-functional teams to address technical challenges and streamline processes.
Use technical expertise and leadership to drive key platform initiatives.
Requirements
Extensive experience in platform engineering.
Strong problem-solving abilities and a collaborative mindset.
Proven ability to contribute technical insights and lead engineering projects.
Location
This is a remote position. Candidates must be based in Atlanta, Austin, San Francisco, or Seattle.
We are seeking a talented and motivated Platform Engineer to join our dynamic team at Allen Control Systems. In this role, you will be responsible for designing, implementing, and maintaining scalable and robust platform solutions that meet our company's needs. Your expertise will contribute to our innovative projects and help shape the future of our technology stack.
Full-time|Remote|Remote (Atlanta, Austin, San Francisco, Seattle)
ditto is hiring an Engineering Manager to lead the Platform team. This remote role is available to candidates based in Atlanta, Austin, San Francisco, or Seattle. The Platform team focuses on building and integrating solutions that support both user experience and the company’s operational needs.
Role overview
The Engineering Manager will oversee a group of engineers dedicated to platform systems. This position involves guiding the team’s technical direction and ensuring that platform solutions are both scalable and reliable.
What you will do
Lead and mentor engineers working on platform systems
Guide the development of scalable, reliable solutions
Work closely with cross-functional teams to align on project goals and deliverables
Encourage continuous improvement and technical excellence within the team
Ensure platform integrations operate smoothly and support business objectives
Requirements
Experience managing engineering teams
Strong background in building scalable systems
Skilled in project management and working across teams
Comfortable working remotely and leading distributed teams
Proven commitment to team growth and maintaining high standards
About Base Power
Base Power is a US-based power company focused on transforming the energy grid. The team works to build a decentralized power system by deploying distributed batteries across the country. Engineers, operators, and problem-solvers at Base Power address major challenges in the energy sector together.
Role Overview: Deployment Engineer – Site Survey
This Deployment Engineer position connects field operations with systems engineering. The role centers on improving how Base Power evaluates, approves, and executes hardware deployments at multiple locations. The engineer will refine site survey processes and set configuration standards to keep deployments consistent, secure, and reliable.
Key Responsibilities
Design and maintain internal tools and automated workflows to scale site survey reviews and make data ingestion across systems more efficient.
Act as the technical authority for hardware configurations, setting and enforcing criteria for deployment approvals.
Define, document, and uphold high standards for site survey reviews, supporting safety, consistency, and operational efficiency as deployment volume grows.
Use SQL and analytics tools to examine field data and installation results, spot process bottlenecks, and drive improvements in deployment operations.
Build internal dashboards with tools such as Python, JavaScript, or Retool to provide real-time insights into the site survey pipeline and key metrics.
Work closely with Field Operations, Hardware Engineering, and Software teams to turn deployment challenges into engineering solutions and technical requirements.
Develop and maintain detailed documentation for review criteria, internal tools, configuration standards, and operational processes.
Location: Austin, TX
mks2technologies seeks an On-site IT Customer Service Engineer to join the team in Austin, TX. This position acts as the primary contact for IT support, assisting clients with technical issues to help keep their daily operations running smoothly.
Key responsibilities
Diagnose and troubleshoot technical problems directly at client sites
Offer clear, practical solutions and support
Maintain attentive and timely customer service with every client interaction
Work location
This role is fully on-site in Austin, TX. Regular presence at client locations is required.
Apptronik seeks a Data Platform Engineer to shape and enhance the data architecture supporting our robotics products. This role involves close collaboration with teams across the company to ensure our data systems are robust and effective.
Key responsibilities
Design and build data systems that power robotic technologies
Collaborate with engineers and other groups to align data architecture with project requirements
Improve data pipelines and infrastructure for better performance and reliability
Work location
This position is based in Austin, TX.
Join our dynamic team at PIMCO, a premier global asset management firm with a commitment to helping millions of investors achieve their financial aspirations. With over 3,000 employees across 20 offices in 15 countries, we seek innovative thinkers who thrive in a collaborative environment. At PIMCO, we value diversity, hard work, and a continuous learning ethos.As a Java Site Reliability Engineer (SRE) specializing in Messaging Platforms, you will play a critical role in shaping our technology strategies to enhance operational efficiency. Your responsibilities will include supporting various messaging platforms such as MQ, AMPS, and Kafka, ensuring optimal tool selection and sustainable messaging strategies. You will also focus on improving operational efficiency through advanced tools and monitoring systems.This position requires a passion for messaging systems, collaborative problem-solving skills, and a strong foundation in software development. You will have the opportunity to contribute to critical business solutions that align with our strategic vision for trading applications.
Full-time|On-site|Austin, TX/Akron, Ohio/Irvine, CA
Join Restaurant365 as a Site Reliability Engineer II, where you'll play a vital role in ensuring the availability, performance, and reliability of our systems. You will collaborate with cross-functional teams to design, implement, and maintain robust infrastructure solutions that enhance our operational efficiency.
For over 25 years, Realtor.com® has stood as the premier online platform trusted by real estate professionals, seamlessly connecting buyers, sellers, and renters with invaluable insights and expert advice to discover their ideal home. Our comprehensive suite of tools not only transforms the real estate landscape, but also aids consumers in navigating one of life's most significant decisions—making it simple, intuitive, and empowering.Join us in our mission to enable more individuals to find their way home by dismantling barriers, fostering meaningful connections, and instilling confidence with expert guidance.About the RoleWe are on the lookout for a Staff Site Reliability Engineer to become a vital member of our newly established Operations Excellence organization, reporting directly to the Director of Operations Excellence. This pivotal position will define the reliability, observability, and operational excellence of our platform infrastructure that serves millions of users. As a Staff SRE, you will take on a technical leadership role, mentoring others and establishing best practices, while influencing architectural decisions to empower our team of 600+ engineers in delivering outstanding customer experiences.You will engage with crucial platform systems, including EKS infrastructure, Skyway (CI/CD), Frontdoor (Tyk API Gateway), Pantheon (Apollo GraphQL Federation), and our observability stack, all while implementing chaos engineering practices and spearheading cost optimization initiatives that yield measurable ROI.We are committed to employing the best tools to expedite problem-solving. You will be expected to adeptly utilize AI coding assistants and LLMs to enhance development speed, generate boilerplate code, and troubleshoot intricate debugging scenarios. In addition to basic usage, this role demands the critical judgment to assess AI-generated outputs for security, performance, and accuracy. 
You should be comfortable incorporating AI tools into your daily tasks to minimize repetitive work, allowing you to concentrate on high-impact architectural and strategic engineering challenges.What You'll DoPlatform Reliability & InfrastructureDesign and maintain highly available AWS infrastructure, including EKS clusters, Fargate (ECS), and multi-region architectures.Take ownership of the reliability of essential services: Skyway (CI/CD), Frontdoor (Tyk), Pantheon (Apollo GraphQL), and associated infrastructure.Establish SLIs, SLOs, and error budgets for Tier 1/2/3 systems; lead architectural reviews focused on reliability and cost-efficiency.Drive...
In 2024, cybercrime rates are anticipated to escalate, as evidenced by the FBI's IC3 report, which highlighted a staggering loss of over $16 billion. The real estate sector, unfortunately, remains a prime target for cybercriminals, particularly through investment fraud and BEC scams. At CertifID, we are committed to combating this threat by offering a secure platform that authentically verifies the identities of transaction participants, validates wire transfer instructions, and identifies potential fraud attempts. Our innovative technology is engineered to reduce risks, ensuring that every transaction is executed with utmost confidence and security.Our success hinges on our exceptional team. Recognized as one of the Best Startups to Work in Austin, we proudly made the Inc. 5000 list and received the award for Best Culture by Purpose Jobs for three consecutive years. Our core values and vision for a world without wire fraud guide us as we strive to create a dynamic work environment where every team member can make a significant impact in enhancing security and combating fraud.Position Overview:We are on the lookout for a Senior Site Reliability Engineer (Senior SRE) to spearhead reliability enhancements within our production SaaS environment. You will play an essential role in developing scalable infrastructure models, advancing our observability efforts, optimizing incident response, and collaborating with engineering teams to integrate reliability into system design and deployment.This position is tailor-made for a seasoned Senior SRE who thrives on tackling intricate operational challenges, building automation solutions, and mentoring fellow engineers.
Full-time|$110K/yr - $128K/yr|Hybrid|Austin, Texas, United States
Join Striveworks as a Site Reliability Engineer. At Striveworks, we empower organizations to leverage artificial intelligence to tackle real-world challenges in national security and business. Our mission is to serve as the command center where data, models, and business outcomes converge. Founded by a team of passionate data scientists and engineers, Striveworks simplifies the journey from deployment to ongoing optimization. We ensure that our clients aren’t just deploying AI; they’re establishing robust systems that are reliable, adaptable, and poised to scale in an ever-changing landscape. As a Site Reliability Engineer, you will play a pivotal role in implementing and managing corporate systems from day one. You will work with an array of systems and infrastructure automation tools while having the opportunity to innovate and enhance our toolset. Your focus will be on developing sustainable solutions that prevent future issues, thereby minimizing operational toil. Your daily responsibilities will include: Developing and maintaining infrastructure as code across private (OpenStack) and commercial (AWS, Azure, GCP) cloud environments. Creating configuration management automation for Windows laptops and Linux servers. Providing comprehensive user support for all corporate systems. This role is based in a hybrid/on-site setting at our northwest Austin office.
Site Reliability Engineer Overview: Join Weedmaps as a Site Reliability Engineer and collaborate with diverse teams across application development, infrastructure, and quality assurance to elevate the performance, reliability, and scalability of our web services at Weedmaps.com. As a fully cloud-native organization, we operate all our services within Docker containers on Kubernetes, hosted on AWS. Our culture promotes observability, proactive monitoring, and CI/CD automation, enabling us to release multiple production updates daily. In this role, you will utilize your engineering expertise to improve system monitoring, streamline CI workflows, and refine our deployment pipelines. You will serve as a knowledge resource for development teams, guiding them in utilizing standardized tools for metrics, logging, and deployment processes. Collaborate closely with both development and infrastructure teams to identify key service metrics that go beyond the basics, working with application teams to develop libraries that facilitate easy instrumentation of their services. Your Impact: Collaborate with stakeholders to establish best practices in monitoring and CI/CD pipelines. Troubleshoot issues within our deployment CI pipeline. Promote and support a strong DevOps culture within Weedmaps. Identify automation opportunities and advocate for codification across all processes. Share best practices regarding collaboration, reliability, security, and performance with all partner teams. Take responsibility for the configuration and scaling of applications, ensuring adherence to organizational practices. Develop and enhance synthetic monitoring workflows.
As the leading online platform for real estate professionals for over 25 years, Realtor.com® connects buyers, sellers, and renters with trusted insights and expert guidance to find their ideal home. Our comprehensive suite of tools significantly impacts the real estate industry and enhances the consumer experience, making it simple, understandable, and empowering for individuals navigating one of life's biggest purchases. Join us in our mission to help people find their way home by dismantling barriers to entry, establishing the right connections, and fostering confidence through expert guidance. About the Role: We are looking for a Senior Site Reliability Engineer to become a crucial member of our newly established Operations Excellence organization, reporting directly to the Director of Operations Excellence. In this pivotal role, you will enhance the reliability, observability, and operational excellence of our platform infrastructure that serves millions of users. As a Senior SRE, you will be a key technical contributor, implementing best practices, addressing complex challenges, and empowering our team of over 600 engineers to deliver outstanding customer experiences. Your responsibilities will include working on critical platform systems such as EKS infrastructure, Skyway (CI/CD), Frontdoor (Tyk API Gateway), Pantheon (Apollo GraphQL Federation), and our observability stack. You will also play a part in chaos engineering practices and cost optimization initiatives, ensuring measurable ROI. We believe in employing the best tools to solve problems efficiently. You will be expected to adeptly use AI coding assistants and LLMs to accelerate development speed, generate boilerplate code, and resolve complex debugging issues. Beyond mere usage, this role demands the critical judgment to evaluate AI-generated outputs for security, performance, and accuracy.
You should be comfortable incorporating AI tools into your daily routines to reduce repetitive tasks, allowing you to concentrate on high-impact architectural and strategic engineering challenges. What You'll Do: Platform Reliability & Infrastructure: Design, implement, and maintain highly available AWS infrastructure, including EKS clusters, Fargate (ECS), and multi-region architectures. Ensure the reliability of essential services: Skyway (CI/CD), Frontdoor (Tyk), Pantheon (Apollo GraphQL), and their supporting infrastructure. Monitor SLIs, SLOs, and error budgets for Tier 1/2/3 systems; participate in architectural reviews focused on reliability and cost-efficiency. Implement reliability patterns such as circuit breakers, graceful degradation, and automatic failover strategies.
At BetterUp, we believe in the power of human transformation, and our approach to the employer-employee relationship reflects that belief. From the moment you engage with us, you will notice a distinct experience. It's not just about filling a position; it's about joining a mission-driven team. Upon accepting an offer, you gain more than just a paycheck: you will receive a dedicated BetterUp Coach, a personalized development plan, and a supportive manager. You'll also be part of an extraordinary team, each member accompanied by their own BetterUp Coach, working on projects that make a real impact. This unique environment fosters a focused and fulfilling work experience. While it may not be for everyone, for those who are passionate and driven, this role represents a transformative career opportunity. Join us for an intense and rewarding journey, where you'll engage in meaningful work within a vibrant and creative culture. If this resonates with you and the job description aligns with your skills, let’s start a conversation. As a hybrid company, we emphasize in-person collaboration when necessary. Employees must be available to work from one of our office hubs a minimum of two days per week, totaling eight days per month. Our US hubs include: Austin, TX; Chicago, IL; New York City, NY; San Francisco, CA; and the Washington, DC metro area. For roles based in Europe, our hubs are located in London, UK, and Amsterdam, NL.
Please ensure you can commit to this structure before applying. Key Responsibilities: Utilize AI-driven tools and automation to enhance monitoring, troubleshooting, and maintenance of production systems. Develop and manage cloud infrastructure on AWS, employing Terraform for codifying and version-controlling our environments. Oversee and scale Kubernetes clusters that support BetterUp's platform, ensuring optimal availability and performance. Create intelligent alerting and observability frameworks. Collaborate with engineering teams to integrate reliability into the development lifecycle, proactively addressing operational concerns. Automate incident response processes and establish self-healing infrastructure. Explore and implement cutting-edge AI tools for log analysis, anomaly detection, and predictive maintenance.
qodeworld is seeking a Senior Site Reliability Architect to join the team in Austin, Texas. This position focuses on unified observability, proactive detection, AIOps, and GenAI-driven operations for distributed financial services platforms. The role requires deep technical expertise in designing and maintaining reliable, high-performance systems across complex architectures. Role overview The Senior Site Reliability Architect will drive enhancements in platform reliability and performance. This includes building SLI/SLO-driven monitoring, implementing dynamic thresholds, and developing intelligent alerting and AI/ML-based anomaly detection. The position is central to evolving operational practices from reactive alerting to proactive, insight-driven approaches. Key responsibilities Design and deploy unified observability dashboards that integrate metrics, logs, traces, events, and system topology. Establish and manage SLIs, SLOs, and error budgets aligned with business goals. Create actionable dashboards for operational, engineering, and leadership teams. Implement advanced alerting strategies using both static and dynamic thresholds. Apply AI/ML/AIOps technologies to detect anomalies, forecast incidents, and reduce MTTR. Shift monitoring practices from reactive alerting to proactive insights. Incorporate noise reduction, alert correlation, and root cause analysis. Use baseline modeling, seasonality detection, and anomaly scoring. Oversee and resolve issues in multi-service architectures, including microservices, APIs, Kafka/streaming platforms, and cloud infrastructure (Terraform, Infrastructure as Code). Analyze and trace issues across upstream/downstream dependencies, streaming platforms, infrastructure, and application code. Work extensively with Dynatrace (mandatory requirement). Utilize tools such as OpenTelemetry, Prometheus/Grafana, ELK/EFK, and cloud-native monitoring solutions (AWS, Azure, GCP). Manipulate and enrich telemetry using JSON. 
Apply GenAI/LLMs for incident summarization, root cause explanations, runbook recommendations, and auto-remediation suggestions. Collaborate with platform teams to operationalize GenAI technologies safely. Requirements 15+ years of experience in Site Reliability Engineering or Production Engineering. Strong background in unified observability, AIOps, and related fields. Proven experience with AI/ML technologies and cloud-native environments.
About Future Secure AI Future Secure AI develops solutions in artificial intelligence for real-world business challenges. The company values courage, precision, and curiosity, and supports an entrepreneurial culture where every team member is recognized. Leadership is experienced and approachable, with a focus on supporting individual growth. Team members work alongside colleagues from diverse backgrounds and contribute to projects that have impact across industries. Role Overview: Site Reliability Engineer The Site Reliability Engineer will design, build, and maintain the platforms that power Future Secure AI's AI Co-Workers. This is a hands-on position with responsibility for reliability throughout the product lifecycle. The role involves close collaboration with product, AI, and engineering teams to ensure platform stability and performance.
Full-time|On-site|Austin, TX, Reston, VA, Boston, MA
Join our dynamic App Platform team as a Senior Platform Engineer, where you'll wear multiple hats including Architect, Developer, Consultant, and Leader. You'll design robust systems, write code for scalable applications, and collaborate with server teams to deliver high-quality infrastructure solutions. Your ability to articulate your experiences through blog posts and presentations will contribute to our open-source initiatives, enhancing our presence in the tech community.
About Telnyx: Telnyx is a trailblazer in the realm of global connectivity, actively constructing the future rather than just envisioning it. Our innovative solutions, from designing a private, global, multi-cloud IP network to delivering hyperlocal edge technology via user-friendly APIs, are revolutionizing seamless interconnections among people, devices, and applications. We are motivated by a commitment to revamping outdated processes, automating manual tasks, and addressing genuine challenges through advanced connectivity solutions. Our financial stability and profitability empower us to invest in cutting-edge technologies and cultivate a culture of continuous learning and advancement for our team. Our vision is a world where unrestricted connectivity drives boundless innovation. By joining us, you will play a pivotal role in laying the groundwork for this interconnected future. We are currently on the lookout for enthusiastic individuals eager to contribute to an industry-defining company while enhancing their own skills and career trajectories. The Role: As a Messaging Compliance Specialist, you will be our key expert on Application-to-Person (A2P) messaging standards. This role involves bridging the gap between regulatory requirements and technical implementation while ensuring our platform and clients stay compliant with the rapidly changing messaging landscape. You'll also assist in developing tools that streamline the compliance process.
Join ICON as a Reliability Engineer II on the innovative Titan Team, where we create cutting-edge print systems. Your expertise will be crucial in guiding the Titan machine into Serial Production. In this role, you will evaluate system performance, pinpoint vulnerabilities, and develop strategies to enhance the overall reliability and consistency of our products. This position is based at our Austin, TX office.
Join Saronic as a Civil/Site Engineer specializing in Infrastructure, where you will play a pivotal role in designing and implementing innovative engineering solutions. You will collaborate with a diverse team of professionals to ensure the successful execution of infrastructure projects, enhancing the quality and sustainability of civil engineering.
Full-time|Remote|Remote (Atlanta, Austin, San Francisco, Seattle)
Role overview ditto is seeking a Senior Platform Engineer, Operator for a fully remote role. This position is open to candidates located in Atlanta, Austin, San Francisco, or Seattle. The focus is on designing, building, and maintaining systems that keep company operations running smoothly and efficiently at scale. What you will do Design and implement systems that improve platform scalability, performance, and reliability. Maintain and enhance the existing infrastructure to support ongoing business operations. Work closely with cross-functional teams to address technical challenges and streamline processes. Use technical expertise and leadership to drive key platform initiatives. Requirements Extensive experience in platform engineering. Strong problem-solving abilities and a collaborative mindset. Proven ability to contribute technical insights and lead engineering projects. Location This is a remote position. Candidates must be based in Atlanta, Austin, San Francisco, or Seattle.
We are seeking a talented and motivated Platform Engineer to join our dynamic team at Allen Control Systems. In this role, you will be responsible for designing, implementing, and maintaining scalable and robust platform solutions that meet our company's needs. Your expertise will contribute to our innovative projects and help shape the future of our technology stack.
Full-time|Remote|Remote (Atlanta, Austin, San Francisco, Seattle)
ditto is hiring an Engineering Manager to lead the Platform team. This remote role is available to candidates based in Atlanta, Austin, San Francisco, or Seattle. The Platform team focuses on building and integrating solutions that support both user experience and the company’s operational needs. Role overview The Engineering Manager will oversee a group of engineers dedicated to platform systems. This position involves guiding the team’s technical direction and ensuring that platform solutions are both scalable and reliable. What you will do Lead and mentor engineers working on platform systems Guide the development of scalable, reliable solutions Work closely with cross-functional teams to align on project goals and deliverables Encourage continuous improvement and technical excellence within the team Ensure platform integrations operate smoothly and support business objectives Requirements Experience managing engineering teams Strong background in building scalable systems Skilled in project management and working across teams Comfortable working remotely and leading distributed teams Proven commitment to team growth and maintaining high standards
About Base Power Base Power is a US-based power company focused on transforming the energy grid. The team works to build a decentralized power system by deploying distributed batteries across the country. Engineers, operators, and problem-solvers at Base Power address major challenges in the energy sector together. Role Overview: Deployment Engineer – Site Survey This Deployment Engineer position connects field operations with systems engineering. The role centers on improving how Base Power evaluates, approves, and executes hardware deployments at multiple locations. The engineer will refine site survey processes and set configuration standards to keep deployments consistent, secure, and reliable. Key Responsibilities Design and maintain internal tools and automated workflows to scale site survey reviews and make data ingestion across systems more efficient. Act as the technical authority for hardware configurations, setting and enforcing criteria for deployment approvals. Define, document, and uphold high standards for site survey reviews, supporting safety, consistency, and operational efficiency as deployment volume grows. Use SQL and analytics tools to examine field data and installation results, spot process bottlenecks, and drive improvements in deployment operations. Build internal dashboards with tools such as Python, JavaScript, or Retool to provide real-time insights into the site survey pipeline and key metrics. Work closely with Field Operations, Hardware Engineering, and Software teams to turn deployment challenges into engineering solutions and technical requirements. Develop and maintain detailed documentation for review criteria, internal tools, configuration standards, and operational processes. Location: Austin, TX
mks2technologies seeks an On-site IT Customer Service Engineer to join the team in Austin, TX. This position acts as the primary contact for IT support, assisting clients with technical issues to help keep their daily operations running smoothly. Key responsibilities Diagnose and troubleshoot technical problems directly at client sites Offer clear, practical solutions and support Maintain attentive and timely customer service with every client interaction Work location This role is fully on-site in Austin, TX. Regular presence at client locations is required.
Apptronik seeks a Data Platform Engineer to shape and enhance the data architecture supporting our robotics products. This role involves close collaboration with teams across the company to ensure our data systems are robust and effective. Key responsibilities Design and build data systems that power robotic technologies Collaborate with engineers and other groups to align data architecture with project requirements Improve data pipelines and infrastructure for better performance and reliability Work location This position is based in Austin, TX.
Apr 23, 2026