Experience Level
Senior
Qualifications
We are seeking a talented Senior Cloud Infrastructure Engineer responsible for architecting and scaling our deployment infrastructure that supports agent behavior monitoring at an enterprise level. This pivotal role will enable our enterprise customers to deploy Judgment Labs solutions in diverse environments, including multi-region cloud setups, self-hosted solutions, or Bring Your Own Cloud (BYOC) deployments, all while ensuring adherence to security, compliance, and reliability standards. The ideal candidate will have substantial experience in building distributed systems capable of managing real production traffic and will take ownership of the infrastructure from architecture to operations.
About the job
At Judgment Labs, we are pioneering the way that Agent Behavior Monitoring (ABM) is approached. Unlike conventional observability methods that primarily focus on logging exceptions and latency, our innovative ABM technology identifies behavioral anomalies such as instruction drifts and context retrieval losses in large-scale production environments.
Our platform is trusted by numerous teams developing autonomous agents, enabling them to gain insights into system behavior post-deployment. By moving beyond reactive incident management, our users can analyze patterns across conversations and workflows, correlate regressions to specific interaction types, and accurately identify where reliability issues arise within their operational context.
Recent funding success: We have successfully raised over $30M across two funding rounds within the last five months, attracting notable investors such as Lightspeed, SV Angel, Valor Equity Partners, and more.
About Judgment Labs
Judgment Labs is at the forefront of developing innovative infrastructure solutions for Agent Behavior Monitoring. Our unique approach empowers teams to proactively understand and enhance the performance of autonomous agents, driving reliability and efficiency in production environments.
Full-time|$200K/yr - $400K/yr|Remote|San Francisco
At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, aiming to propel AI advancements by making inference processes more efficient and cost-effective. Our company is founded by the original creators and core maintainers of vLLM, placing us at a unique intersection of models and hardware, a position we have cultivated over ma…
Join Handshake as a Senior Cloud Engineer, where you'll play a pivotal role in designing and implementing scalable cloud solutions. Your expertise will help us enhance our cloud infrastructure while ensuring high availability and security.
We are looking for a passionate and driven individual who thrives in a fast-paced environment, is eager to tackle complex challenges, and enjoys collaborating with cross-functional teams. If you have a deep understanding of cloud architectures and services, we want to hear from you!
Role Overview
Crusoe Technologies is seeking a Senior Staff Software Engineer focused on Managed Orchestration to help shape the direction of our cloud infrastructure. This position is based in San Francisco, CA.
What You Will Do
Design and implement scalable orchestration solutions for cloud infrastructure.
Lead a team of engineers, providing technical guidance and mentorship.
Work closely with cross-functional groups to integrate services and products smoothly.
Apply deep technical expertise to drive the development of new technologies that improve operational efficiency and customer experience.
About Crusoe Technologies
Crusoe Technologies builds innovative solutions for cloud computing with a focus on efficiency and sustainability.
Full-time|$180K/yr - $210K/yr|On-site|San Francisco, CA - US
At Crusoe, our mission is to accelerate the fusion of energy and intelligence. We are building the infrastructure that empowers individuals to innovate boldly with AI, ensuring that our advancements come without compromises in scale, speed, or sustainability.
Join us in revolutionizing AI with sustainable technology at Crusoe, where you will spearhead impactful innovations, contribute to meaningful projects, and collaborate with a team that is reshaping the future of responsible cloud infrastructure.
About the Role:
We are looking for a talented Senior Software Engineer to join our cloud software team. Your role will be pivotal in enhancing our state-of-the-art infrastructure. You will leverage your expertise to design and scale our carbon-reducing operating model while managing essential hardware, software, and networking components.
In this position, you will write and review code, develop proposals, and contribute to architectural documents. You will assess tools and frameworks, weighing their implications on reliability, scalability, operational costs, and ease of implementation. Your knowledge of orchestration and optimization will be crucial in advancing our managed Kubernetes and AI training clusters, ensuring they maintain a competitive edge in reliability and performance.
What You'll Be Working On:
Develop scalable and resilient software solutions that align with the strategic goals outlined in the Crusoe Cloud roadmap.
Collaborate with tech leads and engineers to foster an environment of creativity and technical excellence, driving the development of innovative cloud solutions.
Stay updated on the latest cloud software trends and techniques, integrating these insights to keep Crusoe’s innovations at the forefront of the industry.
Although you won’t have formal management responsibilities, you will mentor your colleagues by sharing knowledge and guiding technical discussions.
What You'll Bring to the Team:
5-7 years of experience in software engineering, with strong expertise in Systems Engineering.
2+ years of programming experience in GoLang.
Experience with Kubernetes and Linux engineering, including debugging capabilities.
A proactive attitude towards problem-solving and continuous learning.
About Us
Braintrust is at the forefront of AI observability, seamlessly integrating evaluations and observability into a single workflow. Our platform empowers innovators by providing them with the critical insights needed to understand AI performance in production environments and the tools required to enhance it.
Recognized by leading companies such as Notion, Stripe, Zapier, Vercel, and Ramp, Braintrust enables teams to compare AI models, test prompts, and detect regressions, transforming production data into superior AI with each iteration.
Role Overview
We are seeking a talented Cloud Infrastructure Engineer to join our team and contribute to the development of a robust and scalable infrastructure. You will provide developers with a premium platform to deploy code efficiently and confidently. Your role will involve leading initiatives across Terraform, Kubernetes, CI/CD, observability, and support, significantly impacting Braintrust's internal operations and the self-hosted experiences of our customers.
This position is pivotal as you will manage our AWS environment while assisting customers in deploying our infrastructure on AWS, Azure, and GCP.
Your Responsibilities
Develop and maintain Terraform modules for both internal infrastructure and customer deployments.
Engage directly with customers via Slack to assist with self-hosting and troubleshoot infrastructure challenges, creating tools to simplify their support process.
Take ownership of our CI/CD pipeline, aiming to reduce build times, enhance failure visibility, and facilitate safer, quicker releases.
Centralize and scale observability through logs, metrics, dashboards, and alerts.
Collaborate with engineering teams to create and enhance a secure, developer-friendly infrastructure platform.
Support multi-cloud deployment strategies, primarily in AWS, while also extending support for Azure and GCP for our enterprise clientele.
Implement tools and automation to bolster deployment, rollback, and infrastructure reliability.
Ideal Candidate Profile
A minimum of 5 years of experience in DevOps, SRE, or Infrastructure Engineering roles.
In-depth knowledge of Terraform and experience with at least one major cloud provider, preferably AWS.
Proficient in Kubernetes, with capabilities in deploying, debugging, and scaling real workloads.
Strong programming skills in scripting languages like Python, Typescript, or Go.
Experience in supporting production systems and managing incidents effectively.
Comfortable working closely with customers in a support or deployment capacity.
Bonus: Familiarity with monitoring and logging tools, as well as knowledge of security best practices.
Full-time|$144K/yr - $258K/yr|On-site|San Francisco
At Braze, we take immense pride in our people. Our team is approachable, exceptionally kind, and driven by passion.
We are committed to fostering this passion by establishing high standards, promoting teamwork, and cultivating a harmonious work-life balance as we collectively navigate rapid global growth while advocating for equity and opportunity both within and outside our organization.
To thrive in this environment, you must be ready to set ambitious goals for yourself and those around you. There is always an opportunity to contribute: exercising autonomy, embracing accountability, and welcoming diverse perspectives are vital to our ongoing success.
Our deep curiosity and eagerness to share our varied interests with one another enrich our culture, creating a unique vibrancy.
If you are motivated to tackle exciting challenges and are proactive in adapting to change, you will have the chance to make a significant impact here, supported by a talented and enthusiastic team. If Braze resonates with you, we eagerly await the opportunity to meet you.
WHAT YOU'LL DO
We are looking for a Senior Cloud Security Engineer to join our established Security Engineering team. Braze is a modern, cloud-first SaaS organization that operates entirely on cloud-native infrastructure, utilizing large-scale, distributed systems across AWS, GCP, and self-managed Kubernetes environments. We seek an engineer with profound cloud security expertise who can collaborate with DevOps, Infrastructure, and Product Engineering teams to enhance our cloud security posture, secure our platforms, and contribute to the future of Cloud Security at Braze.
As a Senior Cloud Security Engineer at Braze, you will engage in a variety of initiatives, including:
Collaborating closely with Infrastructure, SRE, and Product Engineering to design secure cloud architectures and develop scalable security controls for both new and existing services.
Implementing and enhancing end-to-end cloud security controls across AWS, GCP, Kubernetes, CI/CD pipelines, and self-managed systems.
Leading and refining our existing vulnerability management workflow for cloud assets, including scanning, triage, prioritization, and remediation using tools like Tenable and native CSP capabilities.
Managing and optimizing security tooling such as CrowdStrike (EDR/CSPM/IR), cloud-native security services, and SIEM detection rules with the assistance of our established SIEM Management function.
Conducting threat modeling for new cloud technologies and patterns adopted throughout engineering.
Playing an active role in incident response, cloud forensics, and runtime security investigations.
Join Perplexity as a Senior Cloud Security Engineer and play a pivotal role in transforming how users search and interact with the internet. As a key member of our innovative security team, you will spearhead initiatives to construct and sustain secure and scalable cloud infrastructure, enabling our engineers to innovate swiftly and securely.
Core Responsibilities
Collaborate with infrastructure and engineering teams to embed security measures into development processes and advocate for secure-by-default practices.
Develop Terraform modules that incorporate essential security features, including logging, encryption, and automated threat detection.
Implement cloud-native detection capabilities utilizing AWS GuardDuty, Security Hub, and tailor-made detection rules to uncover credential breaches, crypto-mining, and lateral movements.
Ensure compliance with SOC 2 Type II and ISO 27001 by automating the collection of cloud control evidence.
Conduct security assessments of cloud resource configurations using tools like AWS Config and Open Policy Agent, addressing discrepancies in line with CIS Benchmarks and internal security policies.
Fortify CI/CD and supply chain pipelines through controls such as artifact signing, secret scanning, and dependency monitoring.
Implement zero trust principles via stringent network segmentation, authentication, and authorization across cloud environments.
Engage in security on-call rotation, responding to security alerts and incidents for prompt resolution and root cause analysis.
Full-time|$115K/yr - $135K/yr|On-site|San Francisco, CA - US
At Crusoe, our mission is to foster a future abundant with energy and intelligence. We are building the infrastructure that enables ambitious AI creations without compromising on scale, speed, or sustainability.
Join us in leading the AI revolution through sustainable technology at Crusoe. In this role, you will drive significant innovations, contribute to impactful solutions, and collaborate with a team that is at the forefront of responsible, transformative cloud infrastructure.
About This Role
We are looking for an Atlassian Cloud Engineer who will be the primary architect and strategic leader of our Atlassian ecosystem. This role combines in-depth technical knowledge with strong project management skills to ensure that Jira, Confluence, and related tools are optimized for collaboration, delivery, and service management. As the platform owner and change agent, you will drive innovation, implement best practices, and facilitate adoption as our business scales.
What You’ll Be Working On
Managing the daily administration and long-term strategy for the Atlassian suite, including Jira Software, Jira Service Management, Confluence, Opsgenie, Product Discovery, and Statuspage.
Customizing workflows, permissions, dashboards, and automation to enhance project execution, visibility, and inter-team collaboration.
Collaborating with project managers, business leaders, and IT teams to integrate effective project management and IT service management practices within the Atlassian tools.
Designing sophisticated reporting and analytics using eazyBI, JQL, filters, CQL, and dashboards to enable real-time decision-making.
Serving as the escalation point for complex platform issues, working closely with Atlassian Support and our internal Service Desk teams.
Leading enhancements of the platform by staying updated with Atlassian’s roadmap and promoting the adoption of new features.
Advancing AI initiatives within the Atlassian ecosystem, including the development of custom agents using Atlassian Rovo and the implementation of the Rovo MCP server.
Collaborating across IT to ensure system reliability, security, and alignment with broader business objectives.
What You’ll Bring to the Team
A minimum of 3 years of experience in administering Atlassian Cloud environments in enterprise settings.
Demonstrated ownership of Jira and Confluence customization, site administration, and daily operations.
Proficient in advanced reporting and analytics tools.
Strong project management skills and the ability to collaborate effectively across teams.
Full-time|On-site|San Francisco, California, United States
Join code-metal as a Senior Platform DevOps Engineer, where you will play a pivotal role in enhancing our cloud and on-premises infrastructure. You will be responsible for deploying, managing, and optimizing systems to ensure high availability and performance. This position offers an exciting opportunity to work with cutting-edge technologies and collaborate within a dynamic team.
Join our dynamic team at leverdemo-8 as a Software Engineer specializing in Cloud Infrastructure. We are passionate about reimagining the hiring landscape and are looking for talented engineers to enhance our YugaByte DB for enterprise applications. Your expertise will contribute to optimizing orchestration support across major public clouds including AWS, Google Cloud, and Azure, as well as Kubernetes services and private data centers. You'll play a crucial role in the control and manageability plane of YugaByte and collaborate with tools such as Prometheus and Alert Manager to ensure seamless infrastructure management.
Please note that this position is part of Lever's testing environment; we kindly ask you not to apply for this role.
Join our innovative team at litellm as a DevOps Engineer. In this role, you will be instrumental in enhancing our development and operations processes, ensuring seamless integration and delivery of our services. Collaborate with cross-functional teams to design, implement, and manage scalable infrastructure solutions.
We are looking for a passionate individual with a strong foundation in cloud technologies, automation, and continuous integration/continuous deployment (CI/CD) practices. Your expertise will help us drive efficiency and reliability in our software delivery lifecycle.
Join our innovative team at Crusoe as a Staff Product Manager for Orchestration. In this pivotal role, you will lead our efforts in enhancing product orchestration strategies, ensuring seamless integration and execution of our technology solutions. Your expertise will guide cross-functional teams, drive product vision, and ultimately contribute to our mission of transforming the technology landscape.
The Opportunity
Join rowspace as an Infrastructure Engineer and play a pivotal role in constructing and safeguarding the core of our cutting-edge AI data platform. In this position, you'll engineer systems capable of managing extensive volumes of sensitive financial information while adhering to rigorous security and compliance standards. Your work will involve real-time integration of public data with private, tenant-isolated customer data at scale.
Key Responsibilities
Design and implement scalable infrastructure to support our AI-driven knowledge engine that processes both structured and unstructured financial data.
Establish a security-first architecture for private cloud environments, ensuring data governance aligns with financial services regulations.
Create resilient data ingestion pipelines that accommodate a variety of data sources, from CapIQ feeds (structured data) to internal SharePoint documents (unstructured data).
Develop comprehensive monitoring and alerting systems for our BYOC platform.
Enforce access controls and maintain audit trails to ensure that AI interactions can be traced back to primary sources.
Collaborate with our AI Research and Product teams to enhance infrastructure for LLM inference and training workloads, as well as agent infrastructure development.
Establish CI/CD practices and infrastructure-as-code for swift, reliable deployments across multiple cloud providers.
Full-time|$133.2K/yr - $159.8K/yr|On-site|San Francisco, CA
At Fastly, we empower individuals to forge deeper connections with the things they cherish. Our cutting-edge edge cloud platform enables clients to swiftly, securely, and reliably craft exceptional digital experiences by processing, serving, and safeguarding their applications as close to their end-users as possible, right at the edge of the Internet. This platform is tailored to leverage the modern internet, is highly programmable, and supports agile software development methodologies. Our clientele includes renowned global brands, such as GitHub, Yelp, Paramount, and JetBlue.
Join us in our mission to create a more trustworthy Internet.
Posting Open Date: March 13th, 2026
Anticipated Posting Close Date*: May 30th, 2026
*Job posting may close early due to the volume of applicants.
Data Engineer
As part of Fastly's Analytics team, you will empower leaders across the organization with actionable data that drives essential business decisions. We are focused on expanding and enhancing our premier internal data platform. In your role as a Data Engineer, you will play a pivotal part in transforming our data infrastructure, optimizing data pipelines on Google Cloud Platform (GCP), scaling the ingestion of complex data sources, and adhering to best practices for performance and reliability. This is your chance to contribute to significant projects within a dynamic, collaborative, and innovative workspace, supporting data scientists, analysts, and business analysts across our organization.
Full-time|$180K/yr - $220K/yr|On-site|San Francisco, CA - US
At Crusoe, we are on a mission to revolutionize the future by accelerating the abundance of energy and intelligence. We are building the foundational engine that empowers individuals to create bold innovations with AI while ensuring sustainability, speed, and scalability.
Join us in the forefront of the AI revolution with cutting-edge sustainable technology. You will play a pivotal role in driving meaningful innovation, making a significant impact, and collaborating with a team that is leading the way in responsible, transformative cloud infrastructure.
About the Role
As a Senior Staff Cloud Support Engineer, you will serve as a technical expert within Crusoe Cloud and significantly enhance the efforts of our Customer Experience, SRE, Networking, Fleet, and Product teams. Your role transcends basic ticket resolution; you will design reliability frameworks, influence architectural decisions, mentor senior engineers, and safeguard revenue by averting large-scale incidents. With profound expertise in Linux systems, Kubernetes, networking, and AI/ML infrastructure, you will apply your knowledge with a strong focus on customer satisfaction. You will be comfortable navigating uncertainty, leading incident responses, and shaping the global scaling of high-performance AI infrastructure.
Key Responsibilities
Act as the top escalation point for complex P1/P0 incidents.
Lead cross-functional investigations into root causes involving compute, networking (IB/RDMA/RoCE), storage, and orchestration layers.
Collaborate with SRE and Software teams (Storage, Networking, Compute, K8) to devise systemic solutions rather than temporary fixes.
Reliability Architecture
Design and enhance node validation, burn-in processes, performance baselining, and release readiness.
Influence Kubernetes architecture, workload orchestration (Slurm, Terraform), and AI/ML cluster stability.
Minimize MTTR and prevent incident recurrence through structural enhancements.
AI/ML Infrastructure Expertise
Troubleshoot NCCL, IB, GPU driver/firmware issues, and distributed training failures.
Support complex AI workloads (training + inference) through performance tuning and observability enhancements.
Customer-Facing Authority
Act as a senior technical advisor during high-stakes customer incidents.
At Greptile, we are on a mission to develop intelligent agents that autonomously verify code modifications. Our current focus involves utilizing AI to analyze pull requests on GitHub, effectively identifying bugs and enforcing coding standards. With our technology, we review nearly 1 billion lines of code each month for over 3,000 companies.
Challenges We Are Excited To Tackle
Developing agents that can learn coding standards through experience, similar to how new hires adapt.
Determining customer-specific preferences for pull request feedback using sample-efficient reinforcement learning to enhance signal-to-noise ratios.
Implementing automated deployments of feature branches and leveraging agents to stress-test the application for bug detection.
Our Growth Trajectory
Serving over 7,000 customers.
Successfully raised $30 million from prominent investors including Benchmark, Y Combinator, Paul Graham, and Initialized.
Our Team
We have curated a highly skilled team that has successfully scaled vital functions at leading companies such as Stripe, Google, Figma, and others.
Key Responsibilities
Design and implement resilient infrastructure to accommodate Greptile's expanding user base.
Collaborate with our largest enterprise clients to facilitate the deployment of Greptile within their environments.
Streamline the on-premise deployment process to support smaller clients with minimal hands-on intervention.
About Sieve
Sieve stands as a pioneering AI research lab dedicated solely to video data. Our innovative approach integrates exabyte-scale video infrastructure with state-of-the-art video understanding techniques and a myriad of data sources, creating unparalleled datasets that redefine video modeling. With video accounting for 80% of global internet traffic, it has become the vital digital medium fueling creativity, communication, gaming, AR/VR, and robotics. At Sieve, we aim to eliminate the most significant bottleneck hindering the expansion of these applications: access to high-quality training data.
With strategic partnerships with leading AI labs, our team of just 12 has achieved remarkable financial success, generating $XXM last quarter alone. Earlier this year, we secured Series A funding from elite firms including Matrix Partners, Swift Ventures, Y Combinator, and AI Grant.
About the Role
As we process petabytes of video across numerous nodes and cloud environments, ensuring reliability, observability, and security is essential to our growth.
We are seeking our inaugural Reliability Engineer, who will focus entirely on fortifying the infrastructure that underpins Sieve. This role demands high ownership and a deep understanding of:
System throughput and stability
Monitoring and incident management
Security principles, including least-privilege design
Minimizing operational burdens for the entire engineering team
You will collaborate closely with our CTO and founding engineers to develop the foundational tools that empower our engineering efforts.
This position is ideal for an engineer who is passionate about reliability, throughput, observability, and security. You are proactive in anticipating potential failure modes, reducing operational risks, and designing resilient systems.
If a system failure occurs, you take it personally, thriving under the weight of responsibility.
What You'll Be Doing
Collaborate with engineering to design and validate infrastructure supporting PB-scale workloads
Develop and manage Terraform-based multi-cloud deployments
Enhance cloud and data security (SSO, IAM, least privilege access, auditability)
Lead incident response efforts and strengthen systems against failures
Create CI/CD systems to minimize user errors and maximize safety
Establish monitoring and alerting frameworks (Prometheus, OpenTelemetry, VictoriaMetrics)
Join the City and County of San Francisco as a Stationary Engineer at our Sewage Plant. In this role, you will be responsible for operating and maintaining essential machinery and systems that ensure the safe and efficient processing of sewage. This is a critical position that contributes to public health and environmental protection.
Role Overview
At Variance, we are at the forefront of teaching machines to execute high-stakes judgment calls on a large scale. This involves developing AI agents that navigate the complex domains of risk investigations, fraud detection, and identity verifications.
Our San Francisco-based team is small yet exceptionally talented, comprising former founders and specialists from leading AI laboratories. We cater to an impressive clientele, including Fortune 500 companies, global marketplaces, and regulated financial institutions. If you are passionate about taking ownership, working swiftly, and collaborating closely with founders, you will thrive in our environment.
We are seeking a Security Engineer to help establish a robust security foundation. You will collaborate across product, infrastructure, and internal systems to ensure that Variance is secure by design, enabling us to meet the rigorous standards needed to deploy AI in critical workflows for the world’s largest corporations.
Mar 30, 2026