Experience Level
Senior
Qualifications
Proven experience in infrastructure engineering and cloud systems. Strong understanding of network architecture, security protocols, and infrastructure automation. Experience with DevOps practices and tools. Excellent problem-solving skills and the ability to work independently. Strong communication skills, both verbal and written.
About the job
Join Fieldguide as a Senior Infrastructure Engineer and be at the forefront of our innovative infrastructure solutions. In this role, you will lead the design, implementation, and maintenance of our infrastructure systems while ensuring optimal performance, security, and scalability. Your expertise will help shape our technology strategy and drive impactful projects.
About Fieldguide
Fieldguide is a forward-thinking technology company dedicated to transforming the way organizations manage their infrastructure. We pride ourselves on our collaborative culture and commitment to innovation. Join us and be part of a team that values creativity and growth.
Similar jobs
Senior Infrastructure & Performance Engineer
As a Senior Infrastructure & Performance Engineer, you will take charge of enhancing the performance, reliability, and scalability of Nash's foundational infrastructure. Collaborating closely with the Engineering Leadership and both platform and product engineering teams, you will design and manage low-latency, mission-critical systems that facilitate real-time logistics for some of the world's largest retailers.
This is a key senior role focused on elastic capacity, high availability, cloud-native architectures, Postgres performance, and enterprise-grade CI/CD for multi-region deployments. You will define the technical roadmap, establish best practices, and implement systems that support the essential workflows of major retailers.
Key Responsibilities
Oversee infrastructure performance and reliability for Nash's production environments, ensuring low latency, high throughput, and consistent performance under load.
Design, develop, and enhance AWS infrastructure, utilizing managed services with a focus on ECS/Fargate.
Lead initiatives in Postgres performance engineering, including query optimization, indexing strategies, connection management, replication, cluster design, and failover.
Architect and maintain multi-region, highly available systems with robust resiliency and guaranteed disaster recovery.
Design and refine enterprise-grade CI/CD pipelines that enable safe, repeatable, and rapid deployments across environments and regions.
Establish observability standards (metrics, logs, tracing, SLOs) to proactively identify and resolve performance bottlenecks.
Collaborate with application engineers to inform system design choices that influence scalability, latency, and reliability.
Lead incident response efforts and postmortems, emphasizing root cause analysis, systemic improvements, and long-term resilience.
Set best practices for infrastructure and performance while mentoring engineers throughout the organization.
Qualifications
6+ years of experience in building and managing high-scale production infrastructure for mission-critical systems.
Proficiency with AWS, particularly with ECS/Fargate, and experience with cloud-native architecture.
Strong background in Postgres performance tuning and optimization.
Deep understanding of CI/CD practices and experience in multi-region deployments.
Exceptional analytical and problem-solving skills, with a proactive approach to performance management.
Full-time|$166K/yr - $225K/yr|On-site|San Francisco, California
P-97 At Databricks, we are dedicated to empowering data teams to tackle some of the most challenging problems in the world. We achieve this by creating and managing a leading data and AI infrastructure platform that enables our clients to leverage deep data insights for business enhancement. Our commitment to pushing the limits of data and AI technology is matched by our focus on resilience, security, and scalability, which are essential for our customers' success on our platform. Databricks operates one of the largest-scale software platforms, comprising millions of virtual machines that generate terabytes of logs and process exabytes of data daily. Given our scale, we frequently encounter cloud hardware, network, and operating system faults, and our software must adeptly protect our customers from these issues. As a Senior Performance Engineer, you will collaborate with various teams throughout the organization to assess product and feature performance, pinpoint performance bottlenecks, and partner with engineers to address performance and scalability challenges. This includes setting performance goals for different software releases, guiding teams in developing performance benchmarks, conducting competitive benchmark analyses for various Databricks products, and performing in-depth analyses to identify and resolve performance issues.
Full-time|$180K/yr - $210K/yr|On-site|San Francisco, CA
About Sigma Computing
Sigma Computing builds AI-powered apps and analytics tools that connect directly to cloud data warehouses. Teams use Sigma to create applications, automate workflows, and analyze live data through a spreadsheet interface, SQL and Python editors, visual builders, and integrated AI features. The platform supports everything from interactive analyses to reports and embedded data experiences.
Role Overview: Senior Product Manager - Platform Performance & Infrastructure
Sigma is growing to serve larger enterprises with demanding, complex workloads. The Senior Product Manager for Platform Performance & Infrastructure will guide the development of core backend systems that keep Sigma responsive and reliable as usage scales. This role focuses on driving improvements in:
Workbook performance
Query lifecycle management
Compute and caching strategies
Metadata services
Compiler components
New warehouse connectors
These systems are essential for Sigma’s ability to deliver consistent, high-quality performance to enterprise customers.
What You Will Do
Define and prioritize product enhancements for backend platform performance and scalability
Work closely with platform engineering and cross-functional teams to address technical challenges
Translate performance and scalability needs into clear product requirements and measurable objectives
Ensure Sigma’s infrastructure can support enterprise clients with reliability and speed
Who We’re Looking For
Experienced Senior Product Manager with strong technical background
Comfortable working hands-on with backend systems and infrastructure
Skilled at collaborating with engineering and cross-functional partners
Focused on delivering measurable improvements for customers
Location & On-Site Requirement
This position is based in San Francisco, CA. It requires working on-site at the Sigma office at least four days per week.
Join Crusoe as a Senior Systems Performance Engineer, where you will play a crucial role in optimizing and enhancing our systems for superior performance. You will be responsible for diagnosing performance bottlenecks, implementing solutions, and ensuring that our infrastructure can scale efficiently. Work in a dynamic environment that encourages innovation and professional growth.
Full-time|Remote|San Francisco, CA, New York, NY, Portland, OR, or Remote within Canada or United States
Join Mercury as a Senior Infrastructure Engineer, where you will be pivotal in shaping the infrastructure that supports our innovative financial solutions. You will work closely with cross-functional teams to design, implement, and maintain scalable and reliable infrastructure systems. This role is ideal for individuals who thrive in a fast-paced environment and are passionate about leveraging technology to drive business success.
About Us
At Lemurian Labs, we are dedicated to democratizing AI technology while prioritizing sustainability. Our mission is to create solutions that minimize environmental impact, ensuring that artificial intelligence serves humanity positively. We are committed to responsible innovation and the sustainable growth of AI.
We are in the process of developing a state-of-the-art, portable compiler that empowers developers to 'build once, deploy anywhere.' This technology ensures seamless cross-platform integration, allowing for model training in the cloud and deployment at the edge, all while maximizing resource efficiency and scalability.
If you are passionate about scaling AI sustainably and are eager to make AI development more powerful and accessible, we invite you to join our team at Lemurian Labs. Together, we can build a future that is innovative and responsible.
The Role
We are seeking a Senior ML Performance Engineer to take charge of designing and leading our Performance Testing Platform from inception. In this pivotal role, you will be recognized as the technical expert in measuring, validating, and enhancing the performance of large language models (including Llama 3.2 70B, DeepSeek, and others) prior to and following compiler optimization on cutting-edge GPU architectures.
This is a critical position that will significantly impact our product quality and customer success. You will work at the intersection of Machine Learning systems, GPU architecture, and performance engineering, constructing the infrastructure that substantiates the value of our compiler.
Join our dynamic team at Bland Inc. as a Senior Infrastructure Engineer, where you will play a critical role in designing and implementing robust infrastructure solutions. You will work alongside a talented group of professionals, using cutting-edge technology to drive innovation and efficiency.
Full-time|Remote|San Francisco, CA or Remote (USA)
Join Fieldguide as a Senior Infrastructure Engineer and be at the forefront of our innovative infrastructure solutions. In this role, you will lead the design, implementation, and maintenance of our infrastructure systems while ensuring optimal performance, security, and scalability. Your expertise will help shape our technology strategy and drive impactful projects.
Be part of our mission to redefine AI by shaping the narrative surrounding document understanding.
Role Overview
At LlamaIndex, our Infrastructure team lays the groundwork for our product and provides essential tools that facilitate the development, deployment, and monitoring of our code. We are tasked with designing, constructing, and scaling the core infrastructure that drives a high-capacity data platform for AI applications. We seek individuals who are passionate about creating supportive systems that enhance our engineering capabilities and contribute to our rapidly expanding product suite.
Ideal candidates will have a strong background in cloud infrastructure management, navigating various scalability challenges, and enhancing the productivity of the broader Engineering team. Key traits we value in our culture include a customer-centric mindset, collaboration, diligence, and optimism. We are looking for proactive team players who are eager to help us evolve our culture as we grow.
Key Responsibilities
Collaborate with engineering teams to develop and maintain foundational systems that empower developers and support our rapid growth.
Design and execute scalable infrastructure solutions suitable for various deployment models, including SaaS, single-tenant, and private environments.
Oversee and optimize cloud resources and Kubernetes clusters to ensure cost-effectiveness and high performance.
Facilitate successful external customer deployments by establishing clear infrastructure guidelines and principles.
Enhance the release and deployment processes to improve efficiency and reliability.
Ensure compliance with applicable regulations and implement comprehensive security measures across all deployment environments.
Qualifications
Minimum of 5 years of engineering experience.
Experience working on Platform or Infrastructure teams on substantial projects involving infrastructure components like Terraform/CDKTF, Kubernetes, Helm, testing infrastructure, release management, and observability.
Proficient in optimizing cloud resource utilization.
Skilled in tuning Kubernetes clusters and cloud resources for optimal performance and cost efficiency.
Dedicated to cultivating LlamaIndex’s engineering culture as we expand.
Ability to balance speed and pragmatism in delivering solutions.
Full-time|$143K/yr - $210K/yr|On-site|Livingston, NJ / New York, NY / Sunnyvale, CA / San Francisco, CA / Bellevue, WA
Join CoreWeave: The Essential Cloud for AI™
At CoreWeave, we empower innovators to build and scale AI confidently. Our platform, crafted by pioneers for pioneers, is trusted by top AI labs, startups, and global enterprises. With superior infrastructure performance and deep technical expertise, we accelerate breakthroughs and transform compute into capability. Since our inception in 2017, we've grown to become a publicly traded company (Nasdaq: CRWV) by March 2025. Discover more at www.coreweave.com.
Role Overview:
As a Senior Marketing Performance Analyst, you will be pivotal in architecting our measurement strategy. Your role transcends mere reporting; you will uncover insights that define the 'why' and 'what's next' for our marketing initiatives. Collaborating with Growth, Events, and Operations teams, you will develop and enhance a cohesive metrics framework tracking the buyer's journey from the initial digital interaction to closed-won deals. Your mission is to ensure that every dollar spent on events and digital tactics is quantifiable and optimized in alignment with our long-term revenue objectives.
Key Responsibilities:
Metrics Framework & Taxonomy: Collaborate with Marketing Operations, Growth, and Demand Generation teams to establish KPIs, performance frameworks, reporting hierarchies, and custom attribution models to assess marketing impact.
Goal Setting & Forecasting: Spearhead the annual and quarterly target-setting process, utilizing historical conversion data to create 'reverse funnel' forecast models that clarify required traffic and lead volumes to achieve sales targets.
End-to-End Analysis: Evaluate the performance of integrated campaigns, comparing the high-touch ROI of field events with the efficiency of digital channels (SEM, Paid Social, Content Syndication).
Insight Synthesis: Go beyond dashboards to craft executive-level narratives. Present weekly updates and monthly/quarterly business reviews to Marketing and Sales leadership.
Are you a passionate engineer with a knack for building robust infrastructure? Join our dynamic team at fal as a Senior/Staff Infrastructure Engineer. In this pivotal role, you will design and implement innovative solutions that enhance our infrastructure's efficiency and reliability.
As a key member of our engineering team, your responsibilities will include:
Architecting scalable infrastructure solutions to meet our growing needs.
Collaborating with cross-functional teams to identify and resolve infrastructure challenges.
Implementing automation tools and frameworks to streamline operations.
Monitoring performance and ensuring the security of our systems.
Providing mentorship and guidance to junior engineers.
We are looking for individuals who thrive in a fast-paced environment and have a deep understanding of infrastructure technologies.
Full-time|$160K/yr - $300K/yr|On-site|San Francisco
About Apiphany
Apiphany is a trailblazing AI company focused on revolutionizing physical product development. We empower innovators across automotive, aerospace, medtech, and energy sectors to convert vast unstructured technical data into real-time, actionable insights. Supported by elite investors including Markforged, Databricks, GM, and Character, our mission is to transform engineering decision-making, turning complexity into simplicity for leading manufacturers worldwide.
Our advanced models are designed to address the intricacies of engineering and manufacturing, comprehending physics principles, design specifications, and program constraints. Our small, elite team consists of builders hailing from prestigious institutions such as Stanford, Berkeley, MIT, UW, and CMU, along with industry veterans from GM, Ford, and Genesis Therapeutics. We are committed to advancing hard-tech and establishing a market-leading company together.
About the Role
In the role of Senior / Staff Infrastructure Engineer at Apiphany, you will architect, build, and manage the infrastructure that underpins our intelligence platform. Your responsibilities will encompass secure, reliable, and scalable cloud deployments, including the unique challenge of deploying across both internal and customer-managed cloud environments.
You will ensure our systems adhere to stringent requirements for latency, availability, and compliance within data-intensive environments. Additionally, you will shape our security strategy, implement infrastructure-as-code practices, and establish a solid foundation enabling engineering teams to deliver with assurance.
At ClickUp, we're not just developing software; we're shaping the future of work! In an era dominated by work sprawl, we identified a more efficient way. This led us to create the first truly integrated AI workspace, consolidating tasks, documents, chat, calendar, and enterprise search, all enhanced by context-driven AI. Our mission is to empower millions of teams to escape silos, reclaim their time, and reach unprecedented levels of productivity. At ClickUp, you'll have the chance to learn, innovate, and leverage AI in transformative ways that will not only influence our product but also the broader landscape of work itself. Join a daring, pioneering team that's challenging the limits of what's possible!
We are on the lookout for a technical leader in SaaS client performance who is passionate about enhancing the customer experience through top-tier performance solutions. As a Senior Performance Engineer, you will spearhead comprehensive strategies to optimize application speed, memory utilization, and reliability across our entire platform. You will be empowered to analyze, diagnose, and address performance bottlenecks wherever they arise (front-end, back-end, or infrastructure), ensuring ClickUp remains the fastest and most reliable productivity platform available.
The ideal candidate is a hands-on authority in browser and NodeJS performance, with a thorough understanding of how code influences rendering, memory management, and overall user experience. You excel in solving intricate challenges, collaborating across teams, and establishing new benchmarks for performance excellence. If you're driven to make a significant impact for millions of users, this is your chance to lead at scale.
Your Responsibilities:
Conduct root cause analysis on client performance issues and perform post-mortems.
Profile application code to identify inefficient algorithms, memory leaks, and other issues; propose and implement effective solutions.
Establish performance monitoring, alerting, and dashboards to proactively detect and resolve client performance challenges.
Examine client traffic patterns, load testing outcomes, and other metrics to set benchmarks and drive enhancements.
Champion performance best practices and set performance standards across the engineering organization.
Identify infrastructure upgrades (caching, CDNs, database optimization) to elevate the client experience.
Collaborate with development teams to incorporate performance as a core requirement in the development of new features.
Who We Are:
TwelveLabs is at the forefront of developing innovative multimodal foundation models that enable video comprehension akin to human understanding. Our groundbreaking models have set new benchmarks in video-language modeling, enhancing our capabilities and revolutionizing how we engage with and analyze diverse media formats.
With an impressive $107 million in Seed and Series A funding, we're supported by premier venture capital firms including NVIDIA’s NVentures, NEA, Radical Ventures, and Index Ventures, alongside influential AI pioneers like Fei-Fei Li, Silvio Savarese, and Alexandr Wang. Our headquarters in San Francisco, complemented by a significant presence in Seoul, highlights our dedication to fostering global innovation.
We celebrate the individuality of every team member’s journey, believing that the diverse cultural, educational, and life experiences of our employees fuel our ability to challenge the status quo. We seek passionate individuals who resonate with our mission and are eager to make a significant impact as we advance technology to reshape the world. Join us in redefining video understanding and multimodal AI.
About the Role
As a Senior Staff Infrastructure Engineer at TwelveLabs, you will leverage your technical expertise and leadership skills to construct the systems that drive our multimodal foundation models. Your focus will be on designing and enhancing a scalable, secure, and high-performance infrastructure that accommodates extensive AI workloads across both cloud-based and on-premises environments.
This position demands strong technical acumen, an eagerness to delve into low-level systems when necessary, and the capability to influence infrastructure strategy through hands-on contributions and operational improvements. Your impact will be felt through your technical expertise and the results you deliver, rather than through hierarchical status, in a dynamic and fast-paced environment.
In this role, you will:
Architect and advance cloud and hybrid infrastructure, blending hands-on execution with technical leadership.
Guide the development of AI/ML infrastructure components, engaging directly in critical tasks when necessary.
Define infrastructure standards and abstractions while maintaining close interaction with production systems.
Collaborate closely with Machine Learning Engineers, Data Scientists, Backend Developers, and other key stakeholders to ensure system alignment and efficiency.
Who We Are
Serval is an innovative AI-driven automation platform redefining operational efficiency for enterprises. Our intelligent agents seamlessly comprehend and execute real-world workflows, replacing outdated manual processes with adaptive, self-learning software. Since our inception in early 2024, we have garnered the trust of industry leaders such as General Motors, Notion, Perplexity, Vercel, Mercor, LangChain, and Verkada, streamlining high-volume operational tasks across their organizations.
At the heart of Serval is a cutting-edge agentic AI platform that transforms natural language into actionable workflows. Our agents not only respond to queries but also reason, act across various systems, and continuously enhance their performance. What started as a solution for operational tasks has rapidly expanded into a versatile AI automation layer utilized across IT, HR, Finance, Security, Legal, and Engineering sectors.
Our mission is to eradicate repetitive, manual tasks within enterprises, empowering teams through intelligent automation. In the long run, we aim to establish a universal AI operations layer: a system of agents that integrates across business functions, maintaining the momentum of modern companies.
We are proud to be backed by renowned investors including Sequoia Capital, Redpoint Ventures, Meritech, First Round, General Catalyst, and Elad Gil, and founded by seasoned product and engineering leaders from Verkada.
Role Overview
As a Senior Software Engineer in Infrastructure at Serval, you will be pivotal in developing and scaling the core systems that empower our AI agents and workflow automation platform. A crucial aspect of this role involves enabling and supporting self-hosted deployments for enterprise clients needing on-premises or private cloud environments. We are looking for engineers with profound expertise in distributed systems, infrastructure-as-code, production operations, and customer-facing support, who aspire to influence the technical architecture of a rapidly evolving platform.
What You'll Do
Design, implement, and operate large-scale distributed systems that power Serval's AI agents, workflow orchestration, and data pipelines.
Create and maintain Terraform modules to provision and manage cloud infrastructure across AWS, GCP, or Azure environments.
Develop and sustain deployment packages, installation scripts, and infrastructure templates, enabling customers to self-host Serval in their own environments.
Provide technical support and guidance to enterprise customers during installation and deployment phases.
At Sciforium, we are at the forefront of AI infrastructure, pioneering advanced multimodal AI models and an innovative, high-efficiency serving platform. With substantial backing from AMD and a dedicated team of engineers, we are rapidly expanding our capabilities to support the next generation of frontier AI models and real-time applications.
About the Role
We are looking for a highly skilled Senior HPC & GPU Infrastructure Engineer who will be responsible for ensuring the health, reliability, and performance of our GPU compute cluster. As the primary custodian of our high-density accelerator environment, you will serve as the crucial link between hardware operations, distributed systems, and machine learning workflows. This position encompasses a range of responsibilities, from hands-on Linux systems engineering and GPU driver setup to maintaining the ML software stack (CUDA/ROCm, PyTorch, JAX, vLLM). If you are passionate about optimizing hardware performance, enjoy troubleshooting GPUs at scale, and aspire to create world-class AI infrastructure, we would love to hear from you.
Your Responsibilities
1. System Health & Reliability (SRE)
On-Call Response: Be the primary responder for system outages, GPU failures, node crashes, and other cluster-wide incidents, ensuring rapid issue resolution to minimize downtime.
Cluster Monitoring: Develop and maintain monitoring protocols for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and general system load.
Vendor Liaison: Collaborate with data center personnel, hardware vendors, and on-site technicians for repairs, RMA processing, and physical maintenance of the cluster.
2. Linux & Network Administration
OS Management: Oversee the installation, patching, and maintenance of Linux distributions (Ubuntu / CentOS / RHEL), ensuring consistent configuration, kernel tuning, and automation for large node fleets.
Security & Access Controls: Set up VPNs, iptables/firewalls, SSH hardening, and network routing to secure our computing infrastructure.
Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity and administer distributed file systems like NFS, GPFS, or Lustre.
3. GPU & ML Stack Engineering
Deployment & Bring-Up: Spearhead the deployment of new GPU nodes, including BIOS configuration and software integration to ensure optimal performance.
Full-time|$200K/yr - $400K/yr|Remote|San Francisco
At Inferact, we are dedicated to establishing vLLM as the premier AI inference engine, propelling advancements in AI by making inference both cost-effective and expeditious. Founded by the original creators and key maintainers of vLLM, we occupy a unique position at the convergence of models and hardware, an achievement that has taken years to realize.
Role Overview
We are seeking a talented Infrastructure Engineer to develop the distributed systems that facilitate inference on a global scale. In this role, you will design and implement essential layers that allow vLLM to deploy models across thousands of accelerators with minimal latency and maximum reliability. Our vision is to make deploying cutting-edge models at scale as simple as launching a serverless database. The complexities will be seamlessly integrated into the robust infrastructure you will be creating.
Full-time|$200K/yr - $200K/yr|On-site|San Francisco
Join Convex in revolutionizing application development!
At Convex, we are on a mission to redefine how software is constructed on the Internet. Our innovative platform enables developers to create swift, dependable, and dynamic applications without the need for a backend team. We offer a comprehensive full-stack application platform, meticulously designed with abstractions for databases, computing, and backend services, allowing both developers and LLMs to innovate rapidly, ensuring products that are scalable and maintain simplicity throughout their lifecycle.
About Our Team:
Our Convex team comprises engineers who have architected and built some of the largest backends globally, managing exabytes of data and millions of transactions per second. We are a friendly, collaborative group of passionate individuals who thrive on in-person collaboration in our San Francisco office.
Position Overview:
As Convex evolves, we are seeking outstanding senior or staff-level engineers to help us architect and sustain the future of our infrastructure at scale. If you have a passion for distributed systems and a robust background in designing and managing web infrastructure, we want to connect with you!
We value robust architecture, effective collaboration, and simplicity. Our team embraces high ownership and places significant emphasis on operational excellence. This role is not solely focused on operations; we seek individuals who are dedicated to designing and constructing systems in the most effective manner possible, especially in a startup environment.
Your Responsibilities:
Architect, construct, and oversee Convex’s global cloud infrastructure.
Analyze and enhance the performance and reliability of our systems.
Independently prioritize projects, collaborating closely with the engineering team and CTO.
Establish best practices and reliability standards as we expand our team and systems.
Develop sophisticated systems and database code.
Engage with feedback from leadership regarding seeking simpler and more elegant solutions.
What We Value:
A strong enthusiasm for distributed systems and backend infrastructure.
A collaborative spirit and a desire to grow with the team.
A commitment to best practices and maintaining high standards in engineering.
Join Our Mission
At Hyperbolic Labs, we are dedicated to democratizing artificial intelligence by eliminating barriers to computing power through our Open-Access AI Cloud. We aggregate global computing resources to provide an innovative GPU marketplace and AI inference service, making AI affordable and accessible for everyone. As pioneers at the crossroads of AI and open-source technology, we envision a future where AI innovation is driven by imagination, not resource limitations. We invite forward-thinking individuals who share our vision of making AI universally accessible, secure, and cost-effective to join us in crafting a platform that empowers innovators to realize their groundbreaking AI projects.
As we gear up for expansion following our Series A funding, our team, led by co-founders with PhDs in AI, Mathematics, and Computer Science, is set to transform the landscape of computing.
The Role
We are on the lookout for a Senior Infrastructure Engineer to drive the development and scaling of Hyperbolic's GPU Cloud Marketplace. In this pivotal role, you will create a multi-tenancy provisioning and virtualization solution that transforms raw GPUs from diverse global suppliers into a programmable, orchestrated resource pool serving thousands of AI developers and researchers. You will work at the forefront of cloud infrastructure, building the core orchestration layer that allows our platform to deliver cost savings of up to 75% compared to traditional cloud providers.
Compensation: Competitive base salary + substantial equity
Benefits: Health & dental insurance, gym reimbursement, daily team lunches, 401(K)
About Julius
At Julius, we're pioneering advancements in applied AI by developing cutting-edge coding agents. Our platform executes approximately 1 million lines of code every 36 hours, serving over 1 million users and generating 3 million+ visualizations. We manage all code in isolated remote containers. As a revenue-generating entity, we are backed by AI Grant and founders with remarkable backgrounds from companies like Vercel, Notion, Perplexity, Palantir, Replit, Zapier, Intercom, and Dropbox.
The Role
Join us in building and scaling the robust code-execution platform that powers Julius, across both cloud and on-prem environments. We orchestrate over 500,000 containers/month and the demand is growing rapidly. You will take ownership of reliability, performance, and security within our multi-tenant compute environment.
Your Responsibilities
Design and manage a secure, multi-tenant container infrastructure that ensures quick startup and intelligent autoscaling.
Implement on-prem/private cloud deployments using Helm and Terraform, integrating SSO, network controls, and audit logging.
Enhance observability (metrics, traces, logs) with well-defined SLOs and lead incident response initiatives.
Optimize images, scheduling, networking, and costs, while developing fair-use and rate-limiting controls.
Your Qualifications
Strong experience with production Kubernetes and container internals (Docker/containerd); solid understanding of networking principles.
Familiarity with cloud environments (AWS/GCP/Azure) and Infrastructure as Code (Terraform/Helm).
Proficiency in monitoring and logging tools (Prometheus, Grafana, OpenTelemetry, ELK/Vector).
Understanding of security best practices for containerized, multi-tenant systems.
Preferred Qualifications
Experience with gVisor, Kata, Firecracker; Cilium/eBPF; GPU scheduling; serverless autoscaling (KEDA/Knative/Karpenter).
Proven experience delivering on-prem or air-gapped enterprise software solutions.
A passion for AI, with experience building side projects involving LLMs.
Why Join Julius?
Be part of a small, senior team where your contributions will have a massive impact. Tackle challenging infrastructure problems at a meaningful scale.
Aug 11, 2025