AI Engineer LLM Infrastructure jobs in San Francisco – Browse 7,345 openings on RoboApply Jobs

AI Engineer LLM Infrastructure jobs in San Francisco

Open roles matching “AI Engineer LLM Infrastructure” in San Francisco. 7,345 active listings on RoboApply Jobs.

7,345 jobs found

Yutori
Full-time|On-site|San Francisco, California, United States

At Yutori, we are transforming the way individuals engage with the digital realm by developing AI agents capable of efficiently performing everyday online tasks. Our approach is to create a comprehensive, agent-first ecosystem, encompassing everything from training proprietary models to designing innovative generative product interfaces. To further this missi…

Mar 26, 2025
Scale AI
Full-time|$216.2K/yr - $270.3K/yr|On-site|San Francisco, CA; New York, NY

Join our dynamic Machine Learning Infrastructure team as a Senior AI Infrastructure Engineer, where you will play a pivotal role in designing and constructing platforms that ensure the scalable, reliable, and efficient serving of Large Language Models (LLMs). Our innovative platform supports a range of cutting-edge research and production systems, catering to both internal and external applications across diverse environments.

The ideal candidate will possess a solid foundation in machine learning principles coupled with extensive experience in backend system architecture. You will thrive in a collaborative environment that bridges research and engineering, working diligently to provide seamless experiences for our customers and accelerating innovation across the organization.

Mar 26, 2026
Ivo
Full-time|On-site|San Francisco, California

Join the Crew of Ivo!
At Ivo, we are more than just engineers; we are the pioneers of the digital seas! Our crew has set sail with groundbreaking innovations that have reshaped the landscape of legal tech:
• An AI agent that seamlessly integrates with MS Word to enhance your documents [2023]
• Transitioning from traditional embedding models to agentic RAG for superior performance [2023]
• Advancing large-scale LLM-driven legal fact extraction [2024]
• A legal assistant capable of accurately searching vast contract databases [2024]
• Clustering legal documents from the same lineage [2025]
• Implementing automatic deviation analysis to uncover hidden risks in extensive contract databases [2025]
• Merging contracts with amendments to create comprehensive “composite” contracts (one of our clients shed tears of joy upon seeing this) [2025]

The Role of an Infrastructure Engineer
As an Infrastructure Engineer, you will be the architect of Ivo's platform, ensuring its robustness and scalability. Your mission includes:
• Taking ownership of our environment's future, with ample room for creative system design.
• Managing numerous customer deployments—every client deserves a unique setup, from containers to databases.
• Instrumenting our systems to identify performance bottlenecks and errors.
• Aggregating metrics, logs, and health checks into user-friendly dashboards and alerts.
• Leading the charge during infrastructure incidents.
• Accelerating our CI/CD system (currently a sluggish ~12 minutes—let's speed that up!).

If you share our passion for LLMs and thrive in a dynamic environment, we want you to help us push the boundaries of DevOps:
• Innovating real-time LLM evaluations to ensure the accuracy of our outputs.
• Building upon our existing infrastructure to enhance performance and reliability.

Set sail with us at Ivo, where your technical skills will help chart the course for the future of legal technology!

Mar 5, 2026
novita-ai
Full-time|On-site|San Francisco

About Us
At novita-ai, we are a rapidly growing global provider of AI cloud infrastructure, leading the charge in the artificial intelligence revolution. Our innovative platform equips developers and enterprises with powerful, scalable, and user-friendly solutions such as Model APIs, GPU Instances, and Serverless Computing. As organizations around the globe strive to integrate AI into their offerings, we serve as the essential engine that fuels their innovative efforts.

Join our world-class team and contribute to our expanding customer base. This unique opportunity allows you to be part of a dynamic company in a hyper-growth market, where your technical skills will directly impact customer success and drive our business forward.

The Role
As a Solutions Engineer, you will act as the primary technical leader and trusted advisor for our clients throughout their journey. You will collaborate closely with the sales team to bridge the gap between complex customer challenges and our sophisticated technical solutions. Your mission is to build technical credibility, demonstrate the capabilities of our platform, and design tailored solutions that empower our clients to achieve their AI-related business objectives.

What You'll Do
• Technical Discovery & Solution Design: Collaborate with Account Executives to gain a deep understanding of customer needs, technical requirements, and business goals. Develop elegant and effective solutions utilizing our AI infrastructure stack (Model APIs, GPU Instances, Serverless).
• Product Demonstration & Proof of Concept (POC): Conduct engaging, customized product demonstrations and interactive workshops. Plan, manage, and execute successful POCs, showcasing the value and performance of our platform within the client’s environment.
• Technical Evangelism & Trusted Advisory: Communicate the value proposition of our platform to diverse audiences, including both technical and non-technical stakeholders, from engineers to C-level executives. Establish yourself as the go-to expert for customers on best practices in AI infrastructure.
• Sales Enablement & Market Feedback Loop: Create and maintain technical sales materials, including whitepapers, best practice guides, and demo scripts. Serve as the voice of the customer, relaying valuable feedback from the field to our Product and Engineering teams to influence our product roadmap.
• Onboarding & Implementation Guidance: Facilitate a seamless post-sales transition by providing initial onboarding support and architectural guidance, setting customers up for sustained success.

Aug 27, 2025
Retell AI
Full-time|On-site|San Francisco Bay Area

About Retell AI
Retell AI builds voice AI technology that helps businesses transform their call center operations. In just 18 months, thousands of companies have adopted Retell’s AI voice agents to streamline sales, support, and logistics, work that once required large human teams. Backed by investors including Y Combinator and Alt Capital, Retell has grown annual recurring revenue from $5M to $36M with a focused team of 20. The company’s goal for 2026: a modern customer experience platform where AI powers entire contact centers. Retell is developing AI “workers” that can serve as frontline agents, quality assurance analysts, and managers, handling, evaluating, and improving customer interactions on their own.
• Named a top 50 AI app by a16z: https://tinyurl.com/5853dt2x
• Ranked #4 on Brex’s Fast-Growing Software Vendors of 2025: https://www.brex.com/journal/brex-benchmark-december-2025
• Featured on the Lean AI Leaderboard: https://leanaileaderboard.com/

Role Overview: Research Scientist – LLM
Retell AI is hiring a Research Scientist focused on large language models (LLMs) and audio processing. This role suits machine learning researchers who want to push the boundaries of real-time AI and see their work in production.

What You Will Do
• Investigate new approaches in large language models and audio processing for human-like voice agents
• Design and implement evaluation methods for complex, real-world conversational systems
• Prototype systems to improve reasoning, reduce latency, and enhance conversation quality
• Work closely with engineering and product teams to bring research advances into production

Impact
Research at Retell directly shapes the capabilities of voice AI agents for thousands of businesses. The work blends advanced research with practical deployment, improving how customers interact with automated systems across industries.

Location
This position is based in the San Francisco Bay Area.

Apr 14, 2026
OpenAI
Full-time|On-site|San Francisco

About Our Innovative Team
Join the Workload team at OpenAI, where we are at the forefront of designing and managing the cutting-edge infrastructure that drives the training and inference of large language models (LLMs) at an unprecedented scale. Our systems are engineered to harmonize the complex processes of model training and serving, abstracting performance, parallelism, and execution across extensive GPU and accelerator networks. This robust foundation allows researchers to concentrate on elevating model capabilities, while we take care of the scalability, efficiency, and reliability needed to bring these advanced models to life.

Your Role and Responsibilities
We are seeking a talented engineer to design and implement the dataset infrastructure that will fuel OpenAI’s next-generation training stack. Your primary focus will be on creating standardized dataset interfaces, scaling pipelines across thousands of GPUs, and proactively identifying and addressing performance bottlenecks. Collaboration with multimodal researchers and infrastructure teams will be key to ensuring that our datasets are unified, efficient, and user-friendly.

Key Responsibilities Include:
• Design and maintain standardized dataset APIs, including those for multimodal (MM) data that exceeds memory capacity.
• Develop proactive testing and validation pipelines for dataset loading at GPU scale.
• Work collaboratively to integrate datasets into training and inference pipelines, ensuring seamless user experiences.
• Document and maintain dataset interfaces to ensure they are discoverable, consistent, and easily adoptable by other teams.
• Establish validation systems to assure datasets remain reproducible and unchanged once standardized.
• Identify and troubleshoot performance bottlenecks in distributed dataset loading, such as stragglers impacting global training speed.
• Create visualization and inspection tools to highlight errors, bugs, or bottlenecks in datasets.

Ideal Candidate Profile
• Possess strong engineering fundamentals and experience in distributed systems, data pipelines, or infrastructure.
• Have a proven track record in building APIs, modular code, and scalable abstractions, with a user-centric approach to design.
• Be adept at debugging performance issues across large-scale machine fleets.
• Demonstrate a passion for advancing data infrastructure to enhance research capabilities.

Sep 18, 2025
Hyperbolic Labs
Full-time|On-site|San Francisco, CA

Join Our Mission
At Hyperbolic Labs, we are dedicated to democratizing artificial intelligence by eliminating barriers to computing power through our Open-Access AI Cloud. We aggregate global computing resources to provide an innovative GPU marketplace and AI inference service, making AI affordable and accessible for everyone. As pioneers at the crossroads of AI and open-source technology, we envision a future where AI innovation is driven by imagination, not resource limitations. We invite forward-thinking individuals who share our vision of making AI universally accessible, secure, and cost-effective to join us in crafting a platform that empowers innovators to realize their groundbreaking AI projects. As we gear up for expansion following our Series A funding, our team, led by co-founders with PhDs in AI, Mathematics, and Computer Science, is set to transform the landscape of computing.

The Role
We are on the lookout for a Senior Infrastructure Engineer to drive the development and scaling of Hyperbolic's GPU Cloud Marketplace. In this pivotal role, you will create a multi-tenancy provisioning and virtualization solution that transforms raw GPUs from diverse global suppliers into a programmable, orchestrated resource pool serving thousands of AI developers and researchers. You will work at the forefront of cloud infrastructure, building the core orchestration layer that allows our platform to deliver cost savings of up to 75% compared to traditional cloud providers.

Mar 26, 2026
Spellbrush
Full-time|On-site|San Francisco

Join Our Team as an AI Infrastructure Engineer
At Spellbrush, the premier generative AI studio behind niji・journey, we are in search of a talented AI Infrastructure Engineer to help us develop and enhance our end-to-end machine learning infrastructure, facilitating the operation of our models across a variety of platforms.

Key Responsibilities:
• Design, implement, and maintain next-generation inference architecture to optimize the performance of our models across mobile, web, and other platforms.
• Collaborate with a dynamic team focused on creating cutting-edge image generation models that serve over 16 million users globally.

Ideal Candidate Profile:
• Experience with Large Distributed Systems: You possess a strong background in working with modern technologies such as Kubernetes (K8S), Kafka, NATS, Redis, among others. Your hands-on experience spans both on-premises and multi-cloud environments, and you understand the intricacies and potential pitfalls of each system.
• Expertise in GPU Workloads: Your understanding of GPU processing for handling substantial workloads sets you apart. Having experience in deploying or optimizing GPU workloads end-to-end is a significant advantage.
• Passion for Anime Aesthetics: As avid anime enthusiasts, we value team members who share our passion for the anime aesthetic, contributing to a creative movement that engages millions.
• Team Player in Fast-Paced Environments: You thrive in small, agile teams and are eager to work alongside some of the world's top AI researchers, contributing to the best image models globally. We believe in the power of in-person collaboration, with opportunities at our offices in Tokyo (downtown Akihabara) or San Francisco. Visa sponsorships are available.

Feb 7, 2024
Retell AI
Full-time|On-site|San Francisco Bay Area

Join the Revolution at Retell AI
Retell AI is pioneering the future of call centers through innovative voice AI, driven by first-principles thinking. In just 18 months since our inception, we have empowered thousands of businesses with our AI voice agents, transforming how sales, support, and logistics calls are managed—previously requiring extensive human teams. Supported by prestigious investors such as Y Combinator and Alt Capital, we've rapidly scaled from $5M ARR to an impressive $36M ARR with a compact yet dynamic team of 20.

Our ambition for 2026 is to create a revolutionary customer experience platform, where entire contact centers are powered by AI. Moving beyond basic automation, we aim to develop intelligent AI “workers” that serve as frontline agents, QA analysts, and managers, continuously enhancing customer interactions without the need for constant human oversight.

As we expand, we are seeking passionate engineers who are eager to solve challenging technical problems, act swiftly, and make a significant impact in one of the fastest-growing voice AI startups. Let’s shape the future together.

Aug 12, 2025
Scale AI
Full-time|$138K/yr - $259.4K/yr|On-site|San Francisco, CA; St. Louis, MO; New York, NY; Washington, DC

Scale AI is on the lookout for an exceptionally talented and driven Software Engineer, Frontier AI Infrastructure to become an integral part of our innovative Public Sector Engineering team. In this role, you will take charge of the model inference layer, enabling cutting-edge AI models, troubleshooting the latest AI tools, managing networking tasks, addressing latency issues, and monitoring pricing and usage metrics for AI models. You will spearhead technical discussions with cloud vendors and clients to fulfill critical contracts and resolve platform challenges. Additionally, you will collaborate closely with Product teams to anticipate feature requirements, transitioning from reactive 'infra-only debugging' to proactive integration testing.

Your Responsibilities Include:
• Designing and implementing secure, scalable backend systems tailored for Public Sector clients, utilizing Scale's advanced cloud-native AI infrastructure.
• Owning services or systems while defining long-term health objectives and enhancing the health of related components.
• Redesigning the architecture to operate in compliant or restrictive environments, which entails creating swappable components (authentication, storage, logging) to adhere to government and security regulations without compromising product integrity.
• Collaborating with Product teams to develop integration tests that identify issues early, shifting focus from 'infra-only debugging' to preventing upstream failures.
• Actively participating in customer engagements, liaising with stakeholders to comprehend requirements and deliver innovative solutions.
• Contributing to the platform roadmap and product strategy for Scale AI's Public Sector division, playing a vital role in shaping the future trajectory of our offerings.

Mar 26, 2026
TRM Labs
Full-time|$200K/yr - $240K/yr|On-site|San Francisco, CA

Join Us in Building a Safer World.
At TRM Labs, we specialize in blockchain analytics and AI solutions aimed at assisting law enforcement, national security agencies, financial institutions, and cryptocurrency businesses in identifying, investigating, and preventing crypto-related fraud and financial crime. Our innovative platforms leverage blockchain intelligence and AI technology to trace funds, detect illicit activity, and construct comprehensive threat profiles. Trusted by leading organizations worldwide, TRM Labs is committed to enabling a safer and more secure environment for all.

Our AI Engineering Team is dedicated to pioneering next-generation AI applications, particularly in the realm of Large Language Models (LLMs) and agentic systems. Our goal is to develop resilient pipelines and high-performance infrastructure that facilitate the swift, safe, and scalable deployment of AI systems. We manage extensive petabyte-scale pipelines, ensuring model serving with millisecond latency while providing the necessary observability and governance to make AI production-ready. Our team actively evaluates and integrates leading-edge tools in the LLM and agent space, including open-source stacks, vector databases, evaluation frameworks, and orchestration tools to accelerate TRM’s innovation pace.

As a Senior or Staff ML Systems Engineer – LLM, you will play a pivotal role in constructing and scaling our technical infrastructure for AI/ML systems. Your responsibilities will include:
• Creating reusable CI/CD workflows for model training, evaluation, and deployment, integrating tools such as Langfuse, GitHub Actions, and experiment tracking.
• Automating model versioning, approval processes, and compliance checks across various environments.
• Developing a modular and scalable AI infrastructure stack that encompasses vector databases, feature stores, model registries, and observability tools.
• Collaborating with engineering and data science teams to embed AI models and agents into real-time applications and workflows.
• Continuously assessing and incorporating state-of-the-art AI tools (e.g., LangChain, LlamaIndex, vLLM, MLflow, BentoML).
• Promoting AI reliability and governance while enabling experimentation, ensuring compliance, security, and continuous uptime.
• Enhancing AI/ML model performance and ensuring data accuracy and consistency, leading to improved model training and inference.
• Implementing infrastructure to facilitate both offline and online evaluation of LLMs and agents.

Mar 12, 2026
Andromeda Cluster
Full-time|Remote|North America Remote / San Francisco, CA

Join Our Team as a Software Engineer - AI Infrastructure
Location: North America Remote / San Francisco · Full-Time

At Andromeda Cluster, we are dedicated to democratizing access to advanced AI infrastructure that was once only available to hyperscalers. Founded by industry leaders Nat Friedman and Daniel Gross, we have evolved from a singular managed cluster to a global platform that connects top AI labs, data centers, and cloud providers around the world. Our orchestration layer efficiently manages training and inference tasks globally, enhancing flexibility and efficiency in this rapidly expanding sector. We aim to create a global marketplace for AI computing, empowering AGI with the same fluidity as global financial markets. As we continue to grow, we are on the lookout for talented individuals in the fields of AI infrastructure, research, and engineering.

Your Role
In the position of Infrastructure Product Engineer, you will be integral in constructing the foundational framework of Andromeda’s platform. Your challenge will be to simplify complex, real-world infrastructure issues into scalable product solutions that our customers will benefit from.

Key Responsibilities
• Architect and develop essential platform components, focusing on infrastructure orchestration, provisioning, and lifecycle management solutions.
• Create robust APIs, services, and control planes that abstract diverse infrastructure types, including VMs, Kubernetes, bare metal, and schedulers.
• Convert customer usage patterns into actionable product requirements, delivering impactful features and enhancements.
• Design automation and internal tools to mitigate manual and ad-hoc operational tasks.
• Improve platform reliability, performance, and observability, focusing on sustainable enhancements rather than quick fixes.
• Collaborate with other teams to establish clear ownership boundaries between platform features and customer-specific solutions.
• Write clean, maintainable, and well-documented code with a focus on long-term sustainability.
• Engage in technical design discussions and contribute to the architectural advancements of our platform.

Feb 18, 2026
Genesis Molecular AI
Full-time|On-site|NYC or SF Bay Area

Genesis Molecular AI is building the GEMS molecular AI platform, driving advances in foundation model training and industrial screening. Strategic partnerships and a strong compute infrastructure are central to the company’s growth and mission.

Role Overview
The Director of AI Infrastructure Partnerships will lead efforts to secure and manage critical technology alliances, investments, and compute resources. This leader will work closely with top AI organizations, hardware providers, and investors, including firms like a16z and NVIDIA, to support Genesis’s technical and business goals. The role is based in either New York City or the San Francisco Bay Area.

What You Will Do
• Oversee partnerships with NVIDIA and identify new opportunities with leading AI organizations.
• Structure contracts, equity deals, technical collaborations, co-publications, and data-sharing agreements for both public and proprietary experimental and synthetic data.
• Create presentations and written materials that clearly communicate Genesis’s platform vision and technical strengths to partners and investors, and integrate these messages into broader external communications.
• Serve as the business lead and chief negotiator for major cloud computing and AI infrastructure deals. Secure high-performance compute at competitive rates and maintain strong relationships with key partners.
• Monitor the AI compute market, evaluating providers for cost, reliability, and availability to support research and deployment needs.
• Work with ML Engineering to forecast compute requirements for model training, synthetic data generation, fine-tuning, and large-scale inference.
• Optimize performance and budget across multiple cloud environments and track usage to maximize value.
• Manage the internal budgeting process for compute spend. Translate technical needs into financial forecasts and present capital allocation recommendations to company leadership.

What We’re Looking For
• Significant experience in AI and cloud computing, including managing high-value negotiations and partnerships.
• Strong analytical and strategic skills, with the ability to assess market trends and make informed decisions.
• Excellent communication and interpersonal abilities, comfortable explaining complex topics to a range of audiences.

Apr 15, 2026
Software Apps Inc.
Full-time|Hybrid|San Francisco

About Us
At Software Apps Inc., we are pioneering the field of natural-language computing with our flagship product, Sky, designed specifically for Mac users. Our team is driven by a shared commitment to innovation, collaboration, and excellence. To learn more about our mission and values, visit www.software.inc/jobs.

The Role
We are seeking a talented LLM Engineer to be a vital contributor to our product development. In this role, you will design and optimize our data pipelines, refine our system architecture, and implement evaluation mechanisms. Your expertise will guide strategic decisions, balancing ambition with practicality, while you manage the iterative process of fine-tuning, evaluation, and deployment.

Your Daily Responsibilities Will Include...
• Innovating Exceptional Software. Your enthusiasm and insight will transform visionary concepts into actionable strategies, even if it sometimes means taking risks without a clear path ahead. Your capacity to learn and adapt is more crucial than existing knowledge.
• Taking Full Ownership of Projects. You will be the driving force behind the success of your systems and features, demonstrating a commitment to delivering results. Your proactive approach will ensure continual improvement through feedback and quality enhancements.
• Influencing Architectural Decisions. Leverage your familiarity with the latest model architectures to create robust systems for data collection, training, and inference. Your role will include optimizing model performance while carefully considering user privacy and effective data gathering.
• Thinking Both Broadly and Precisely. You recognize that your infrastructure choices significantly affect the end-user experience. With a focus on performance, you will develop large-scale models for cloud inference or finely tuned on-device models that prioritize efficiency.

May 14, 2025
TRM Labs
Full-time|$200K/yr - $240K/yr|On-site|San Francisco, CA

Contribute to a Safer Future.
TRM Labs is at the forefront of blockchain analytics and AI technology, empowering law enforcement, financial institutions, and cryptocurrency enterprises to identify and combat cryptocurrency-related fraud and financial crime. Our innovative blockchain intelligence and AI tools are designed to trace fund flows, pinpoint illicit activities, build comprehensive cases, and provide actionable insights into potential threats. Trusted by prominent agencies and organizations globally, TRM is committed to fostering a safer and more secure environment for everyone.

Join our dynamic AI Engineering Team, dedicated to pioneering next-generation AI applications, with a particular emphasis on Large Language Models (LLMs) and agent-based systems. Our objective is to create efficient pipelines, high-caliber infrastructure, and operational tools that facilitate the rapid, safe, and scalable deployment of AI systems. We oversee petabyte-scale data pipelines, deliver models with millisecond latency, and ensure the observability and governance necessary to make AI production-ready. Our team actively evaluates and integrates cutting-edge technologies in the LLM and agent domains, utilizing open-source stacks, vector databases, evaluation frameworks, and orchestration tools that enhance TRM’s agility and innovation capacity.

As a Senior or Staff AI Infrastructure Engineer, you will play a pivotal role in constructing and scaling the technical framework for AI and ML systems. Your responsibilities will include:
• Developing reusable CI/CD workflows for model training, evaluation, and deployment, integrating tools like Langfuse, GitHub Actions, and experiment tracking systems.
• Automating model versioning, approval workflows, and compliance checks across various environments.
• Building a modular and scalable AI infrastructure stack, encompassing vector databases, feature stores, model registries, and observability tools.
• Collaborating with engineering and data science teams to embed AI models and agents into real-time applications and workflows.
• Continuously assessing and integrating state-of-the-art AI tools (e.g., LangChain, LlamaIndex, vLLM, MLflow, BentoML).
• Driving AI reliability and governance, facilitating experimentation while ensuring compliance, security, and uptime.
• Enhancing the performance of AI and ML models, and ensuring data accuracy, consistency, and reliability for improved model training and inference.
• Deploying infrastructure to support both offline and online evaluations of LLMs and agents.

Mar 12, 2026
Similarweb
Full-time|$125K/yr - $175K/yr|On-site|San Francisco, CA

At Similarweb, we are transforming the way businesses engage with the digital landscape by providing comprehensive insights into online activities. Our innovative data solutions empower over 6,000 global clients, including major players like Google, eBay, and Adidas, enabling them to make pivotal decisions that enhance their digital strategies. Since going public on the New York Stock Exchange in 2021, we have continued to achieve remarkable growth! Join a team of bright, inquisitive, and practical individuals who are passionate about making a difference. We are seeking a Strategic Sales Manager to further expand Similarweb’s presence by strengthening and growing our most valuable global accounts. This position reports directly to our Vice President of Strategic Sales & Account Management.

Why This Role is Important
Similarweb’s digital intelligence platform supports thousands of organizations around the globe, and we are merely scratching the surface of our addressable market. As a Strategic Sales Manager, you will own the entire sales cycle and cultivate essential relationships across a portfolio of Fortune 500 and other premier enterprises. With our leading product, robust brand momentum, and an exceptionally supportive team, you will be positioned to consistently exceed your quotas and make a significant impact on our revenue growth.

Jan 9, 2026
Scale AI
Full-time|$216.2K/yr - $270.3K/yr|On-site|San Francisco, CA; New York, NY

Join Scale AI's innovative team as an Infrastructure Software Engineer for our Enterprise Generative AI Platform (SGP). In this dynamic role, you will help design and enhance our enterprise-grade AI platform, which offers robust APIs for knowledge retrieval, inference, evaluation, and more. We're seeking an exceptional engineer who thrives in fast-paced environments and is eager to contribute to the scaling of our core infrastructure. The ideal candidate will possess a solid foundation in software engineering principles and extensive experience with large-scale distributed systems. Your role will involve implementing solutions across various cloud providers (GCP, Azure, AWS) for clients in highly regulated sectors, including healthcare, telecommunications, finance, and retail.

Mar 26, 2026
Eventual Computing
Full-time|On-site|San Francisco

About Eventual
At Eventual, we are reimagining how AI applications process vast amounts of data, from images to complex datasets. Traditional data platforms are not equipped to handle the petabytes of multimodal data essential for AI, causing teams to struggle with inadequate infrastructure. Founded in 2022, our mission is to simplify data querying, making it as intuitive as working with tables while ensuring scalability for production workloads.

Our open-source engine, Daft, is specifically designed for real-world AI systems. It efficiently manages external APIs and GPU clusters, and addresses failures that traditional engines cannot handle. Daft is already integral to operations at leading companies such as Amazon, Mobileye, Together AI, and CloudKitchens.

We pride ourselves on our exceptional team, which includes talents from Databricks, AWS, Nvidia, Pinecone, GitHub Copilot, Tesla, and others. We have quadrupled our team size in just a year, supported by Series A and seed funding from notable investors like Felicis, CRV, Microsoft M12, and Y Combinator. We are now eager to expand further. Join us—Eventual is just getting started. We are seeking passionate individuals who are excited to collaborate in a close-knit team environment, working together four days a week in our San Francisco Mission district office.

Your Role
As a Software Engineer, you will take charge of developing Eventual's core products and architecture. You’ll deliver features that our customers will use immediately and collaborate with a dedicated team that values open communication and cross-functional teamwork. Our fast-paced environment is focused on solving a variety of complex technical and product challenges. While our experienced team is here to provide guidance and mentorship, we appreciate engineers who can independently identify and tackle challenging technical issues.

Key Responsibilities
• Design and develop highly reliable and resilient products and features.
• Collaborate closely with cross-functional product and customer-facing teams to understand requirements and deliver thoughtful solutions.
• Write high-quality, extensible, and maintainable code.
• Create and build scalable applications and components.
• Architect and manage Kubernetes clusters optimized for our needs.

Sep 22, 2025
Apply
Whatnot logo
Full-time|On-site|San Francisco, CA

Join Whatnot as an LLM Platform Engineer where you'll be at the forefront of developing and optimizing cutting-edge language models. In this role, you will collaborate with a dynamic team of engineers and data scientists to enhance our machine learning infrastructure and algorithms. Your contributions will directly impact the efficiency and effectiveness of our language understanding capabilities.

Mar 3, 2026
Apply
Andromeda Cluster logo
Full-time|Remote|Global Remote / San Francisco, CA

Site Reliability Engineer - AI Infrastructure
Location: Global Remote / San Francisco · Full-Time

About Andromeda
Andromeda Cluster, established by Nat Friedman and Daniel Gross, aims to democratize access to advanced AI infrastructure for early-stage startups, previously exclusive to hyperscalers. Our journey began with a single managed cluster that quickly reached capacity, propelling us to develop robust systems, networking, and orchestration layers to make AI infrastructure more accessible than ever.

Today, we collaborate with top AI laboratories, data centers, and cloud service providers to deliver compute resources precisely when and where they are needed most. Our platform efficiently routes training and inference jobs across a global supply chain, enabling flexibility and efficiency in one of the most rapidly expanding markets worldwide.

Our vision is to create a liquidity layer for global AI compute: a marketplace that dynamically moves the infrastructure and workloads essential for AGI, akin to the capital flows in global financial markets.

We are on the lookout for talented individuals who excel in AI infrastructure, research, and engineering to join our pioneering team.

Your Responsibilities
• Provision, configure, and manage Kubernetes clusters for clients across various service providers.
• Develop automation tools to streamline cluster deployment and integration.
• Troubleshoot customer issues related to networking, storage, scheduling, and system layers.
• Enhance the reliability and scalability of training and inference infrastructure.
• Design and implement monitoring, alerting, and observability solutions for critical systems.
• Work collaboratively with engineering and product teams to plan and deliver infrastructure for new services.
• Participate in on-call duties and incident response, leading postmortems and reliability improvements.

Ideal Candidate Profile
• A minimum of 5 years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles.
• A solid foundation in Linux systems and networking principles.
• Extensive expertise in Kubernetes and container orchestration at scale.
• Proficiency with Infrastructure-as-Code tooling (Terraform, Helm, etc.).

Nov 6, 2025
