Software Engineer, Inference Deployment

AnthropicSan Francisco, CA | New York City, NY | Seattle, WA

On-site Full-time

Clicking Apply Now takes you to AutoApply where you can tailor your resume and apply.

Unlock Your Potential

Generate Job-Optimized Resume

One Click And Our AI Optimizes Your Resume to Match The Job Description.

Is Your Resume Optimized For This Role?

Find Out If You're Highlighting The Right Skills And Fix What's Missing

Qualifications

The ideal candidate will possess a strong background in software engineering with experience in building large-scale deployment systems. Proficiency in orchestration tools and cloud infrastructure (GPUs, TPUs, Trainium) is essential. Familiarity with pipeline architectures, deployment scheduling, and observability tools is required. You should demonstrate strong problem-solving skills and the ability to work collaboratively within a fast-paced environment. A commitment to continuous improvement and a passion for machine learning are highly valued.

About the job

Join Anthropic as a Software Engineer on our Launch Engineering team, where your focus will be on designing and building cutting-edge deployment infrastructure for inference code. You will ensure our AI models, at scale, are continuously and seamlessly deployed to production. This role is pivotal in optimizing resource management while maximizing deployment efficiency. Your expertise will be essential in navigating complex deployment challenges, validating systems, and ensuring minimal disruption to our user services. If you thrive in tackling ambitious problems at the intersection of automation and resource management, this position offers the opportunity to make a significant impact.

About Anthropic

At Anthropic, our mission is to develop AI systems that are safe, interpretable, and controllable. We are a rapidly growing team of dedicated researchers, engineers, policy experts, and business leaders united in our goal to build AI that benefits both users and society. Join us in our innovative and inclusive work environment, where we prioritize safety, reliability, and ethical practices in AI development.

Similar jobs

1 - 20 of 5,501 Jobs

Search for Software Engineer Inference Multi Modal

5,501 results

Select all on this page (20)

Apply

Software Engineer, Inference - Multi Modal

OpenAI

Full-time|On-site|San Francisco

About Our TeamJoin OpenAI’s dynamic Inference team, where we empower the deployment of cutting-edge AI models, including our renowned GPT models, advanced Image Generation capabilities, and Whisper, across diverse platforms. Our mission is to ensure these models are not only high-performing and scalable but also available for real-world applications. Collaborating closely with our Research team, we’re committed to bringing the next generation of AI innovations to fruition. As a compact, agile team, we prioritize delivering an exceptional developer experience while continuously pushing the frontiers of artificial intelligence.As we expand our focus into multimodal inference, we are building the necessary infrastructure to support models that process images, audio, and other non-text modalities. This work involves tackling diverse model sizes and interactions, managing complex input/output formats, and ensuring seamless collaboration between product and research teams.About The RoleWe are seeking a passionate Software Engineer to aid in the large-scale deployment of OpenAI’s multimodal models. You will join a small yet impactful team dedicated to creating robust, high-performance infrastructure for real-time audio, image, and various multimodal workloads in production environments.This position is inherently collaborative; you will work directly with researchers who develop these models and with product teams to define novel interaction modalities. Your contributions will enable users to generate speech, interpret images, and engage with models in innovative ways that extend beyond traditional text-based interactions.Key Responsibilities:Design and implement advanced inference infrastructure for large-scale multimodal models.Optimize systems for high-throughput and low-latency processing of image and audio inputs and outputs.Facilitate the transition of experimental research workflows into dependable production services.Engage closely with researchers, infrastructure teams, and product engineers to deploy state-of-the-art capabilities.Contribute to systemic enhancements, including GPU utilization, tensor parallelism, and hardware abstraction layers.You May Excel In This Role If You:Have a proven track record of building and scaling inference systems for large language models or multimodal architectures.Possess experience with GPU-based machine learning workloads and a solid understanding of the performance dynamics associated with large models, particularly with intricate data types like images or audio.Thrive in a fast-paced, experimental environment and enjoy collaborating with cross-functional teams to drive impactful results.

May 21, 2025

Apply

Solutions Architect at Modal | San Francisco

Modal

Full-time|On-site|San Francisco

About Us:Modal is at the forefront of AI infrastructure. We provide seamless access to GPUs, quick container startups, and integrated storage solutions, simplifying the process of model training, batch job execution, and low-latency inference. Leading companies such as Suno, Lovable, and Substack trust Modal to transition from prototypes to full-scale production without the complexities of infrastructure management.Our rapidly expanding team operates out of NYC, San Francisco, and Stockholm. With a remarkable 9-figure annual recurring revenue (ARR), we recently achieved a valuation of $1.1 billion after our successful Series B funding round. Thousands of customers, including Lovable, Scale AI, Substack, and Suno, depend on us for their AI workload needs.Joining Modal means becoming part of a dynamic, rapidly growing AI infrastructure company with substantial opportunities for personal and professional advancement. Our team comprises creators of well-known open-source projects (e.g., Seaborn, Luigi), academic researchers, international competition medalists, and seasoned engineering and product leaders with extensive experience.The Role:We are seeking a strategic Solutions Architect to influence technical strategies across our key enterprise accounts.In this position, you will serve as the technical partner to Enterprise Account Executives, managing intricate evaluations, developing infrastructure modernization strategies, and promoting multi-product adoption across AI and ML applications.This is a strategic, consultative role that demands a robust architectural background, a strong executive presence, and the ability to impact major infrastructure decisions worth millions. You will collaborate directly with CTOs, VPs of Engineering, and leaders of ML platforms to redefine the construction and operation of AI infrastructure.If you excel in fast-paced technical sales contexts and are eager to shape the infrastructure that drives modern AI innovations, we would love to hear from you.

Feb 20, 2026

Apply

Software Engineer, Inference

Pulse

Full-time|On-site|San Francisco

OverviewAt Pulse, we are revolutionizing the way data infrastructure operates by addressing the critical challenge of accurately extracting structured information from intricate documents on a large scale. Our innovative document understanding technique merges intelligent schema mapping with advanced extraction models, outperforming traditional OCR and parsing methods.Located in the heart of San Francisco, we are a dynamic team of engineers dedicated to empowering Fortune 100 enterprises, YC startups, public investment firms, and growth-stage companies. Backed by top-tier investors, we are rapidly expanding our footprint in the industry.What sets our technology apart is our sophisticated multi-stage architecture, which includes:Specialized models for layout understanding and component detectionLow-latency OCR models designed for precise extractionAdvanced algorithms for reading-order in complex document structuresProprietary methods for table structure recognition and parsingFine-tuned vision-language models for interpreting charts, tables, and figuresIf you possess a strong passion for the convergence of computer vision, natural language processing, and data infrastructure, your contributions at Pulse will significantly impact our clients and help shape the future of document intelligence.

Jul 30, 2025

Apply

Software Engineer, Model Inference

OpenAI

Full-time|On-site|San Francisco

About Our TeamJoin the Inference team at OpenAI, where we leverage cutting-edge research and technology to deliver exceptional AI products to consumers, enterprises, and developers. Our mission is to empower users to harness the full potential of our advanced AI models, enabling unprecedented capabilities. We prioritize efficient and high-performance model inference while accelerating research advancements.About the RoleWe are seeking a passionate Software Engineer to optimize some of the world's largest and most sophisticated AI models for deployment in high-volume, low-latency, and highly available production and research environments.Key ResponsibilitiesCollaborate with machine learning researchers, engineers, and product managers to transition our latest technologies into production.Work closely with researchers to enable advanced research initiatives through innovative engineering solutions.Implement new techniques, tools, and architectures that enhance the performance, latency, throughput, and effectiveness of our model inference stack.Develop tools to identify bottlenecks and instability sources, designing and implementing solutions for priority issues.Optimize our code and Azure VM fleet to maximize every FLOP and GB of GPU RAM available.You Will Excel in This Role If You:Possess a solid understanding of modern machine learning architectures and an intuitive grasp of performance optimization strategies, especially for inference.Take ownership of problems end-to-end, demonstrating a willingness to acquire any necessary knowledge to achieve results.Bring at least 5 years of professional software engineering experience.Have or can quickly develop expertise in PyTorch, NVidia GPUs, and relevant optimization software stacks (such as NCCL, CUDA), along with HPC technologies like InfiniBand, MPI, and NVLink.Have experience in architecting, building, monitoring, and debugging production distributed systems, with bonus points for working on performance-critical systems.Have successfully rebuilt or significantly refactored production systems multiple times to accommodate rapid scaling.Are self-driven, enjoying the challenge of identifying and addressing the most critical problems.

Feb 6, 2025

Apply

Software Engineer, Inference – AMD GPU Enablement

OpenAI

Full-time|On-site|San Francisco

About Our TeamThe Inference team at OpenAI is dedicated to translating our cutting-edge research into accessible, transformative technology for consumers, enterprises, and developers. By leveraging our advanced AI models, we enable users to achieve unprecedented levels of innovation and productivity. Our primary focus lies in enhancing model inference efficiency and accelerating progress in research through optimized inference capabilities.About the RoleWe are seeking talented engineers to expand and optimize OpenAI's inference infrastructure, specifically targeting emerging GPU platforms. This role encompasses a wide range of responsibilities from low-level kernel optimization to high-level distributed execution. You will collaborate closely with our research, infrastructure, and performance teams to ensure seamless operation of our largest models on cutting-edge hardware.This position offers a unique opportunity to influence and advance OpenAI’s multi-platform inference capabilities, with a strong emphasis on optimizing performance for AMD accelerators.Your Responsibilities Include:Overseeing the deployment, accuracy, and performance of the OpenAI inference stack on AMD hardware.Integrating our internal model-serving infrastructure (e.g., vLLM, Triton) into diverse GPU-backed systems.Debugging and optimizing distributed inference workloads across memory, network, and compute layers.Validating the correctness, performance, and scalability of model execution on extensive GPU clusters.Collaborating with partner teams to design and optimize high-performance GPU kernels for accelerators utilizing HIP, Triton, or other performance-centric frameworks.Working with partner teams to develop, integrate, and fine-tune collective communication libraries (e.g., RCCL) to parallelize model execution across multiple GPUs.Ideal Candidates Will:Possess experience in writing or porting GPU kernels using HIP, CUDA, or Triton, with a strong focus on low-level performance.Be familiar with communication libraries like NCCL/RCCL, understanding their importance in high-throughput model serving.Have experience with distributed inference systems and be adept at scaling models across multiple accelerators.Enjoy tackling end-to-end performance challenges across hardware, system libraries, and orchestration layers.Be eager to join a dynamic, agile team focused on building innovative infrastructure from the ground up.

Oct 8, 2025

Apply

Software Engineer for Cloud Inference Safeguards

Anthropic

Full-time|Remote|San Francisco, CA | Seattle, WA

Join our innovative team at Anthropic as a Software Engineer specializing in Cloud Inference Safeguards. In this role, you will play a crucial part in developing and enhancing the systems that ensure the robustness and security of our cloud-based inference services. You will collaborate with cross-functional teams to design, implement, and maintain scalable solutions that meet our high standards for reliability and performance.

Mar 27, 2026

Apply

Lead Staff Software Engineer - GenAI Inference

Databricks

Full-time|$190.9K/yr - $232.8K/yr|On-site|San Francisco, California

P-1285 About This Role Join Databricks as a Staff Software Engineer specializing in GenAI inference, where you will spearhead the architecture, development, and optimization of the inference engine that powers the Databricks Foundation Model API. Your role will be crucial in bridging cutting-edge research with real-world production requirements, ensuring exceptional throughput, minimal latency, and scalable solutions. You will work across the entire GenAI inference stack, including kernels, runtimes, orchestration, memory management, and integration with various frameworks and orchestration systems. What You Will Do Take full ownership of the architecture, design, and implementation of the inference engine, collaborating on a model-serving stack optimized for large-scale LLM inference. Work closely with researchers to integrate new model architectures or features, such as sparsity, activation compression, and mixture-of-experts into the engine. Lead comprehensive optimization efforts focused on latency, throughput, memory efficiency, and hardware utilization across GPUs and other accelerators. Establish and uphold standards for building and maintaining instrumentation, profiling, and tracing tools to identify performance bottlenecks and drive optimizations. Design scalable solutions for routing, batching, scheduling, memory management, and dynamic loading tailored to inference workloads. Guarantee reliability, reproducibility, and fault tolerance in inference pipelines, including capabilities for A/B testing, rollbacks, and model versioning. Collaborate cross-functionally to integrate with federated and distributed inference infrastructure, ensuring effective orchestration across nodes, load balancing, and minimizing communication overhead. Foster collaboration with cross-functional teams, including platform engineers, cloud infrastructure, and security/compliance professionals. Represent the team externally through benchmarks, whitepapers, and contributions to open-source projects. What We Look For A BS/MS/PhD in Computer Science or a related discipline. A solid software engineering background with 6+ years of experience in performance-critical systems. A proven ability to own complex system components and influence architectural decisions from conception to execution. A deep understanding of ML inference internals, including attention mechanisms, MLPs, recurrent modules, quantization, and sparse operations. Hands-on experience with CUDA, GPU programming, and essential libraries (cuBLAS, cuDNN, NCCL, etc.). A strong foundation in distributed systems design, including RPC frameworks, queuing, RPC batching, sharding, and memory partitioning. Demonstrated proficiency in diagnosing and resolving performance bottlenecks across multiple layers (kernel, memory, networking, scheduler).

Jan 30, 2026

Apply

Software Engineer - Inference Platform at Fluidstack | San Francisco

Fluidstack

Full-time|$165K/yr - $500K/yr|On-site|San Francisco, CA

Join the Fluidstack TeamAt Fluidstack, we’re pioneering the infrastructure for advanced intelligence. We collaborate with leading AI laboratories, governmental entities, and major corporations—including Mistral, Poolside, and Meta—to deliver computing solutions at unprecedented speeds.Our mission is to transform the vision of Artificial General Intelligence (AGI) into a reality. Driven by our purpose, our dedicated team is committed to building state-of-the-art infrastructure that prioritizes our customers' success. If you share our passion for excellence and are eager to contribute to the future of intelligence, we invite you to be part of our journey.Role OverviewThe Inference Platform team at Fluidstack is at the forefront of addressing the cost and latency challenges associated with frontier AI. You will play a crucial role in managing the serving layer that connects our global accelerator supply with the production workloads of our clients, which include LLM serving frameworks, KV cache infrastructure, and Kubernetes orchestration across multiple data centers.This hands-on individual contributor role combines elements of distributed systems, model optimization, and serving infrastructure. You will oversee the entire lifecycle of inference deployments for leading AI labs, striving for enhancements in throughput, cost-efficiency, and response times, while also influencing the architectural decisions that guide Fluidstack’s deployment strategies.

Mar 5, 2026

Apply

Software Engineer - GPU Inference at Baseten | San Francisco

Baseten

Full-time|On-site|San Francisco

Baseten develops infrastructure and tools that help AI companies deploy and scale inference. Teams at organizations like Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer rely on Baseten to bring advanced machine learning models into production. The company recently secured a $300M Series E from investors including BOND, IVP, Spark Capital, Greylock, and Conviction. Role overview This Software Engineer - GPU Inference position joins the founding team for Baseten Voice AI in San Francisco. The team focuses on building production-ready Voice AI systems, bringing open-source voice models into real-world use for clients in productivity, customer service, healthcare conversations, and education. The work shapes how people interact with technology through voice, creating broad impact across industries. In this role, the engineer leads the internal inference stack that powers Voice AI models. Responsibilities include guiding the product roadmap and driving engineering execution. Collaboration is a key part of the job, working closely with Forward Deployed Engineers, Model Performance Engineers, and other technical groups to advance Voice AI capabilities. Sample projects and initiatives The world's fastest Whisper, with streaming and diarization Canopy Labs selects Baseten for Orpheus TTS inference Partnering with the Core Product team to build an orchestration framework for a multi-model voice agent Working with the Training Platform team to support continuous training of voice models Designing a developer-friendly API and SDK for self-service adoption of Baseten Voice AI products

Apr 26, 2026

Apply

Applied AI Inference Engineer

Baseten

Full-time|$300K/yr - $300K/yr|On-site|San Francisco

ABOUT BASETENAt Baseten, we empower the leading AI companies of today, including Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer, by providing essential inference capabilities. Our unique blend of applied AI research, adaptable infrastructure, and intuitive developer tools enables innovators at the cutting edge of AI to seamlessly transition advanced models into production. With our recent success in securing a $300M Series E funding round, backed by notable investors such as BOND, IVP, Spark Capital, Greylock, and Conviction, we're on an exciting growth trajectory. Join our team and contribute to the platform that engineers rely on to launch AI-driven products.THE ROLEAs an Applied AI Inference Engineer at Baseten, you'll collaborate closely with clients to design, develop, and implement high-performance AI applications using our platform. You will guide customers through the entire process, from initial concept to deployment, transforming vague business objectives into dependable, observable solutions that meet defined quality, latency, and cost metrics.This position is ideal for innovative engineers eager to gain insight into how modern organizations scale AI adoption. You will thrive if you enjoy a multifaceted role that intersects product development, software engineering, performance optimization, and direct customer engagement.It’s essential to note that this position requires hands-on coding and software development, while also encompassing elements of product management, technical customer success, and pre-sales engineering.EXAMPLE INITIATIVESExplore insights from our Forward Deployed Engineering team through these blog posts: Forward Deployed Engineering on the frontier of AIThe fastest, most accurate Whisper transcriptionDeploy production-ready model servers from Docker imagesDeploy custom ComfyUI workflows as APIs...

Nov 4, 2025

Apply

Software Engineer, Inference Deployment

Anthropic

On-site|On-site|San Francisco, CA | New York City, NY | Seattle, WA

Feb 10, 2026

Apply

Technical Staff Member - Inference Engineering

Inferact

Full-time|$200K/yr - $400K/yr|Remote|San Francisco

At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, revolutionizing AI progress by making inference both more accessible and efficient. Our founding team consists of the original creators and key maintainers of vLLM, positioning us uniquely at the nexus of cutting-edge models and advanced hardware.Role OverviewWe are seeking a passionate inference runtime engineer eager to explore and expand the frontiers of LLM and diffusion model serving. As models evolve and grow in complexity with new architectures like mixture-of-experts and multimodal designs, the demand for innovative solutions in our inference engine intensifies. This role places you at the heart of vLLM, where you will enhance model execution across a variety of hardware platforms and architectures. Your contributions will have a direct influence on the future of AI inference.

Jan 22, 2026

Apply

Staff Software Engineer, Inference Infrastructure

Cohere

Full-Time|On-site|San Francisco

Who are we?At Cohere, our mission is to elevate intelligence to benefit humanity. We specialize in training and deploying cutting-edge models for developers and enterprises focused on creating AI systems that deliver extraordinary experiences such as content generation, semantic search, retrieval-augmented generation, and intelligent agents. We view our work as pivotal to the broad acceptance of AI technologies.We are passionate about our creations. Every team member plays a vital role in enhancing our models' capabilities and the value they provide to our customers. We thrive on hard work and speed, always prioritizing our clients' needs.Cohere is a diverse team of researchers, engineers, designers, and more, all dedicated to their craft. Each individual is a leading expert in their field, and we recognize that a variety of perspectives is essential to developing exceptional products.Join us in our mission and help shape the future of AI!Why this role?Are you excited about architecting high-performance, scalable, and reliable machine learning systems? Do you aspire to shape and construct the next generation of AI platforms that enhance advanced NLP applications? We are seeking talented Members of Technical Staff to join our Model Serving team at Cohere. This team is responsible for the development, deployment, and operation of our AI platform, which delivers Cohere's large language models via user-friendly API endpoints. In this role, you will collaborate with multiple teams to deploy optimized NLP models in production settings characterized by low latency, high throughput, and robust availability. Additionally, you will have the opportunity to work directly with customers to create tailored deployments that fulfill their unique requirements.

Jan 12, 2026

Apply

Software Engineer - GenAI Inference at Databricks | San Francisco

Databricks

Full-time|$142.2K/yr - $204.6K/yr|On-site|San Francisco, California

About This Role Join Databricks as a Software Engineer focused on GenAI inference, where you will play a pivotal role in designing, developing, and enhancing the inference engine that drives our Foundation Model API. Collaborating at the intersection of research and production, you will ensure our large language model (LLM) serving systems are optimized for speed, scalability, and efficiency. Your contributions will span the entire GenAI inference stack, from kernels and runtimes to orchestration and memory management. What You Will Do Participate in the design and implementation of the inference engine, collaborating on a model-serving stack tailored for large-scale LLM inference. Work closely with researchers to integrate new model architectures or features such as sparsity, activation compression, and mixture-of-experts into the engine. Optimize latency, throughput, memory efficiency, and hardware utilization across GPUs and other accelerators. Build and maintain tools for instrumentation, profiling, and tracing to identify bottlenecks and inform optimization efforts. Develop scalable routing, batching, scheduling, memory management, and dynamic loading mechanisms for inference workloads. Ensure reliability, reproducibility, and fault tolerance in inference pipelines, including A/B launches, rollback, and model versioning. Integrate with federated and distributed inference infrastructure, orchestrating across nodes, balancing load, and managing communication overhead. Engage in cross-functional collaboration with platform engineers, cloud infrastructure, and security/compliance teams. Document and share insights, contributing to internal best practices and open-source initiatives as appropriate.

Jan 30, 2026

Apply

Staff + Senior Software Engineer, Cloud Inference

Anthropic

On-site|On-site|San Francisco, CA | New York City, NY | Seattle, WA

About AnthropicAt Anthropic, our mission is to develop AI systems that are safe, interpretable, and controllable. We believe in harnessing AI for the greater good of our users and society at large. Our dynamic team comprises dedicated researchers, engineers, policy experts, and business leaders who collaborate to create beneficial AI systems.About the RoleThe Cloud Inference team is responsible for scaling and optimizing Claude to cater to a vast array of developers and enterprise clients across platforms such as AWS, GCP, Azure, and future cloud service providers (CSPs). We manage the complete lifecycle of Claude on each cloud platform—from API integration and intelligent request routing to inference execution, capacity management, and daily operations.Our engineers wield significant influence, driving multiple key revenue streams while optimizing one of Anthropic's most valuable resources—compute power. As we expand to additional cloud providers, the intricacies of efficiently managing inference across diverse platforms with varying hardware, networking frameworks, and operational models grow substantially. We seek engineers adept at navigating these variances, developing strong abstractions that are effective across providers, and making informed infrastructure choices that keep us cost-effective at scale.Your contributions will enhance the operational scale of our services, expedite our capacity to launch cutting-edge models and innovative features to clients across all platforms, and ensure our large language models (LLMs) adhere to stringent safety, performance, and security standards.

Feb 5, 2026

Apply

Inference Engineer at Cartesia | San Francisco, CA

Cartesia

Full-time|On-site|*HQ - San Francisco, CA

Join Cartesia as an Inference EngineerAt Cartesia, our vision is to create the next evolution of AI: an interactive, omnipresent intelligence that operates seamlessly across all environments. Currently, even the most advanced models struggle to continuously analyze a year's worth of audio, video, and text data—comprising 1 billion text tokens, 10 billion audio tokens, and 1 trillion video tokens—much less perform these tasks on-device.We are at the forefront of developing the model architectures that will make this a reality. Our founding team, who met as PhD candidates at the Stanford AI Lab, pioneered State Space Models (SSMs), a groundbreaking framework for training efficient, large-scale foundation models. Our talented team merges deep expertise in model innovation and systems engineering with a design-focused product engineering approach, enabling us to build and launch state-of-the-art models and user experiences.Supported by leading investors such as Index Ventures and Lightspeed Venture Partners, along with contributions from Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks, and others, we are fortunate to be guided by numerous exceptional advisors and over 90 angel investors from diverse industries, including some of the world’s foremost experts in AI.About the RoleWe are actively seeking an Inference Engineer to propel our mission of creating real-time multimodal intelligence.Your ImpactDevelop and implement a low-latency, scalable, and dependable model inference and serving stack for our innovative foundation models utilizing Transformers, SSMs, and hybrid models.Collaborate closely with our research team and product engineers to efficiently deliver our product suite in a fast, cost-effective, and reliable manner.Construct robust inference infrastructure and monitoring systems for our product offerings.Enjoy substantial autonomy in shaping our products and directly influencing how cutting-edge AI is integrated across diverse devices and applications.What You BringAt Cartesia, we prioritize strong engineering skills due to the complexity and scale of the challenges we tackle.Proficient engineering skills with a comfort level in navigating intricate codebases, and a commitment to producing clean, maintainable code.Experience in developing large-scale distributed systems with strict performance, reliability, and observability requirements.Proven technical leadership, capable of executing and delivering results from zero to one amidst uncertainty.A background in or experience with inference pipelines, machine learning, and generative models.

Dec 12, 2024

Apply

Forward Deployed Engineer - Systems

Modal

FullTime|On-site|San Francisco

About Us:At Modal, we empower AI teams with a robust infrastructure. Our platform offers instant GPU access, lightning-fast container startups, and seamless storage solutions, enabling teams to effortlessly train models, execute batch jobs, and deliver low-latency inferences. Renowned companies like Suno, Lovable, and Substack trust Modal to transition from prototype to production, free from the complexities of infrastructure management.As a rapidly expanding organization with offices in NYC, SF, and Stockholm, we have achieved a remarkable 9-figure ARR and recently secured Series B funding at a valuation of $1.1B. Our thousands of clients, including Lovable, Scale AI, Substack, and Suno, depend on us for their production AI workloads.Joining Modal means becoming part of one of the fastest-growing AI infrastructure companies in its early stages, with abundant opportunities for personal and professional growth. Our diverse team comprises creators of celebrated open-source projects (e.g., Seaborn, Luigi), esteemed academic researchers, international olympiad medalists, and seasoned leaders in engineering and product development.The Role:Modal is on the lookout for a skilled Forward Deployed Engineer (FDE) to collaborate closely with our sales team and propel our technical sales initiatives. In this role, you will serve as the technical authority throughout the sales process, working in tandem with Account Executives to help enterprise clients grasp how Modal can revolutionize their AI/ML infrastructure. Key responsibilities include:Collaborate with Account Executives to pinpoint, qualify, and secure strategic enterprise opportunities.Lead technical discovery sessions with prospective clients to assess their existing infrastructure, challenges, and requirements.Craft and deliver persuasive technical solutions that illustrate how Modal meets customer needs.Design migration pathways from current cloud infrastructures (AWS, GCP, Azure) to Modal's serverless architecture.Conduct technical demonstrations, experiments, and proof-of-concept projects that highlight Modal's capabilities.Manage complex technical evaluations while addressing security, compliance, and integration challenges.Establish trusted advisor relationships with key technical stakeholders including CTOs, VPs of Engineering, and ML Engineering leads.

Sep 23, 2025

Apply

Inference Technical Lead, Sora

OpenAI

Full-time|Hybrid|San Francisco

Join the Sora Team at OpenAIThe Sora team is at the forefront of developing multimodal capabilities within OpenAI’s foundational models. We are a dynamic blend of research and product development, committed to integrating sophisticated multimodal functionalities into our AI offerings. Our focus is on delivering solutions that are not only reliable and intuitive but also resonate with our mission to foster broad societal benefits.Your Role as Inference Technical LeadWe are seeking a talented GPU Inference Engineer to enhance the model serving efficiency for Sora. This pivotal position will empower you to spearhead initiatives aimed at optimizing inference performance and scalability. You will collaborate closely with our researchers to design and develop models that are optimized for inference, directly contributing to the success of our projects.Your contributions will be vital in advancing the team’s overarching objectives, allowing leadership to concentrate on high-impact initiatives by establishing a robust technical foundation.Key Responsibilities:Enhance model serving, inference performance, and overall system efficiency through focused engineering efforts.Implement optimizations targeting kernel and data movement to boost system throughput and reliability.Collaborate with research and product teams to ensure our models operate effectively at scale.Design, construct, and refine essential serving infrastructure to meet Sora’s growth and reliability demands.You Will Excel in This Role If You:Possess deep knowledge in model performance optimization, particularly at the inference level.Have a strong foundation in kernel-level systems, data movement, and low-level performance tuning.Are passionate about scaling high-performing AI systems that address real-world, multimodal challenges.Thrive in ambiguous situations, setting technical direction, and driving complex projects to fruition.This role is based in San Francisco, CA. We follow a hybrid work model requiring 3 in-office days per week and offer relocation assistance to new hires.

Apr 21, 2025

Apply

Developer Relations Engineer

Modal

FullTime|On-site|San Francisco

About Us:Modal is revolutionizing the AI infrastructure landscape. By offering instant GPU access, rapid container startups, and integrated storage solutions, we empower AI teams to seamlessly train models, execute batch jobs, and enable low-latency inference. Leading companies such as Suno, Lovable, and Substack trust Modal to transition from prototype to production without the complexities of infrastructure management.As a rapidly expanding company with teams in NYC, San Francisco, and Stockholm, we have achieved a remarkable 9-figure ARR and recently secured a Series B funding at a valuation of $1.1 billion. Thousands of clients, including Lovable, Scale AI, Substack, and Suno, depend on us for their production AI workloads.Joining Modal means becoming a part of one of the fastest-growing AI infrastructure organizations at a pivotal stage, with immense opportunities for professional development. Our diverse team comprises creators of renowned open-source projects (e.g. Seaborn, Luigi), esteemed academic researchers, international medalists, and seasoned engineering and product leaders with decades of expertise.The Role:At Modal, we craft AI infrastructure products that resonate with developers, driving our rapid growth and ensuring that word of mouth remains a vital channel for us.In this pivotal role, you will focus on creating and disseminating technical content that is not only distinctive and informative but also practical. This content will often be the first interaction many users have with Modal. We aspire to highlight the capabilities and developer-centric experience of our platform while serving as a reliable resource for users exploring new AI technologies.In this role, your responsibilities will include:Translating the latest advancements in AI technology into accessible educational materials for developers.Delivering demonstrations and presentations about Modal and related tools at developer-centric events.Actively engaging with our user community across various online platforms (X, LinkedIn, Reddit, Slack) and participating in live events.Cultivating partnerships, integrations, and co-marketing initiatives with other developer-focused organizations.Establishing objectives aligned with the broader Go-To-Market team and evaluating the impact of your initiatives.Requirements:We are seeking a candidate who possesses:3+ years of experience in developer relations, technical writing, or a related field.Strong understanding of AI technologies and the ability to communicate complex concepts clearly.Excellent presentation and public speaking skills.Proficiency in engaging and building relationships within developer communities.Ability to work collaboratively in a fast-paced environment and adapt to changing priorities.

Sep 4, 2025

Apply

Senior Engineer 2: Inference Optimizations

DigitalOcean, Inc.

Full-time|Remote|San Francisco

Join DigitalOcean as a Senior Engineer focused on Inference Optimizations, where you will play a pivotal role in enhancing our AI and machine learning capabilities. Collaborate with a talented team to develop cutting-edge solutions that optimize inference processes across various applications.

Mar 17, 2026

Create account — see all 5,501 results