Software Engineer Inference Amd Gpu Enablement jobs in San Francisco – Browse 5,578 openings on RoboApply Jobs

Software Engineer Inference Amd Gpu Enablement jobs in San Francisco

Open roles matching “Software Engineer Inference Amd Gpu Enablement” with location signals for San Francisco. 5,578 active listings on RoboApply Jobs.

5,578 jobs found

1 - 20 of 5,578 Jobs
Apply
companyOpenAI logo
Full-time|On-site|San Francisco

About Our TeamThe Inference team at OpenAI is dedicated to translating our cutting-edge research into accessible, transformative technology for consumers, enterprises, and developers. By leveraging our advanced AI models, we enable users to achieve unprecedented levels of innovation and productivity. Our primary focus lies in enhancing model inference efficiency and accelerating progress in research through optimized inference capabilities.About the RoleWe are seeking talented engineers to expand and optimize OpenAI's inference infrastructure, specifically targeting emerging GPU platforms. This role encompasses a wide range of responsibilities from low-level kernel optimization to high-level distributed execution. You will collaborate closely with our research, infrastructure, and performance teams to ensure seamless operation of our largest models on cutting-edge hardware.This position offers a unique opportunity to influence and advance OpenAI’s multi-platform inference capabilities, with a strong emphasis on optimizing performance for AMD accelerators.Your Responsibilities Include:Overseeing the deployment, accuracy, and performance of the OpenAI inference stack on AMD hardware.Integrating our internal model-serving infrastructure (e.g., vLLM, Triton) into diverse GPU-backed systems.Debugging and optimizing distributed inference workloads across memory, network, and compute layers.Validating the correctness, performance, and scalability of model execution on extensive GPU clusters.Collaborating with partner teams to design and optimize high-performance GPU kernels for accelerators utilizing HIP, Triton, or other performance-centric frameworks.Working with partner teams to develop, integrate, and fine-tune collective communication libraries (e.g., RCCL) to parallelize model execution across multiple GPUs.Ideal Candidates Will:Possess experience in writing or porting GPU kernels using HIP, CUDA, or Triton, with a strong focus on low-level performance.Be familiar with communication libraries like NCCL/RCCL, understanding their importance in high-throughput model serving.Have experience with distributed inference systems and be adept at scaling models across multiple accelerators.Enjoy tackling end-to-end performance challenges across hardware, system libraries, and orchestration layers.Be eager to join a dynamic, agile team focused on building innovative infrastructure from the ground up.

Oct 8, 2025
Apply
companyBaseten logo
Full-time|On-site|San Francisco

Baseten develops infrastructure and tools that help AI companies deploy and scale inference. Teams at organizations like Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer rely on Baseten to bring advanced machine learning models into production. The company recently secured a $300M Series E from investors including BOND, IVP, Spark Capital, Greylock, and Conviction. Role overview This Software Engineer - GPU Inference position joins the founding team for Baseten Voice AI in San Francisco. The team focuses on building production-ready Voice AI systems, bringing open-source voice models into real-world use for clients in productivity, customer service, healthcare conversations, and education. The work shapes how people interact with technology through voice, creating broad impact across industries. In this role, the engineer leads the internal inference stack that powers Voice AI models. Responsibilities include guiding the product roadmap and driving engineering execution. Collaboration is a key part of the job, working closely with Forward Deployed Engineers, Model Performance Engineers, and other technical groups to advance Voice AI capabilities. Sample projects and initiatives The world's fastest Whisper, with streaming and diarization Canopy Labs selects Baseten for Orpheus TTS inference Partnering with the Core Product team to build an orchestration framework for a multi-model voice agent Working with the Training Platform team to support continuous training of voice models Designing a developer-friendly API and SDK for self-service adoption of Baseten Voice AI products

Apr 26, 2026
Apply
companyOpenAI logo
Full-time|Hybrid|San Francisco

Join the Sora Team at OpenAIThe Sora team is at the forefront of developing multimodal capabilities within OpenAI’s foundational models. We are a dynamic blend of research and product development, committed to integrating sophisticated multimodal functionalities into our AI offerings. Our focus is on delivering solutions that are not only reliable and intuitive but also resonate with our mission to foster broad societal benefits.Your Role as Inference Technical LeadWe are seeking a talented GPU Inference Engineer to enhance the model serving efficiency for Sora. This pivotal position will empower you to spearhead initiatives aimed at optimizing inference performance and scalability. You will collaborate closely with our researchers to design and develop models that are optimized for inference, directly contributing to the success of our projects.Your contributions will be vital in advancing the team’s overarching objectives, allowing leadership to concentrate on high-impact initiatives by establishing a robust technical foundation.Key Responsibilities:Enhance model serving, inference performance, and overall system efficiency through focused engineering efforts.Implement optimizations targeting kernel and data movement to boost system throughput and reliability.Collaborate with research and product teams to ensure our models operate effectively at scale.Design, construct, and refine essential serving infrastructure to meet Sora’s growth and reliability demands.You Will Excel in This Role If You:Possess deep knowledge in model performance optimization, particularly at the inference level.Have a strong foundation in kernel-level systems, data movement, and low-level performance tuning.Are passionate about scaling high-performing AI systems that address real-world, multimodal challenges.Thrive in ambiguous situations, setting technical direction, and driving complex projects to fruition.This role is based in San Francisco, CA. We follow a hybrid work model requiring 3 in-office days per week and offer relocation assistance to new hires.

Apr 21, 2025
Apply
companyPulse logo
Full-time|On-site|San Francisco

OverviewAt Pulse, we are revolutionizing the way data infrastructure operates by addressing the critical challenge of accurately extracting structured information from intricate documents on a large scale. Our innovative document understanding technique merges intelligent schema mapping with advanced extraction models, outperforming traditional OCR and parsing methods.Located in the heart of San Francisco, we are a dynamic team of engineers dedicated to empowering Fortune 100 enterprises, YC startups, public investment firms, and growth-stage companies. Backed by top-tier investors, we are rapidly expanding our footprint in the industry.What sets our technology apart is our sophisticated multi-stage architecture, which includes:Specialized models for layout understanding and component detectionLow-latency OCR models designed for precise extractionAdvanced algorithms for reading-order in complex document structuresProprietary methods for table structure recognition and parsingFine-tuned vision-language models for interpreting charts, tables, and figuresIf you possess a strong passion for the convergence of computer vision, natural language processing, and data infrastructure, your contributions at Pulse will significantly impact our clients and help shape the future of document intelligence.

Jul 30, 2025
Apply
companyOpenAI logo
Full-time|On-site|San Francisco

About Our TeamJoin the Fleet team at OpenAI, where we empower groundbreaking research and product innovation through our advanced computing infrastructure. We manage extensive systems across data centers, GPUs, and networking, ensuring optimal performance, high availability, and efficiency. Our work is crucial in enabling OpenAI’s models to function seamlessly at scale, supporting both our internal research endeavors and external products like ChatGPT. We are committed to prioritizing safety, reliability, and the ethical deployment of AI technology.About the RoleAs a Software Engineer on the Fleet High Performance Computing (HPC) team, you will play a vital role in ensuring the reliability and uptime of OpenAI’s compute fleet. Minimizing hardware failures is essential for smooth research training progress and uninterrupted services, as even minor hardware issues can lead to significant setbacks. With the rise of large supercomputers, the stakes in maintaining efficiency and stability have never been higher.At the cutting edge of technology, we often lead the charge in troubleshooting complex, state-of-the-art systems at scale. This is a unique opportunity for you to engage with groundbreaking technologies and create innovative solutions that enhance the health and efficiency of our supercomputing infrastructure.Our team fosters a culture of autonomy and ownership, enabling skilled engineers to drive meaningful change. In this role, you will focus on comprehensive system investigations and develop automated solutions to enhance our operations. We seek individuals who dive deep into challenges, conduct thorough investigations, and create scalable automation for detection and remediation.Key Responsibilities:Develop and maintain automation systems for provisioning and managing server fleets.Create tools to monitor server health, performance metrics, and lifecycle events.Collaborate effectively with teams across clusters, networking, and infrastructure.Work closely with external operators to maintain a high level of service quality.Identify and resolve performance bottlenecks and inefficiencies in the system.Continuously enhance automation processes to minimize manual intervention.You Will Excel in This Role if You Have:Experience in managing large-scale server environments.A blend of technical skills in systems programming and infrastructure management.Strong problem-solving abilities and a methodical approach to troubleshooting.Familiarity with high-performance computing technologies and tools.

Feb 5, 2026
Apply
company
Full-time|On-site|San Francisco

Join Prima Mente: A Leader in Biology AIAt Prima Mente, we are redefining the frontier of biology through artificial intelligence. Our mission is to generate unique datasets, develop versatile biological foundation models, and translate groundbreaking discoveries into impactful research and clinical outcomes. With a commitment to understanding the complexities of the brain, we aim to shield it from neurological diseases while enhancing overall health. Our diverse team of AI researchers, experimentalists, clinicians, and operational experts operates across London, San Francisco, and Dubai.Role Overview: GPU/CPU-Accelerated BioinformaticsWe are seeking a skilled Bioinformatics Software Engineer to architect and implement scalable production pipelines for processing multi-omics data. The successful candidate will enable rapid transitions from hypothesis to patent-ready solutions in a matter of months.Key Responsibilities:Design and implement bioinformatics pipelines optimized for GPU/CPU utilization utilizing tools like Flyte and Nextflow, capable of processing over 1,000 samples at scale.Optimize performance and cost efficiency by leveraging GPU/CPU acceleration where it provides the greatest benefit.Collaborate with experimental and machine learning teams to validate computational results and align processing with model requirements.Foster and manage collaborations with academic and industrial research partners.Growth Expectations1 Month: You will be deploying your workflows on GPU/CPU-accelerated cloud infrastructure to process multi-omic experiments, while building relationships with AI/ML and wet lab teams to understand their requirements.3 Months: Your optimized pipelines will be processing thousands of samples with substantial speed enhancements and reduced costs, yielding publication and patent-ready outcomes.6 Months: Your automated pipelines will support daily AI model training, and you will co-design experiments alongside AI/ML engineers, leading technical execution on external collaborations.Your ProfileYou are passionate about pushing the boundaries of AI and biology. As an engineer rather than an analyst, you thrive on enhancing performance and efficiency while architecting robust systems. You are comfortable making rapid technical decisions and iterate quickly.Desired QualificationsExperience in bioinformatics, computational biology, or a related field.Proficiency in software engineering, particularly in developing scalable data processing pipelines.Strong understanding of multi-omics data and methods.Familiarity with GPU/CPU acceleration techniques.Excellent communication and collaboration skills.

Mar 2, 2026
Apply
companyOpenAI logo
Full-time|On-site|San Francisco

About Our TeamJoin the Inference team at OpenAI, where we leverage cutting-edge research and technology to deliver exceptional AI products to consumers, enterprises, and developers. Our mission is to empower users to harness the full potential of our advanced AI models, enabling unprecedented capabilities. We prioritize efficient and high-performance model inference while accelerating research advancements.About the RoleWe are seeking a passionate Software Engineer to optimize some of the world's largest and most sophisticated AI models for deployment in high-volume, low-latency, and highly available production and research environments.Key ResponsibilitiesCollaborate with machine learning researchers, engineers, and product managers to transition our latest technologies into production.Work closely with researchers to enable advanced research initiatives through innovative engineering solutions.Implement new techniques, tools, and architectures that enhance the performance, latency, throughput, and effectiveness of our model inference stack.Develop tools to identify bottlenecks and instability sources, designing and implementing solutions for priority issues.Optimize our code and Azure VM fleet to maximize every FLOP and GB of GPU RAM available.You Will Excel in This Role If You:Possess a solid understanding of modern machine learning architectures and an intuitive grasp of performance optimization strategies, especially for inference.Take ownership of problems end-to-end, demonstrating a willingness to acquire any necessary knowledge to achieve results.Bring at least 5 years of professional software engineering experience.Have or can quickly develop expertise in PyTorch, NVidia GPUs, and relevant optimization software stacks (such as NCCL, CUDA), along with HPC technologies like InfiniBand, MPI, and NVLink.Have experience in architecting, building, monitoring, and debugging production distributed systems, with bonus points for working on performance-critical systems.Have successfully rebuilt or significantly refactored production systems multiple times to accommodate rapid scaling.Are self-driven, enjoying the challenge of identifying and addressing the most critical problems.

Feb 6, 2025
Apply
companyOpenAI logo
Full-time|On-site|San Francisco

About Our TeamJoin OpenAI’s dynamic Inference team, where we empower the deployment of cutting-edge AI models, including our renowned GPT models, advanced Image Generation capabilities, and Whisper, across diverse platforms. Our mission is to ensure these models are not only high-performing and scalable but also available for real-world applications. Collaborating closely with our Research team, we’re committed to bringing the next generation of AI innovations to fruition. As a compact, agile team, we prioritize delivering an exceptional developer experience while continuously pushing the frontiers of artificial intelligence.As we expand our focus into multimodal inference, we are building the necessary infrastructure to support models that process images, audio, and other non-text modalities. This work involves tackling diverse model sizes and interactions, managing complex input/output formats, and ensuring seamless collaboration between product and research teams.About The RoleWe are seeking a passionate Software Engineer to aid in the large-scale deployment of OpenAI’s multimodal models. You will join a small yet impactful team dedicated to creating robust, high-performance infrastructure for real-time audio, image, and various multimodal workloads in production environments.This position is inherently collaborative; you will work directly with researchers who develop these models and with product teams to define novel interaction modalities. Your contributions will enable users to generate speech, interpret images, and engage with models in innovative ways that extend beyond traditional text-based interactions.Key Responsibilities:Design and implement advanced inference infrastructure for large-scale multimodal models.Optimize systems for high-throughput and low-latency processing of image and audio inputs and outputs.Facilitate the transition of experimental research workflows into dependable production services.Engage closely with researchers, infrastructure teams, and product engineers to deploy state-of-the-art capabilities.Contribute to systemic enhancements, including GPU utilization, tensor parallelism, and hardware abstraction layers.You May Excel In This Role If You:Have a proven track record of building and scaling inference systems for large language models or multimodal architectures.Possess experience with GPU-based machine learning workloads and a solid understanding of the performance dynamics associated with large models, particularly with intricate data types like images or audio.Thrive in a fast-paced, experimental environment and enjoy collaborating with cross-functional teams to drive impactful results.

May 21, 2025
Apply
companyBaseten logo
Full-time|On-site|San Francisco

Join Baseten as a Software Engineer focusing on GPU Networking and Distributed Systems. In this pivotal role, you'll collaborate with talented engineers and researchers to develop cutting-edge solutions that leverage GPU technology for high-performance networking operations. Your contributions will be instrumental in shaping the future of distributed systems, enhancing performance, scalability, and reliability.

Feb 23, 2026
Apply
companyAnthropic logo
Full-time|Remote|San Francisco, CA | Seattle, WA

Join our innovative team at Anthropic as a Software Engineer specializing in Cloud Inference Safeguards. In this role, you will play a crucial part in developing and enhancing the systems that ensure the robustness and security of our cloud-based inference services. You will collaborate with cross-functional teams to design, implement, and maintain scalable solutions that meet our high standards for reliability and performance.

Mar 27, 2026
Apply
companyDatabricks logo
Full-time|$190.9K/yr - $232.8K/yr|On-site|San Francisco, California

P-1285 About This Role Join Databricks as a Staff Software Engineer specializing in GenAI inference, where you will spearhead the architecture, development, and optimization of the inference engine that powers the Databricks Foundation Model API. Your role will be crucial in bridging cutting-edge research with real-world production requirements, ensuring exceptional throughput, minimal latency, and scalable solutions. You will work across the entire GenAI inference stack, including kernels, runtimes, orchestration, memory management, and integration with various frameworks and orchestration systems. What You Will Do Take full ownership of the architecture, design, and implementation of the inference engine, collaborating on a model-serving stack optimized for large-scale LLM inference. Work closely with researchers to integrate new model architectures or features, such as sparsity, activation compression, and mixture-of-experts into the engine. Lead comprehensive optimization efforts focused on latency, throughput, memory efficiency, and hardware utilization across GPUs and other accelerators. Establish and uphold standards for building and maintaining instrumentation, profiling, and tracing tools to identify performance bottlenecks and drive optimizations. Design scalable solutions for routing, batching, scheduling, memory management, and dynamic loading tailored to inference workloads. Guarantee reliability, reproducibility, and fault tolerance in inference pipelines, including capabilities for A/B testing, rollbacks, and model versioning. Collaborate cross-functionally to integrate with federated and distributed inference infrastructure, ensuring effective orchestration across nodes, load balancing, and minimizing communication overhead. Foster collaboration with cross-functional teams, including platform engineers, cloud infrastructure, and security/compliance professionals. Represent the team externally through benchmarks, whitepapers, and contributions to open-source projects. What We Look For A BS/MS/PhD in Computer Science or a related discipline. A solid software engineering background with 6+ years of experience in performance-critical systems. A proven ability to own complex system components and influence architectural decisions from conception to execution. A deep understanding of ML inference internals, including attention mechanisms, MLPs, recurrent modules, quantization, and sparse operations. Hands-on experience with CUDA, GPU programming, and essential libraries (cuBLAS, cuDNN, NCCL, etc.). A strong foundation in distributed systems design, including RPC frameworks, queuing, RPC batching, sharding, and memory partitioning. Demonstrated proficiency in diagnosing and resolving performance bottlenecks across multiple layers (kernel, memory, networking, scheduler).

Jan 30, 2026
Apply
companyFluidstack logo
Full-time|$165K/yr - $500K/yr|On-site|San Francisco, CA

Join the Fluidstack TeamAt Fluidstack, we’re pioneering the infrastructure for advanced intelligence. We collaborate with leading AI laboratories, governmental entities, and major corporations—including Mistral, Poolside, and Meta—to deliver computing solutions at unprecedented speeds.Our mission is to transform the vision of Artificial General Intelligence (AGI) into a reality. Driven by our purpose, our dedicated team is committed to building state-of-the-art infrastructure that prioritizes our customers' success. If you share our passion for excellence and are eager to contribute to the future of intelligence, we invite you to be part of our journey.Role OverviewThe Inference Platform team at Fluidstack is at the forefront of addressing the cost and latency challenges associated with frontier AI. You will play a crucial role in managing the serving layer that connects our global accelerator supply with the production workloads of our clients, which include LLM serving frameworks, KV cache infrastructure, and Kubernetes orchestration across multiple data centers.This hands-on individual contributor role combines elements of distributed systems, model optimization, and serving infrastructure. You will oversee the entire lifecycle of inference deployments for leading AI labs, striving for enhancements in throughput, cost-efficiency, and response times, while also influencing the architectural decisions that guide Fluidstack’s deployment strategies.

Mar 5, 2026
Apply
companyBaseten logo
Full-time|$300K/yr - $300K/yr|On-site|San Francisco

ABOUT BASETENAt Baseten, we empower the leading AI companies of today, including Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer, by providing essential inference capabilities. Our unique blend of applied AI research, adaptable infrastructure, and intuitive developer tools enables innovators at the cutting edge of AI to seamlessly transition advanced models into production. With our recent success in securing a $300M Series E funding round, backed by notable investors such as BOND, IVP, Spark Capital, Greylock, and Conviction, we're on an exciting growth trajectory. Join our team and contribute to the platform that engineers rely on to launch AI-driven products.THE ROLEAs an Applied AI Inference Engineer at Baseten, you'll collaborate closely with clients to design, develop, and implement high-performance AI applications using our platform. You will guide customers through the entire process, from initial concept to deployment, transforming vague business objectives into dependable, observable solutions that meet defined quality, latency, and cost metrics.This position is ideal for innovative engineers eager to gain insight into how modern organizations scale AI adoption. You will thrive if you enjoy a multifaceted role that intersects product development, software engineering, performance optimization, and direct customer engagement.It’s essential to note that this position requires hands-on coding and software development, while also encompassing elements of product management, technical customer success, and pre-sales engineering.EXAMPLE INITIATIVESExplore insights from our Forward Deployed Engineering team through these blog posts: Forward Deployed Engineering on the frontier of AIThe fastest, most accurate Whisper transcriptionDeploy production-ready model servers from Docker imagesDeploy custom ComfyUI workflows as APIs...

Nov 4, 2025
Apply
companyOpenAI logo
Full-time|On-site|San Francisco

Join our dynamic team at OpenAI as a Software Engineer specializing in Workload Enablement. In this pivotal role, you will be responsible for designing and implementing innovative solutions that enhance our operational efficiency and performance.Your expertise will contribute to building scalable systems that empower our teams to harness the full potential of AI technologies. Collaborate with cross-functional teams to identify challenges and deliver cutting-edge solutions that drive our mission forward.

Mar 28, 2026
Apply
companyAnthropic logo
On-site|On-site|San Francisco, CA | New York City, NY | Seattle, WA

Join Anthropic as a Software Engineer on our Launch Engineering team, where your focus will be on designing and building cutting-edge deployment infrastructure for inference code. You will ensure our AI models, at scale, are continuously and seamlessly deployed to production. This role is pivotal in optimizing resource management while maximizing deployment efficiency. Your expertise will be essential in navigating complex deployment challenges, validating systems, and ensuring minimal disruption to our user services. If you thrive in tackling ambitious problems at the intersection of automation and resource management, this position offers the opportunity to make a significant impact.

Feb 10, 2026
Apply
companyGenmo logo
Full-time|On-site|San Francisco HQ

At Genmo, we are at the forefront of advancing artificial intelligence through innovative research in video generation. Our mission is to construct open, cutting-edge models that will ultimately contribute to the realization of Artificial General Intelligence (AGI). As part of our dynamic team, you will play a pivotal role in redefining the future of AI and expanding the horizons of video creation.We are looking for a skilled GPU Performance Engineer who can extract maximum performance from our H100 infrastructure and fine-tune our model serving stack to achieve unparalleled efficiency. If you are passionate about optimizing performance, particularly at the microsecond level, and thrive on pushing hardware to its limits, this is the perfect opportunity for you.Key ResponsibilitiesUtilize advanced profiling tools such as Nsight Systems and nvprof to analyze and enhance GPU workloads.Develop high-performance CUDA and Triton kernels to optimize essential model functions.Reduce cold start latency from seconds to mere milliseconds in our serving infrastructure.Optimize memory access patterns, implement kernel fusion, and maximize GPU utilization.Collaborate closely with machine learning engineers to optimize model implementations.Diagnose and resolve performance issues throughout the application and hardware stack.Implement custom memory pooling and allocation strategies to enhance performance.Promote performance optimization techniques and foster a culture of excellence across teams.

Jul 17, 2025
Apply
companySciforium logo
Full-time|On-site|San Francisco

At Sciforium, we are at the forefront of AI infrastructure, innovating next-generation multimodal AI models and a proprietary high-efficiency serving platform. With substantial funding and direct collaboration from AMD, supported by their engineers, our team is rapidly expanding to develop the complete stack that powers cutting-edge AI models and real-time applications.About the RoleWe are on the lookout for a talented GPU Kernel Engineer who is eager to explore and maximize performance on modern accelerators. In this role, you will be responsible for designing and optimizing custom GPU kernels that drive our advanced large-scale AI systems. You will navigate the hardware-software stack, engaging in low-level kernel development and integrating optimized operations into high-level machine learning frameworks for large-scale training and inference.This position is perfect for someone who excels at the intersection of GPU programming, systems engineering, and state-of-the-art AI workloads, and aims to contribute significantly to the efficiency and scalability of our machine learning platform.Key ResponsibilitiesDevelop, implement, and enhance custom GPU kernels utilizing C++, PTX, CUDA, ROCm, Triton, and/or JAX Pallas.Profile and fine-tune the end-to-end performance of machine learning operations, particularly for large-scale LLM training and inference.Integrate low-level GPU kernels into frameworks such as PyTorch, JAX, and our proprietary internal runtimes.Create performance models, pinpoint bottlenecks, and deliver kernel-level enhancements that significantly boost AI workloads.Collaborate with machine learning researchers, distributed systems engineers, and model-serving teams to optimize computational performance across the entire stack.Engage closely with hardware vendors (NVIDIA/AMD) and stay updated on the latest GPU architecture and compiler/toolchain advancements.Contribute to the development of tools, documentation, benchmarking suites, and testing frameworks ensuring correctness and performance reproducibility.Must-Haves5+ years of industry or research experience in GPU kernel development or high-performance computing.Bachelor’s, Master’s, or PhD in Computer Science, Computer Engineering, Electrical Engineering, Applied Mathematics, or a related discipline.Strong programming proficiency in C++, Python, and familiarity with machine learning frameworks.

Dec 6, 2025
Apply
companyInferact logo
Full-time|$200K/yr - $400K/yr|Remote|San Francisco

At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, revolutionizing AI progress by making inference both more accessible and efficient. Our founding team consists of the original creators and key maintainers of vLLM, positioning us uniquely at the nexus of cutting-edge models and advanced hardware.Role OverviewWe are seeking a passionate inference runtime engineer eager to explore and expand the frontiers of LLM and diffusion model serving. As models evolve and grow in complexity with new architectures like mixture-of-experts and multimodal designs, the demand for innovative solutions in our inference engine intensifies. This role places you at the heart of vLLM, where you will enhance model execution across a variety of hardware platforms and architectures. Your contributions will have a direct influence on the future of AI inference.

Jan 22, 2026
Apply
companyBenchling logo
Full-time|On-site|San Francisco, CA

At Benchling, we are transforming the biotech landscape for the AI era.When scientific breakthroughs are stalled, the world holds its breath. The journey from molecule discovery to patient treatment, or from laboratory innovations to agricultural fields, encompasses countless slow, manual, and uncoordinated procedures. AI holds the promise to revolutionize this process, significantly reducing R&D timelines from decades to mere years. However, this transformation is only achievable when clean, structured scientific data and AI are seamlessly integrated into scientific methodologies.Benchling serves as the premier AI platform for biotech R&D. Our platform empowers over 200,000 scientists globally, facilitating experiment design, structured data capture, and the execution of AI models directly within their workflows. Leading organizations, including Sanofi, Moderna, and more than half of the top 50 biopharmaceutical companies worldwide, trust Benchling to enhance their most critical research initiatives.ROLE OVERVIEWAs a Software Engineer on the Developer Enablement team, you will be instrumental in constructing the platform that empowers our network of scientific applications. Your role extends beyond mere coding; you will influence how developers, partners, and scientists engage with Benchling’s products, enabling the creation of tools that directly bolster groundbreaking research. Your contributions will redefine scientific workflows, fostering innovation and ensuring the seamless integration of third-party solutions across our platform.The team is focused on various early- to mid-stage initiatives, including developer APIs, event delivery systems, AI-powered agents, and enhanced platform infrastructure to support a burgeoning ecosystem of scientific applications. This diversity allows engineers the flexibility to pursue projects that align with both business objectives and their personal interests and skills. This role is envisioned as a “Developer Platform Engineer,” highlighting the expansive opportunities to contribute across our developer-facing systems.This role is hands-on and high-impact, where your contributions will be visible throughout the company and to our scientific users around the globe. You will collaborate closely with product managers, designers, and fellow engineers, tackling complex challenges while advancing your expertise in platform development and life sciences.

Apr 8, 2026
Apply
companyCohere logo
Full-Time|On-site|San Francisco

Who are we?At Cohere, our mission is to elevate intelligence to benefit humanity. We specialize in training and deploying cutting-edge models for developers and enterprises focused on creating AI systems that deliver extraordinary experiences such as content generation, semantic search, retrieval-augmented generation, and intelligent agents. We view our work as pivotal to the broad acceptance of AI technologies.We are passionate about our creations. Every team member plays a vital role in enhancing our models' capabilities and the value they provide to our customers. We thrive on hard work and speed, always prioritizing our clients' needs.Cohere is a diverse team of researchers, engineers, designers, and more, all dedicated to their craft. Each individual is a leading expert in their field, and we recognize that a variety of perspectives is essential to developing exceptional products.Join us in our mission and help shape the future of AI!Why this role?Are you excited about architecting high-performance, scalable, and reliable machine learning systems? Do you aspire to shape and construct the next generation of AI platforms that enhance advanced NLP applications? We are seeking talented Members of Technical Staff to join our Model Serving team at Cohere. This team is responsible for the development, deployment, and operation of our AI platform, which delivers Cohere's large language models via user-friendly API endpoints. In this role, you will collaborate with multiple teams to deploy optimized NLP models in production settings characterized by low latency, high throughput, and robust availability. Additionally, you will have the opportunity to work directly with customers to create tailored deployments that fulfill their unique requirements.

Jan 12, 2026

Sign in to browse more jobs

Create account — see all 5,578 results

Tailoring 0 resumes

We'll move completed jobs to Ready to Apply automatically.