Qualifications
Required Qualifications
- Bachelor's degree or equivalent experience in computer science, engineering, or a related field.
- In-depth understanding of transformer architectures and their derivatives.
- Proficient programming skills in Python, with a strong background in PyTorch internals.
- Experience with LLM inference systems (e.g., vLLM, TensorRT-LLM, SGLang, TGI).
- Ability to interpret and implement model architectures and inference techniques as presented in academic papers.
- Proven capability to produce high-performance, maintainable code and troubleshoot complex machine learning codebases.

Preferred Qualifications
- Comprehensive knowledge of KV-cache memory management, prefix caching, and hybrid model serving.
- Familiarity with reinforcement learning frameworks and algorithms for large language models.
- Experience in multimodal inference across various media types (audio, image, video, text).
- Previous contributions to open-source machine learning or systems infrastructure projects.

Additionally, bonus points if you have:
- Successfully implemented core features in vLLM or other inference engine projects.
- Contributed to vLLM integrations (e.g., verl, OpenRLHF, Unsloth, LlamaFactory).
- Authored widely-shared technical blogs or side projects focusing on vLLM or LLM inference.
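The preferred qualifications above mention KV-cache memory management and prefix caching. The core idea is that requests sharing a token prefix (for example, a common system prompt) can reuse already-computed attention state. A toy sketch of that lookup, with invented class and token-id values and none of vLLM's actual block-level memory management:

```python
class PrefixCache:
    """Toy prefix cache: map an exact token-id prefix to a (mock)
    KV-cache block id. Engines like vLLM hash fixed-size token blocks
    and manage real GPU memory; this only shows lookup-before-compute."""

    def __init__(self):
        self._blocks = {}      # prefix tuple -> block id
        self._next_id = 0

    def lookup_or_insert(self, tokens):
        """Return (block_id, cache_hit) for this exact prefix."""
        key = tuple(tokens)
        if key in self._blocks:
            return self._blocks[key], True
        self._blocks[key] = self._next_id
        self._next_id += 1
        return self._blocks[key], False

cache = PrefixCache()
system_prompt = [101, 7592, 2088]   # made-up token ids for a shared prompt
block_a, hit_a = cache.lookup_or_insert(system_prompt)   # miss: computed
block_b, hit_b = cache.lookup_or_insert(system_prompt)   # hit: reused
```

On the second request the cached block is returned instead of recomputed, which is the saving prefix caching delivers in a real serving engine.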
About the job
At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, revolutionizing AI progress by making inference both more accessible and efficient. Our founding team consists of the original creators and key maintainers of vLLM, positioning us uniquely at the nexus of cutting-edge models and advanced hardware.
Role Overview
We are seeking a passionate inference runtime engineer eager to explore and expand the frontiers of LLM and diffusion model serving. As models evolve and grow in complexity with new architectures like mixture-of-experts and multimodal designs, the demand for innovative solutions in our inference engine intensifies. This role places you at the heart of vLLM, where you will enhance model execution across a variety of hardware platforms and architectures. Your contributions will have a direct influence on the future of AI inference.
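The role overview mentions mixture-of-experts architectures. At their core is a routing step that sends each token to only a few experts; a generic top-k gating sketch in pure Python (illustrative only, with made-up logits, not vLLM's implementation):

```python
import math

def top_k_route(logits, k=2):
    """Toy MoE router: softmax over per-expert gate logits, keep the
    top-k experts, and renormalize their weights to sum to 1."""
    m = max(logits)                              # stabilize the softmax
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # indices of the k largest gate logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    z = sum(probs[i] for i in top)
    return {i: probs[i] / z for i in top}        # expert index -> weight

weights = top_k_route([1.0, 3.0, 0.5, 2.0], k=2)   # experts 1 and 3 win
```

The token's output is then the weighted sum of just those k experts' outputs, which is why MoE serving stresses an inference engine's scheduling and memory paths.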
About Inferact
Inferact is dedicated to advancing the field of artificial intelligence through innovative solutions in inference technology. Our team, composed of the original architects of vLLM, is committed to shaping the future of AI by creating tools that make inference faster and more cost-effective.
Similar jobs
Full-time|$200K/yr - $400K/yr|Remote|San Francisco
At Inferact, we are on a mission to establish vLLM as the premier AI inference engine, significantly enhancing the speed and reducing the cost of AI inference. Our founders, the visionaries behind vLLM, have spent years bridging the gap between advanced models and cutting-edge hardware.

About the Role
We are seeking a skilled performance engineer dedicated to maximizing the computational efficiency of modern accelerators. In this role, you'll develop kernels and implement low-level optimizations that position vLLM as the fastest inference engine globally. Your contributions will be pivotal as your code will execute across a broad spectrum of hardware accelerators, from NVIDIA GPUs to the latest silicon innovations. You'll collaborate closely with hardware vendors to ensure we fully leverage the capabilities of each new generation of chips.
At Gimlet Labs, we are pioneering the first heterogeneous neocloud tailored for AI workloads. As the demand for AI systems grows, traditional infrastructure faces significant limitations in terms of power, capacity, and cost. Our innovative platform addresses these challenges by decoupling AI workloads from the hardware, intelligently partitioning tasks, and directing each component to the most suitable hardware for optimal performance and efficiency. This method allows for the creation of heterogeneous systems that span multiple vendors and generations of hardware, including the latest cutting-edge accelerators, achieving substantial improvements in performance and cost-effectiveness.

Building upon this robust foundation, Gimlet is developing a production-grade neocloud designed for agentic workloads. Our customers can effortlessly deploy and manage their workloads with stable, production-ready APIs, eliminating the complexities of hardware selection, placement, or low-level performance optimization. We collaborate with foundational labs, hyperscalers, and AI-native companies to drive real production workloads capable of scaling to gigawatt-class AI data centers.

We are currently seeking a dedicated Member of Technical Staff specializing in kernels and GPU performance. In this role, you will work closely with accelerators and execution hardware to extract maximum performance from AI workloads across diverse and rapidly evolving platforms. You will analyze low-level execution behaviors, design and optimize kernels, and ensure consistent performance across both established and emerging hardware.

This position is perfect for engineers who thrive on deep performance analysis, enjoy exploring hardware trade-offs, and are passionate about transforming theoretical peak performance into tangible real-world outcomes.
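Routing each partitioned task to the most suitable hardware, as described above, is at its simplest an assignment problem. A toy greedy placer (all device names, capacities, and cost numbers below are invented for illustration; a real scheduler weighs bandwidth, memory, and price):

```python
def place(tasks, devices):
    """Toy heterogeneous scheduler: give each task the cheapest device
    (by its invented cost estimate) that still has free capacity."""
    assignment = {}
    free = dict(devices)                      # device -> remaining slots
    for task, costs in tasks.items():
        # candidate devices with capacity left, cheapest first
        options = sorted((c, d) for d, c in costs.items() if free[d] > 0)
        cost, dev = options[0]                # assumes some device is free
        assignment[task] = dev
        free[dev] -= 1
    return assignment

devices = {"H100": 1, "MI300": 1, "older-gpu": 2}   # made-up fleet
tasks = {                                           # made-up cost table
    "prefill": {"H100": 1.0, "MI300": 1.2, "older-gpu": 3.0},
    "decode":  {"H100": 1.1, "MI300": 1.0, "older-gpu": 2.0},
    "embed":   {"H100": 0.9, "MI300": 1.0, "older-gpu": 1.1},
}
plan = place(tasks, devices)
```

Once the best devices fill up, later tasks spill onto older hardware, which is the basic mechanism that lets a heterogeneous fleet absorb more work than a single-vendor cluster.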
Full-time|$190.9K/yr - $232.8K/yr|On-site|San Francisco, California
P-1285
About This Role
Join our dynamic team at Databricks as a Staff Software Engineer specializing in GenAI Performance and Kernel. In this pivotal role, you will take charge of designing, implementing, and optimizing high-performance GPU kernels that drive our GenAI inference stack. Your expertise will lead the development of finely-tuned, low-level compute paths, balancing hardware efficiency with versatility, while mentoring fellow engineers in the intricacies of kernel-level performance engineering. Collaborating closely with machine learning researchers, systems engineers, and product teams, you will elevate the forefront of inference performance at scale.

What You Will Do
- Lead the design, implementation, benchmarking, and maintenance of essential compute kernels (such as attention, MLP, softmax, layernorm, and memory management) tailored for diverse hardware backends (GPUs, accelerators).
- Steer the performance roadmap for kernel-level enhancements, focusing on areas like vectorization, tensorization, tiling, fusion, mixed precision, sparsity, quantization, memory reuse, scheduling, and auto-tuning.
- Integrate kernel optimizations seamlessly with higher-level machine learning systems.
- Develop and uphold profiling, instrumentation, and verification tools to identify correctness issues, performance regressions, numerical discrepancies, and hardware utilization inefficiencies.
- Conduct performance investigations and root-cause analyses to address inference bottlenecks, such as memory bandwidth, cache contention, kernel launch overhead, and tensor fragmentation.
- Create coding patterns, abstractions, and frameworks to modularize kernels for reuse, cross-backend compatibility, and maintainability.
- Influence architectural decisions to enhance kernel efficiency (including memory layout, dataflow scheduling, and kernel fusion boundaries).
- Guide and mentor fellow engineers focused on lower-level performance, conducting code reviews and establishing best practices.
- Collaborate with infrastructure, tooling, and machine learning teams to implement kernel-level optimizations in production and assess their impacts.
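Tiling, one of the kernel optimizations named above, restructures a loop nest so each data tile is reused while it is hot in fast memory. A pure-Python sketch of a tiled matrix multiply (real kernels do this in CUDA or Triton with shared memory; here only the loop structure is the point):

```python
def blocked_matmul(A, B, tile=2):
    """Toy tiled matrix multiply over lists of lists: iterate over
    (i, j, p) tiles, then over elements within each tile. In a GPU
    kernel the tile would be staged in shared memory for reuse."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = blocked_matmul(A, B, tile=1)
```

Any tile size yields the same result; the tile size only changes the traversal order, which is exactly the degree of freedom auto-tuners search over.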
Join Zyphra as a Research Engineer specializing in AI Performance and Kernel Optimization. In this role, you will work at the forefront of AI technologies, developing and optimizing kernel solutions that enhance the performance of our systems. You will collaborate with cross-functional teams, leveraging your expertise to drive innovation and efficiency.
At Sciforium, we are at the forefront of AI infrastructure, innovating next-generation multimodal AI models and a proprietary high-efficiency serving platform. With substantial funding and direct collaboration from AMD, supported by their engineers, our team is rapidly expanding to develop the complete stack that powers cutting-edge AI models and real-time applications.

About the Role
We are on the lookout for a talented GPU Kernel Engineer who is eager to explore and maximize performance on modern accelerators. In this role, you will be responsible for designing and optimizing custom GPU kernels that drive our advanced large-scale AI systems. You will navigate the hardware-software stack, engaging in low-level kernel development and integrating optimized operations into high-level machine learning frameworks for large-scale training and inference.

This position is perfect for someone who excels at the intersection of GPU programming, systems engineering, and state-of-the-art AI workloads, and aims to contribute significantly to the efficiency and scalability of our machine learning platform.

Key Responsibilities
- Develop, implement, and enhance custom GPU kernels utilizing C++, PTX, CUDA, ROCm, Triton, and/or JAX Pallas.
- Profile and fine-tune the end-to-end performance of machine learning operations, particularly for large-scale LLM training and inference.
- Integrate low-level GPU kernels into frameworks such as PyTorch, JAX, and our proprietary internal runtimes.
- Create performance models, pinpoint bottlenecks, and deliver kernel-level enhancements that significantly boost AI workloads.
- Collaborate with machine learning researchers, distributed systems engineers, and model-serving teams to optimize computational performance across the entire stack.
- Engage closely with hardware vendors (NVIDIA/AMD) and stay updated on the latest GPU architecture and compiler/toolchain advancements.
- Contribute to the development of tools, documentation, benchmarking suites, and testing frameworks ensuring correctness and performance reproducibility.

Must-Haves
- 5+ years of industry or research experience in GPU kernel development or high-performance computing.
- Bachelor's, Master's, or PhD in Computer Science, Computer Engineering, Electrical Engineering, Applied Mathematics, or a related discipline.
- Strong programming proficiency in C++ and Python, and familiarity with machine learning frameworks.
At Composio, we are developing advanced infrastructure that enables agents to seamlessly interact with essential work tools such as GitHub, Gmail, Notion, Salesforce, and more. Our dedicated team of engineers is committed to tackling challenges ranging from contextual understanding to search functionalities, ensuring we provide an exceptional bridge between your agents and their tools.

Having secured $25M in Series A funding from Lightspeed, alongside prominent angel investors like Guillermo Rauch (CEO of Vercel), Dharmesh Shah (CTO of HubSpot), and Gokul Rajaram, we have experienced remarkable growth, tripling our ARR at the start of this year. Our clientele includes notable names from Y Combinator cohorts to Wabi, Glean, Zoom, and beyond.

Your Role
- Enhance the experience of teams utilizing our platform by refining our core APIs and SDK.
- Create intuitive interfaces for both frontend and SDK applications.
- Take ownership of product development from concept through to production.
- Collaborate closely with customers to cultivate their loyalty while enhancing the product.
- Craft clear and concise documentation.
ABOUT BASETEN
At Baseten, we empower the world's leading AI firms, such as Cursor, Notion, and OpenEvidence, by delivering mission-critical inference solutions. Our unique blend of applied AI research, robust infrastructure, and user-friendly developer tools enables AI pioneers to effectively deploy groundbreaking models. With our recent achievement of a $300M Series E funding round supported by esteemed investors like BOND and IVP, we're on an exciting growth trajectory. Join our dynamic team and contribute to the platform that drives the next generation of AI products.

THE ROLE
We are looking for an experienced Senior GPU Kernel Engineer to join our innovative team at the forefront of AI acceleration. In this role, your programming expertise will directly enhance the performance of cutting-edge machine learning models. You'll be responsible for developing highly efficient GPU kernels that optimize computational processes, allowing for transformative AI applications. You'll thrive in a fast-paced, intellectually challenging environment where your technical skills are pivotal. Your contributions will directly affect production systems that serve millions of users across various platforms. This position offers exceptional opportunities for career advancement for engineers enthusiastic about low-level optimization and impactful systems engineering.

EXAMPLE INITIATIVES
As part of our Model Performance team, you will engage in projects like:
- Baseten Embeddings Inference: the quickest embeddings solution available
- The Baseten Inference Stack
- Model performance optimization

RESPONSIBILITIES
- Design and develop high-performance GPU kernels for essential machine learning operations, including matrix multiplications and attention mechanisms.
- Collaborate with cross-functional teams to drive performance improvements and implement optimizations.
- Debug and refine kernel code to achieve maximal efficiency and reliability.
- Stay abreast of the latest advancements in GPU technology and machine learning frameworks.
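The attention mechanism named in the responsibilities above reduces, for a single query, to scores, a softmax, and a weighted sum. A pure-Python sketch of the math (with made-up 2-dimensional vectors; real kernels fuse these steps FlashAttention-style on the GPU):

```python
import math

def attention(q, K, V):
    """Toy single-query scaled dot-product attention:
    scores = q.K^T / sqrt(d), stable softmax, weighted sum of V rows."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)                       # subtract max for stability
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]                # attention weights, sum to 1
    return [sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))]

q = [1.0, 0.0]                            # query aligned with first key
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, K, V)                  # pulled toward V's first row
```

Because the weights sum to 1, the output is a convex combination of the value rows, sitting between them but closer to the row whose key matches the query.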
About the Role
OpenAI is looking for a Software Engineer specializing in Kernel Performance and AI Tooling to join the team in San Francisco. This role centers on improving software systems for maximum efficiency and building advanced tools that support AI development.

What You Will Do
- Optimize kernel-level performance across OpenAI's software stack.
- Design and implement tools that accelerate AI research and deployment.
- Work closely with engineers to identify bottlenecks and deliver practical solutions.
- Contribute to technical discussions and share knowledge with teammates.

Team and Collaboration
Work alongside engineers who are committed to advancing AI technology. Collaboration and innovation are central to the team's approach.
At Catalog, we are pioneering the commerce infrastructure for AI, creating the essential framework that enables digital agents to not only explore the web but also comprehend, analyze, and engage with products. Our innovations drive the future of AI-driven shopping experiences, fundamentally transforming how consumers discover and purchase items online.

Role Overview
As a Technical Staff Member, you will be instrumental in developing core systems, shaping our engineering culture, and transitioning our vision from prototype to a robust platform. This role requires full-stack expertise and a commitment to owning and resolving challenges from start to finish.

Who You Are
- You have experience creating beloved and trusted products from the ground up.
- You combine technical proficiency with a keen product sense and data-driven intuition.
- You are well-versed in AI technologies.
- You prioritize speed, write clean code, and ensure thorough instrumentation.
- You seek a high level of ownership within a small, talent-rich team based in San Francisco.

Challenges You Will Tackle
- Develop and deploy agentic-search APIs that deliver structured and real-time product data in milliseconds.
- Build checkout systems enabling agents to conduct transactions with any merchant.
- Create an embeddings and retrieval layer that optimizes recall, precision, and cost efficiency.
- Establish a product graph and ranking pipeline that adapts based on actual user outcomes.

Preferred Qualifications
- Proven experience shipping data-centric products in a live environment.
- Experience with recommendation systems or information retrieval methodologies.
- Familiarity with API development, search indexing, and data pipeline construction.

Our Work Culture
We operate with a small, high-trust, and highly motivated team, fostering an environment of in-person collaboration in North Beach, San Francisco. Our process involves debate, decision-making, and execution. If your profile aligns with our needs, we will contact you to arrange 2-3 brief technical interviews, followed by an onsite meeting in our office where you will collaborate on a small project, exchange ideas, and meet the team.
About TierZero
TierZero helps engineering teams use AI to build and ship code more efficiently. The platform targets the bottleneck of human speed in production, giving teams tools for faster incident response, better operational visibility, and shared knowledge. TierZero is backed by $7M in funding from investors including Accel and SV Angel. Companies like Discord, Drata, and Framer trust TierZero to strengthen their infrastructure for AI-driven engineering.

Role Overview: Founding Member of Technical Staff
This is an on-site role based at TierZero's San Francisco headquarters, with three days a week in the office. As a founding member, direct collaboration with the CEO, CTO, and early customers shapes the direction of both product and systems. The work spans hands-on development and close engagement with users and leadership.

What You Will Do
- Design and build intelligent AI systems to analyze large volumes of unstructured data.
- Deliver full-stack features based on real user feedback.
- Improve the product experience so AI agents are both reliable and easy for engineers to use.
- Develop systems that automatically evaluate LLM outputs and advance agentic reasoning using self-play and feedback loops.
- Create machine learning pipelines, including data ingestion, feature generation, embedding stores, retrieval-augmented generation (RAG), vector search, and graph databases.
- Prototype with open-source and new LLMs, comparing their strengths and weaknesses.
- Build scalable infrastructure for long-running, multi-step agents, with attention to memory, state, and asynchronous workflows.

What We Look For
- Over five years of relevant professional or open-source experience.
- Comfort working in environments with uncertainty and evolving challenges.
- Strong product focus and a drive for customer satisfaction.
- Interest in large language models (LLMs), Model Control Planes (MCPs), cloud infrastructure, and observability tools.
- Previous startup experience is a plus.
Location
This position is based in San Francisco. Expect to work on-site three days per week at TierZero's HQ.
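The pipeline pieces listed in the TierZero role (embedding stores, RAG, vector search) reduce to nearest-neighbor lookup over vectors. A minimal cosine-similarity retriever, with invented document names and embeddings (production systems use ANN indexes such as HNSW instead of this brute-force scan):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_vec, store, top_k=1):
    """Toy vector search: rank (doc_id, embedding) pairs by cosine
    similarity to the query and return the top_k doc ids."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]

store = [                                  # made-up 3-d embeddings
    ("incident-runbook", [0.9, 0.1, 0.0]),
    ("deploy-guide",     [0.1, 0.9, 0.0]),
    ("oncall-schedule",  [0.0, 0.2, 0.9]),
]
hits = retrieve([1.0, 0.0, 0.1], store, top_k=2)
```

In a RAG pipeline, the text of the retrieved documents is then placed into the LLM's prompt as grounding context.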
TierZero builds tools that help engineering teams deliver and manage code efficiently. The platform enables quicker incident response, clearer operational visibility, and shared knowledge among engineers. Backed by $7 million from investors like Accel and SV Angel, TierZero supports clients such as Discord, Drata, and Framer as they strengthen infrastructure for AI-driven work.

This role is based at TierZero's San Francisco headquarters, with a hybrid schedule requiring three days onsite each week. As a founding member of the technical staff, you will work directly with the CEO, CTO, and customers to influence the direction of TierZero's core products and systems. The position calls for flexibility as priorities shift and close collaboration across the company.

What you will do
- Design and develop AI systems that handle large volumes of unstructured data.
- Build full-stack product features, informed by direct feedback from users.
- Enhance the product so agents are intelligent, reliable, and easy for engineers to use.
- Create systems to automatically evaluate outputs from large language models and improve agentic reasoning through self-play and feedback.
- Construct machine learning pipelines, including data ingestion, feature creation, embedding stores, retrieval-augmented generation (RAG) pipelines, vector search, and graph databases.
- Experiment with open-source and emerging large language models to compare different approaches.
- Develop scalable infrastructure for long-running, multi-step agents, including memory, state management, and asynchronous workflows.

Requirements
- Interest in working with large language models, managed cloud platforms, cloud infrastructure, and observability tools.
- At least 5 years of professional experience or significant open-source contributions.
- Comfort with shifting priorities and tackling new technical problems.
- Strong product focus and commitment to customer outcomes.
- Openness to learning from a team with a track record of delivering over $10 billion in value.
- Ability to work onsite in San Francisco three days per week.
- Bonus: experience in a startup setting and familiarity with startup dynamics.
Join our dynamic team at Reka as a GPU Performance Engineer, where you will leverage your expertise in Python and large-scale model training to enhance our training infrastructure. You will play a pivotal role in optimizing model performance, contributing to critical technical decisions, and improving our post-training processes, including reinforcement learning and fine-tuning. Your contributions will also focus on enhancing the efficiency and scalability of our model serving infrastructure.
Full-time|$200K/yr - $550K/yr|On-site|San Francisco
At Magic, we are on a mission to create safe AGI that propels humanity forward in tackling the world's most pressing challenges. We believe that the key to achieving safe AGI lies in automating research and code generation, allowing us to enhance models and ensure alignment more reliably than human capabilities alone. Our innovative approach integrates frontier-scale pre-training, domain-specific reinforcement learning, ultra-long context, and advanced inference-time computing to realize this vision.

About the Role:
We are seeking a passionate individual to spearhead developer experience and data tooling within our pre-training data team. This role involves creating internal tools and infrastructure that enhance team productivity, including dashboards, command-line interfaces (CLIs), data exploration UIs, and the systems that interconnect them. Focusing on developer experience and tooling, we need someone who enjoys solving problems, deploying solutions quickly, and experimenting with new ideas.

Potential Projects:
- Lead tooling initiatives across the architecture: develop systems, implement continuous integration, create CLI utilities, and design internal web interfaces.
- Design internal tools for dataset exploration, data labeling, quality assessment, and data inventory management.
- Enhance data infrastructure ergonomics: optimizing IO patterns in Ray/dataflow jobs, improving dataset tracking, and enhancing pipeline observability.
- Spot opportunities by engaging with the team, understanding their challenges, and proactively refining workflows.
- Elevate standards for code organization, packaging, and engineering best practices.

What We Are Looking For:
Preferred Qualifications
- Solid foundation in software engineering principles.
- Genuine interest in developer experience and best practices for code organization.
- Effective communicator, adept at collaborating with teammates to understand their requirements.
- Proactive mindset: identifies issues and implements solutions.
- Local to San Francisco (this role requires in-office attendance).

Ideal Background (in order of importance)
- Open source contributor: experience with tools similar to Ruff, uv, or other developer-centric projects.
- Experience in build systems and CI: has developed or overseen build systems, CI pipelines, or developer tools on a large scale.
- Data pipeline experience: understanding of optimizing data workflows and data handling.
Product Engineer - Technical Staff Member

At humans&, we are dedicated to pioneering human-centric AI solutions. Our mission revolves around reimagining artificial intelligence to prioritize human connections and interactions. We seek innovative product engineers who excel at conceptualizing and building products that enhance the interactions between people and technology. If you are passionate about this vision and have the skills to bring it to life, we would be excited to connect with you.

Current Technology Stack:
- Backend: Convex (TypeScript)
- Frontend: Next.js, React Native (Expo), Tailwind
Krew is revolutionizing the credit-servicing sector with cutting-edge AI-driven agents designed to enhance financial wellness. Our innovative approach is supported by notable investors such as Long Journey Ventures (Arielle Zuckerberg, Pascal Levy-Garboua) and prominent angels including Ryan Hoover (Founder, Product Hunt), Charlie Songhurst (Board Member, Meta), and Michael Jones (Former Chair, Huntington Bank Ventures). Our team consists of exceptional talent from renowned organizations including Figure Robotics, Optiver, Bain, and the United Nations, alongside top-tier engineers and researchers from institutions like the University of Chicago and Oxford. Currently, our omnichannel agents assist over 100,000 consumers on their journeys towards financial stability. Our ambitious goal is to resolve more than $1 trillion in outstanding consumer delinquencies globally, employing an empathetic and human-centric approach.

Job Overview:
We are seeking a talented Member of Technical Staff with a focus on backend engineering. The ideal candidate is a motivated engineer who prioritizes high-quality code and thrives in a collaborative environment driven by rapid innovation.

Key Responsibilities:
- Architect and develop backend APIs and services that empower our AI agents and operational tools.
- Deliver user-facing features from conception through to deployment (including data modeling, API development, and UI integration).
- Enhance system reliability, observability, and performance across the engineering stack.
- Work closely with design, operations, and compliance teams to ensure the delivery of safe and compliant features.
- Write comprehensive tests, maintain code quality, and engage in design review discussions.
- Contribute to the evolution of our engineering standards and product development methodology.

Minimum Qualifications:
- A minimum of 2 years of full-time professional experience in backend or full-stack engineering.
- Proficiency in backend development technologies such as Python or Go.
- Strong commitment to code quality, including testing, documentation, and code reviews.

Preferred Qualifications:
- Experience in FinTech, Risk Management, or Compliance is advantageous but not essential.
- Familiarity with LLM/AI-powered products, event-driven architectures, or data processing pipelines.
- Knowledge of security, privacy, and compliance best practices.
Join cleara as a Fullstack Engineer, where you will play a pivotal role in developing and optimizing our software solutions. You will collaborate with cross-functional teams to deliver high-quality applications that meet the needs of our clients.
Role overview
Perplexity AI seeks a Software Engineer to join its Technical Staff in San Francisco. This role centers on building new software and enhancing the user experience through direct development work.

What you will do
Expect to design and implement new features, work on improving existing systems, and contribute to projects that shape how users interact with Perplexity AI's products. The position involves hands-on coding and collaboration with other engineers.

Location
This position is based in San Francisco.
Join Composio, where we are revolutionizing the infrastructure that empowers agents to seamlessly connect with the tools you utilize daily, including GitHub, Gmail, Notion, Salesforce, and more. Our dedicated team of engineers is tackling challenges from context management to search optimization, striving to create the most efficient bridge between your agents and their essential tools.

Having secured $25M in Series A funding from Lightspeed, along with support from prominent angels such as Guillermo Rauch (CEO of Vercel), Dharmesh Shah (CTO of HubSpot), and Gokul Rajaram, we have experienced significant growth, tripling our ARR this year. Our customers range from fellow Y Combinator alumni to established companies like Wabi, Glean, and Zoom.

Your Responsibilities
- Enhance our platform primitives and APIs, including authentication, automatic refreshes, triggers, tool search, planning, and sandbox management.
- Oversee multiple runtimes for code execution across Lambdas and Firecracker.
- Optimize performance through tracing, CPU/heap profiling, database query enhancements, and workflow optimization.
- Collaborate closely with product engineering teams and customers to effectively manage their workloads and improve our product.
- Produce clear and comprehensive documentation.

Essential Qualifications
- Core platform engineering skills: extensive experience in scaling backend distributed systems, maintaining reliable systems while delivering quickly, and managing multiple platform components simultaneously.
- AI expertise: familiarity with building and working with language models.
- Linux proficiency: comfortable working in a Linux environment.
- Effective communication: ability to write well-structured documentation and articulate complex ideas clearly.
- Interpersonal skills: cultivate trust and acknowledge areas for growth.

Preferred Qualifications
- Experience with cloud infrastructure and serverless architecture.
At Krew, we are dedicated to revolutionizing the credit-servicing landscape with cutting-edge AI-driven solutions. Our world-class team is supported by prominent investors, including Long Journey Ventures, Ryan Hoover, Charlie Songhurst, and Michael Jones. Our diverse team brings expertise from renowned organizations such as Figure Robotics, Optiver, Bain, and the United Nations, comprising engineers and researchers from prestigious institutions like UChicago and Oxford. Our innovative omnichannel agents are currently assisting over 100,000 consumers on their journey toward financial well-being. We aspire to address over $1 trillion in consumer delinquencies globally through a compassionate and human-centric approach to financial wellness.
Dec 15, 2025