Experience Level
Mid to Senior
Qualifications
Key Responsibilities:
- Support fal in maintaining its leading position in model performance for generative media models.
- Design and implement cutting-edge approaches to model serving architecture on our in-house inference engine, emphasizing throughput maximization while minimizing latency and resource use.
- Develop tools for performance monitoring and profiling to identify bottlenecks and areas for optimization.
- Work closely with our Applied ML team and media sector clients to ensure their workloads benefit from our accelerator.

Requirements:
- Solid foundation in systems programming with a keen ability to identify and resolve bottlenecks.
- In-depth knowledge of advanced ML infrastructure, including technologies such as PyTorch, TensorRT, TransformerEngine, and Nsight, encompassing model compilation, quantization, and serving architectures.
- Strong understanding of underlying hardware (currently Nvidia-based systems), with the ability to delve deeper into the stack to fix issues, including writing custom GEMM kernels with CUTLASS for common shapes.
- Proficiency in Triton or a willingness to learn, along with comparable experience in lower-level accelerator programming.
- Experience with multi-dimensional model parallelism, integrating techniques such as tensor parallelism and context/sequence parallelism.
- Familiarity with the internals of Ring Attention, FA3, and FusedMLP implementations.
About the job
Join fal in our pursuit to maintain a leading edge in model performance for generative media models. You'll be instrumental in designing and implementing innovative solutions for model serving architecture, built on our proprietary inference engine. Your focus will be on maximizing throughput while minimizing latency and resource consumption. In addition, you will create performance monitoring and profiling tools to identify bottlenecks and optimization opportunities. You'll also collaborate closely with our Applied ML team and clients in the media sector to ensure their workloads leverage our accelerator effectively.
About fal
fal is at the forefront of innovation in generative media models, continually advancing our technologies to deliver exceptional model performance. We pride ourselves on fostering a collaborative environment where creative minds can thrive and contribute to groundbreaking projects.