About the job
Join Perplexity, a cutting-edge company serving millions of users each day with high-quality answers powered by an LLM-first search engine and specialized data sources. We strive to adopt the latest models as they become available, navigating the frontier of model capability where traditional benchmarks often fall short. In this role, you will build specialized evaluations to improve answer quality across Perplexity, focusing on search-grounded LLM responses and other common user scenarios.
Responsibilities
Design and manage automated evaluation pipelines that measure answer quality across Perplexity's products, ensuring adherence to high standards of accuracy and usefulness.
Create tailored evaluation datasets and methodologies to assess how tool calls, particularly web search retrieval, affect final answer quality.
Develop VLM-based solutions to programmatically analyze the visual rendering of final answers across various platforms and devices.
Continuously evaluate public benchmarks and academic evaluations for their relevance to Perplexity's products, adapting and integrating them into our ongoing performance evaluations.
Collaborate within a small, high-impact team where your evaluation metrics will directly influence product enhancements, working closely with technical leadership to measure and elevate Answer Quality.
Qualifications
PhD or MS in a technical discipline or equivalent practical experience.
A minimum of 4 years of experience in data science or machine learning.
Proficient in Python and SQL, with the ability to write production-quality code.
Experience with modern cloud data stacks, particularly AWS and Databricks.
Familiarity with agentic coding workflows and utilizing AI-assisted development tools for efficient iteration.
Preferred Qualifications
At least 1 year of experience working with LLMs at scale, especially in LLM-as-a-judge setups.
Experience developing customer-facing web products or consumer applications with significant user traffic.
A robust research background, demonstrating the application of research methodologies to real-world machine learning challenges.
Experience in defining evaluation metrics, such as factual consistency, hallucination rate, and retrieval precision, along with creating ground truth datasets.