About the job
About Phare & R1
At Phare, we are building a Revenue Operating System: an AI-powered platform that simplifies hospital billing and reimbursement and makes it more accurate and fair. As part of R1, a leading healthcare claims management company serving hundreds of systems nationwide, we blend the agility of a startup with the resources of an established healthcare organization. Join us as we work toward a more equitable and efficient model for healthcare payments.
The Role
As a Software Engineer focused on MLOps, you will own the production runtime of Phare’s machine learning stack. You will deploy, serve, and scale models across inference endpoints and manage batch and streaming workflows. You will build robust delivery pipelines with automated rollouts and rollbacks, uphold service level objectives (SLOs) for latency and availability, and implement end-to-end observability. You will use Terraform, Kubernetes, and CI/CD to harden our platform and keep ML releases reproducible and auditable.
We are looking for candidates at various seniority levels, from mid-level to staff positions. A minimum of 5 years of software engineering experience, including at least 2 years in MLOps, is required.
This position requires in-person attendance in our SoHo office at least 3 days a week.
About You
You have a solid background in running ML systems at scale, where uptime and fast feedback loops matter as much as accuracy. Your experience includes:
Production ML: Proven expertise in deploying and operating models on GPUs in production environments, including APIs and batch/streaming inference.
Platform Engineering: Strong proficiency in Docker/Kubernetes, Infrastructure as Code (e.g., Terraform), and CI/CD processes for services and model artifacts, ensuring environment consistency, reproducible releases, and robust model versioning with data lineage.
System Reliability: Experience implementing progressive delivery with automated rollouts and rollbacks, and establishing end-to-end observability (metrics, logs, traces, and model telemetry for drift and regression), backed by actionable alerting, runbooks, and incident response protocols.
Post-Training Lifecycles: Competence in managing model registries and stage gates, and in designing scheduled or event-driven retraining processes.

