About the job
Join c-the-signs as a Senior MLOps Engineer and apply your machine learning engineering expertise to build and manage a cutting-edge production platform that drives ML and LLM-based healthcare workflows. You will design reliable, secure, and compliant systems that support model development, evaluation, deployment, monitoring, and continuous improvement. You will work closely with ML, data science, security, and product teams to deliver these systems.
This opportunity is perfect for those who have successfully deployed ML systems in production environments and are passionate about LLM orchestration, RAG, evaluations, guardrails, and observability, particularly within regulated sectors.
Key Responsibilities
MLOps & ML Platform
- Design and manage end-to-end ML platforms covering data ingestion, feature engineering, training, evaluation, deployment, and monitoring.
- Develop and maintain CI/CD processes for ML, including testing, packaging, versioning, reproducibility, automated rollbacks, and approvals.
- Adopt MLOps best practices, including model registry, experiment tracking, lineage, governance, and reproducible training environments.
- Build scalable training infrastructure with distributed training, GPU scheduling, cost controls, and autoscaling.
- Establish and maintain feature pipelines and feature stores, ensuring alignment between training and inference to prevent training-serving skew.
- Implement comprehensive model monitoring and observability, covering performance metrics, drift detection, bias/fairness signals, latency, throughput, and data quality.
- Develop and manage LLM delivery pipelines, encompassing prompt versioning, retrieval, orchestration, evaluation, deployment, monitoring, and iterative improvement.
- Create robust LLM evaluation frameworks that include offline and online components, automated regression testing, human-in-the-loop review workflows, and risk assessment.
- Establish cost management protocols focusing on token/cost budgeting, caching strategies, autoscaling, and performance optimization.
Deployment, Reliability, and Operations
- Deploy ML models on GCP using containers and orchestration tools (e.g., GKE, Cloud Run), and build CI/CD for ML/LLM systems with automated tests and safe rollouts.
- Implement observability practices, including tracing, metrics, logs, dashboards, and alerts, for model and system health signals such as latency, token usage, error rates, and drift indicators.
Data, Governance, and Compliance (Healthcare)
- Design systems with a focus on security and privacy from the outset, incorporating IAM, least privilege, secrets management, audit logs, encryption, and PHI/PII handling protocols.
- Establish governance frameworks that cover model/prompt lineage, dataset provenance, evaluation traceability, and approval workflows in line with healthcare compliance standards.

