About the job
Serve Robotics designs and operates sidewalk delivery robots that support local businesses and help ease street congestion. These robots have already completed deliveries in cities including Los Angeles, Miami, Dallas, Atlanta, and Chicago, turning robotic delivery into a routine service.
The team at Serve Robotics brings together expertise in software, hardware, and design. Machine learning and computer vision play a central role in tackling practical challenges, always with attention to user experience. Collaboration and mutual respect shape the company’s approach, and teamwork is seen as essential to finding effective solutions.
Role overview
The Senior Reliability Operations Engineer in Sweden focuses on ensuring the reliability of Serve Robotics’ systems in the region. This position leads incident response, manages escalations, and provides Tier 2 support for both robotic and cloud platforms. The role involves refining runbooks, automations, and operational processes, working closely with product engineering and Site Reliability Engineering (SRE) teams. As the main incident lead for Sweden, this engineer ensures that problems are resolved efficiently and that updates reach the right stakeholders.
Main responsibilities
- Lead incident response during regional daytime hours, managing technical investigations, centralizing communication, and coordinating with engineering and SRE teams for escalations.
- Handle escalated issues from Tier 1 support by using runbooks, metrics, logs, and system diagnostics to troubleshoot and resolve problems, and decide when to escalate further.
- Create and maintain runbooks, workflows, and operational documentation to ensure consistent handling of recurring issues, and collaborate with product teams to expand documentation.
- Develop and enhance automation scripts and tools to streamline remediation steps, improve response times, and reduce manual intervention.
- Use metrics, logs, and traces from tools such as Grafana, Prometheus, GCP Monitoring, and OpenTelemetry to detect issues proactively, monitor system performance, and strengthen detection methods.

