About the job
Why Join Nebius?
Nebius is at the forefront of a transformative era in cloud computing, building infrastructure designed to empower the global AI economy. We provide innovative tools and resources that enable our clients to tackle real-world challenges and revolutionize industries, all while minimizing infrastructure costs and eliminating the need for extensive in-house AI/ML teams. Our workforce operates at the cutting edge of AI cloud infrastructure, collaborating with some of the industry’s most experienced and pioneering leaders and engineers.
Where We Operate
Based in Amsterdam and publicly listed on Nasdaq, Nebius boasts a worldwide presence with research and development hubs in Europe, North America, and Israel. Our team of over 1,400 professionals includes more than 400 highly skilled engineers, proficient in both hardware and software engineering, alongside a dedicated in-house AI research and development team.
The Role
Nebius is seeking a talented Senior Site Reliability Engineer to join our Hardware Infrastructure team. You will have the opportunity to work from our vibrant office in Amsterdam.
The Hardware Infrastructure team is responsible for designing, developing, and maintaining systems integral to the data center lifecycle:
- Functional and load testing systems.
- Monitoring engineering equipment in our data centers (power supply, air and water cooling, etc.).
- Monitoring IT assets: racks, servers, JBODs, JBOGs, power shelves, network devices, etc.
- Asset management and tracking.
- Tracking hardware repair tasks.
- Server production oversight.
Your Responsibilities Will Include:
- Ensuring fault tolerance, scalability, and uninterrupted service operation.
- Utilizing state-of-the-art technologies to address various infrastructure challenges.
- Implementing and refining CI/CD processes.
We Expect You to Have:
- Expertise in Linux systems, alongside proficiency in Python and Bash scripting for automation.
- A proven track record of troubleshooting complex system issues, encompassing hardware, software, and networking.
- Strong analytical and problem-solving skills, applied to optimizing system performance.
- Proficiency in English.
Bonus Skills:
- An interest in backend development.
- Experience in designing, developing, and managing high-load distributed systems.
