Senior Site Reliability Engineer (SRE) - Compute Node Team

Nebius

Amsterdam, Netherlands; Remote - Europe

Remote Full-time

Experience Level

Senior

Qualifications

We Expect You to Have:

In-depth Linux expertise, including a comprehensive understanding of both user and kernel spaces.
Knowledge of kernel subsystems and their intricacies.

About the job

Why Choose Nebius?
Nebius is at the forefront of revolutionizing cloud computing, catering specifically to the global AI economy. Our mission is to provide our clients with the essential tools and resources needed to tackle real-world challenges and innovate industries, all without incurring hefty infrastructure expenses or the necessity of assembling large in-house AI/ML teams. Join us and collaborate with some of the brightest minds in AI cloud infrastructure, alongside seasoned leaders and engineers.

Where We Operate
Founded in Amsterdam and publicly traded on Nasdaq, Nebius boasts a worldwide presence with R&D centers located throughout Europe, North America, and Israel. Our workforce comprises over 1,400 dedicated professionals, including more than 400 highly skilled engineers proficient in both hardware and software engineering, complemented by a dedicated in-house AI R&D team.

Your Role

As a Senior Site Reliability Engineer (SRE) within the Compute Node team at Nebius AI Cloud, you will play a pivotal role in constructing and managing the cluster scheduler and node-level services that oversee and maintain virtual machines across our cloud regions. The focus of this role is on Linux systems engineering, virtualization, and operational reliability. You will work closely with the operating system and hypervisor, influencing the integration of reliability and observability within the Compute platform.

Your Key Responsibilities:

Guarantee the reliability, availability, and performance of compute nodes hosting virtual machines.
Analyze and troubleshoot Linux systems at both user and kernel space, recognizing their capabilities, limitations, and trade-offs.
Resolve intricate production issues involving CPU, memory, NUMA, cgroups, and scheduling.
Engage hands-on with virtualization and containerization using QEMU/KVM and Linux-based technologies.
Develop and enhance observability as a core capability of the node layer, including metrics, logs, traces, alerts, SLIs, and SLOs.
Lead incident response efforts, conduct root-cause analyses, and perform postmortems, driving long-term enhancements in reliability.
Work in close partnership with platform, kernel/hypervisor, GPU, and infrastructure teams to refine system design and operability.

About Nebius

Nebius is a pioneering company in the realm of cloud computing, dedicated to addressing the needs of the global AI economy. Our innovative solutions enable businesses to overcome real-world challenges and innovate, all while keeping infrastructure costs manageable. With a strong team of experienced engineers and leaders, we are reshaping the future of AI cloud infrastructure.

Similar jobs

Site Reliability Engineer for Cutting-Edge Machine Learning Platform

dev2

Full-time|On-site|Amsterdam

Join our innovative team at dev2 as a Site Reliability Engineer, where you'll play a pivotal role in enhancing our cutting-edge Machine Learning Platform. You will be responsible for ensuring the reliability, availability, and performance of our systems while collaborating with cross-functional teams to implement best practices in software engineering and op…

Nov 7, 2021

Senior Site Reliability Engineer at Nebius | Amsterdam

Nebius

Full-time|On-site|Amsterdam, Netherlands

Why Join Nebius?
Nebius is at the forefront of a transformative era in cloud computing, designed to empower the global AI economy. We provide innovative tools and resources that enable our clients to tackle real-world challenges and revolutionize industries, all while minimizing infrastructure costs and eliminating the necessity for extensive in-house AI/ML teams. Our workforce operates at the cutting edge of AI cloud infrastructure, collaborating with some of the industry’s most experienced and pioneering leaders and engineers.

Where We Operate
Based in Amsterdam and publicly listed on Nasdaq, Nebius boasts a worldwide presence with research and development hubs in Europe, North America, and Israel. Our team of over 1,400 professionals includes more than 400 highly skilled engineers, proficient in both hardware and software engineering, alongside a dedicated in-house AI research and development team.

The Role
Nebius is seeking a talented Senior Site Reliability Engineer to join our Hardware Infrastructure team. You will have the opportunity to work from our vibrant office in Amsterdam.

The Hardware Infrastructure team is responsible for designing, developing, and maintaining systems integral to the data center lifecycle:

Functional and load testing systems.
Monitoring engineering equipment in our data centers (power supply, air and water cooling, etc.).
Monitoring IT assets: racks, servers, JBODs, JBOGs, power shelves, network devices, etc.
Asset management and tracking.
Tracking hardware repair tasks.
Server production oversight.

Your Responsibilities Will Include:

Ensuring fault tolerance, scalability, and uninterrupted service operation.
Utilizing state-of-the-art technologies to address various infrastructure challenges.
Implementing and refining CI/CD processes.

We Expect You to Have:

Expertise in Linux systems, alongside proficiency in Python and Bash scripting for automation.
A proven track record of troubleshooting complex system issues, encompassing hardware, software, and networking.
Strong analytical skills and adept problem-solving capabilities, aimed at optimizing system performance.
Proficiency in English.

Bonus Skills:

An interest in backend development.
Experience in designing, developing, and managing high-load distributed systems.

Apr 30, 2026

Senior Site Reliability Engineer - Token Factory (Inference Platform)

Nebius

Full-time|Remote|Amsterdam, Netherlands; Berlin, Germany; London, United Kingdom; Prague, Czech Republic; Remote - Europe; Remote - United States; United States

Why join Nebius?
Nebius is at the forefront of a revolutionary shift in cloud computing, dedicated to empowering the global AI economy. We provide innovative tools and resources that enable our clients to tackle real-world challenges and revolutionize their industries without incurring substantial infrastructure costs or the necessity of assembling extensive in-house AI/ML teams. Our workforce operates on the cutting edge of AI cloud infrastructure, collaborating with some of the most seasoned and creative leaders and engineers in the industry.

Our Work Environment
Headquartered in Amsterdam and publicly traded on Nasdaq, Nebius boasts a worldwide presence with R&D centers across Europe, North America, and Israel. Our team consists of over 1,400 professionals, including more than 400 highly skilled engineers with profound expertise in both hardware and software engineering, complemented by an in-house AI R&D team.

As part of Nebius Cloud, one of the largest GPU clouds globally, the Token Factory team operates tens of thousands of GPUs. We are developing an inference platform designed to deploy a variety of foundation models (including text, vision, audio, and cutting-edge multimodal architectures) quickly, dependably, and effortlessly at scale. To achieve this goal, we are seeking an engineer capable of ensuring the platform operates flawlessly under heavy loads and can recover seamlessly from unexpected issues.

In this position, you will take ownership of the reliability, performance, and observability of the complete inference stack. Your day may start with designing and refining telemetry pipelines, turning hundreds of terabytes of signals into actionable insights through metrics, logs, and traces. You might also optimize Kubernetes autoscalers for enhanced GPU efficiency, create Terraform modules that incorporate resilience into every new cluster, or strengthen our request-routing and retry logic to ensure that transient failures remain unnoticed by users. When incidents occur, you will utilize the automation and runbooks you’ve developed to swiftly detect, isolate, and address issues, while fostering a post-mortem culture to prevent future occurrences. All these efforts are directed towards a singular objective: achieving smooth platform scaling while meeting rigorous cost and reliability targets.

Success in this role requires a deep understanding of Kubernetes, Prometheus, Grafana, Terraform, and the principles of infrastructure-as-code. You should be comfortable scripting in Python or Bash, grasp the intricacies of alert design and SLOs for high-throughput APIs, and have enough production experience to recognize how distributed back-ends can fail in real-world scenarios. Experience managing GPU-intensive workloads (whether with vLLM, Triton, Ray, or a similar accelerator stack) will be advantageous, as will a background in MLOps or model-hosting platforms.

Apr 23, 2026

Site Reliability Engineer | Trading Operations

Jump Trading

Full-time|On-site|Amsterdam

Join Jump Trading as a Site Reliability Engineer in our Trading Operations team. In this pivotal role, you will ensure the reliability and performance of our trading systems, utilizing your expertise to implement best practices in system design and operations.

Your responsibilities will include monitoring system performance, troubleshooting issues, and collaborating with software engineers to improve system architecture. Your contributions will play a critical role in maintaining our competitive edge in the trading industry.

Mar 30, 2026

Senior Site Reliability Engineer (SRE)

Nebius

Full-time|Remote|Amsterdam, Netherlands; Israel; Remote - Europe

Why choose Nebius?
Nebius is at the forefront of revolutionizing cloud computing to empower the global AI economy. We develop essential tools and resources that enable our clients to tackle real-world problems and innovate across industries, all without incurring substantial infrastructure costs or the necessity of assembling large in-house AI/ML teams. Our team operates at the cutting edge of AI cloud infrastructure, collaborating with some of the most experienced and innovative leaders and engineers in the industry.

Our Work Environment
With our headquarters in Amsterdam and a presence on Nasdaq, Nebius boasts a global footprint with R&D hubs across Europe, North America, and Israel. Our workforce of over 1400 includes more than 400 expert engineers with extensive experience in hardware and software engineering, alongside a dedicated in-house AI R&D team.

The Role

Your responsibilities will include:

Ensuring fault tolerance, scalability, and uninterrupted operations for our services.
Utilizing cutting-edge cloud technology to address various infrastructure challenges.
Implementing and enhancing CI/CD processes.

We expect you to have:

Strong experience with programming languages such as Go, Python, or C++.
A solid understanding of classic algorithms and data structures.
Commercial experience with and a deep understanding of Unix systems and networking technologies.
Experience with containerization and configuration management tools like Ansible, Salt, Terraform, Docker, Kubernetes, and Helm.

Bonus points for:

A keen interest in backend development.
Experience in designing, developing, and managing high-load distributed systems.
Commercial experience across various cloud platforms.

Coding interviews are part of our hiring process.

What we offer:

A competitive salary and a comprehensive benefits package.
Opportunities for professional advancement within Nebius.
Flexible working arrangements.
A dynamic, collaborative work environment that fosters initiative and innovation.

Apr 23, 2026

Site Reliability Engineer at airapps | Amsterdam

airapps

Full-time|On-site|Amsterdam

airapps is seeking a Site Reliability Engineer (SRE) based in Amsterdam. This position centers on maintaining the reliability, scalability, and performance of core systems.

Role overview
The SRE works alongside both development and operations teams. The main focus is to keep infrastructure running smoothly and to improve service quality for users.

What you will do

Monitor and support system reliability and uptime.
Collaborate with developers and operations staff to optimize infrastructure.
Contribute to enhancing the overall user experience by ensuring stable services.

Location
This role is based in Amsterdam.

Apr 28, 2026

Machine Learning Performance Engineer

Pinely

Full-time|On-site|Amsterdam, North Holland, Netherlands

Join our innovative team at Pinely as a Machine Learning Performance Engineer. We are on a mission to accelerate large-scale model training by optimizing our internal infrastructure and computing stack. In this pivotal role, you will engage with the entire training pipeline, from GPU kernels to system-wide throughput, utilizing profiling, CUDA-level tuning, and advanced distributed systems methodologies. Your contributions will be vital in minimizing training durations, enhancing iteration speeds, and maximizing computational efficiency.

As a key member of our growing team, you will help cultivate deep technical expertise in ML training systems.

Responsibilities:

Enhance our model training pipeline to increase speed and reliability, facilitating quicker and more effective experimentation.
Utilize GPU optimization techniques via tools like JAX, Triton, and low-level CUDA to elevate training performance and efficiency at scale.
Diagnose and rectify performance bottlenecks throughout the ML pipeline, from data loading and preprocessing to CUDA kernels.
Develop tools and expand our internal infrastructure to enable scalable, reproducible, and high-performance training workflows.
Guide and mentor engineers and researchers in implementing performance best practices across the team.
Assist in enhancing the team's capabilities in GPU and systems-level expertise, contributing to a culture of engineering excellence and rapid experimentation.

Requirements:

Proven experience optimizing neural network training in production or large-scale research environments, such as reducing training time, enhancing hardware utilization, or expediting feedback cycles for ML researchers.
Extensive hands-on experience with ML frameworks like PyTorch or JAX.
Practical experience training and optimizing deep learning architectures, including LSTM and Transformer-based models with various attention mechanisms.
Familiarity with CUDA, Triton, or other low-level GPU technologies for performance tuning.
Expertise in profiling and debugging training pipelines using tools like Nsight, cprofiler, CUDA, gdb, or torch profiler.
Comprehension of distributed training concepts including data/model/tensor/sequence/pipeline/context parallelism and memory-compute trade-offs.
A collaborative and proactive approach, coupled with strong communication skills and the ability to mentor team members effectively.
Strong proficiency in Python for developing infrastructure-level tools, debugging training systems, and integrating with ML frameworks and profiling tools.

What We Offer:

Competitive salary and comprehensive social benefits.
Attractive bonus structure; we are flexible in discussions regarding salary and employment conditions.
Access to state-of-the-art hardware and software in production, alongside a highly skilled technical team.

Sep 26, 2025

Site Reliability Engineer at pinely | Amsterdam

pinely

Full-time|On-site|Amsterdam, North Holland, Netherlands

Join pinely as we expand our innovative team! We are seeking a dedicated Site Reliability Engineer who thrives in a dynamic environment.

Key Responsibilities:

Deploy, configure, and manage Linux-based servers efficiently.
Diagnose and resolve hardware and network availability issues while monitoring for failures.
Oversee numerous nodes across various remote sites and cloud infrastructures.
Contribute to infrastructure automation initiatives using Python and/or Go.
Engage with cloud platforms including AWS, Google Cloud, and Alibaba Cloud.
Enhance monitoring systems for production trading environments utilizing Grafana.

Required Qualifications:

A minimum of 3 years of experience in managing and troubleshooting high-load systems.
Strong grasp of the Linux TCP/IP stack.
Familiarity with essential network components such as DHCP, DNS, and BGP.
Proficiency in at least one configuration management tool (e.g., Salt, Ansible).
Extensive knowledge of infrastructure monitoring tools, including Prometheus and Grafana.
Fluent in English (B2/Upper-Intermediate or above).
Basic skills in Python/Bash/Go.
Willingness to travel for work-related tasks.

Preferred Qualifications:

Familiarity with leading server hardware brands.
Experience optimizing hardware and OS configurations for peak performance.

What We Offer:

Competitive salary and comprehensive social benefits.
Attractive bonus structure with flexibility in salary negotiations.
Opportunity to work with unique networks such as radio relay, shortwave, FPGA cards, and atomic clocks, including server optimization on overclocked systems.
Access to cutting-edge technologies and a supportive environment for implementing innovative solutions.
Flexible working conditions, minimizing bureaucracy and promoting autonomy.
Tuition reimbursement and sponsorship for conferences and training.

Feb 25, 2026

Machine Learning Engineer

Sia

Full-time|On-site|Amsterdam

Join Sia as a Machine Learning Engineer and play a pivotal role in advancing our innovative technology solutions. You will collaborate with a talented team of data scientists and software developers to design, implement, and optimize machine learning models that drive our projects forward.

In this position, you will have the opportunity to work on cutting-edge technologies and methodologies, contributing to impactful projects that enhance user experiences and improve operational efficiencies.

Mar 31, 2026

Machine Learning Engineer

bloomon

On-site|On-site|Amsterdam

Join Bloom & Wild as a Machine Learning Engineer and contribute to our mission of transforming the gifting experience. As part of a dynamic data science team, you will leverage cutting-edge technologies to develop machine learning models that enhance our customer offerings. You will work collaboratively with data engineering, product teams, and business intelligence to explore innovative data-driven solutions. Your expertise in Python and AWS will be vital in maintaining and optimizing our existing models while exploring new opportunities for growth.

Dec 23, 2025

Azure Machine Learning Engineer at Devoteam | Amsterdam

Devoteam

Full-time|On-site|Amsterdam

As an Azure Machine Learning Engineer, you will be at the forefront of our AI and Machine Learning practice in the Netherlands. You will lead the way in the Azure AI domain and play a key role in shaping our AI strategy, both internally and for our clients. Your ability to merge technical depth with a pragmatic approach will enable you to successfully bring AI solutions into production.

You will serve as a vital partner to both consultants and clients, helping them maximize the power of AI within the Microsoft ecosystem. Your expertise will encompass Azure Machine Learning, Azure OpenAI, Cognitive Services, Fabric, Databricks, and MLOps solutions in Azure.

Your contributions will advance our AI knowledge within the team, collaborate with international colleagues, and support our commercial teams in designing innovative AI and ML projects. You will work on end-to-end solutions: from use case discovery and model development to deployment, monitoring, and optimization.

Your role is crucial for our growth in the AI domain, helping position Devoteam as a thought leader in Azure AI and contributing to projects that deliver genuine business impact.

Feb 26, 2026

Lead Principal Machine Learning Engineer

IMC Trading

Full-time|$200K/yr - $250K/yr|On-site|Amsterdam, Netherlands; Chicago, United States; Hong Kong, Hong Kong; London, United Kingdom; New York, United States; Sydney, Australia

At IMC Trading, we recognize that technology is the cornerstone of our competitive advantage, and machine learning plays a pivotal role in our trading strategies. In recent years, we have diligently enhanced our machine learning capabilities by building robust infrastructure, expanding our in-house GPU cluster, deploying models into production, and collaborating closely with quantitative researchers and traders to create significant impact. As we continue to grow, we are looking to expand our team, enhance our systems, and accelerate the integration of deep learning into our research and execution workflows.

We are seeking a Lead Principal Machine Learning Engineer to guide the next evolution of our platform: influencing architecture, promoting best practices, and addressing high-impact challenges. You will collaborate with researchers and technologists to design systems that facilitate experimentation, training, and deployment of machine learning models, while also helping to define the future approach to machine learning at IMC as we scale. If you have experience building machine learning infrastructure at scale and want to play a key role in shaping our firm's trajectory, we invite you to connect with us.

Mar 12, 2026

Senior Machine Learning Engineer

KPN

Full-time|On-site|Amsterdam

Role Overview
KPN is looking for a Senior Machine Learning Engineer to join the team in Amsterdam. This role focuses on building and refining machine learning models that shape both internal operations and customer-facing services.

What You Will Do

Design, develop, and implement advanced machine learning algorithms.
Work closely with colleagues to create AI solutions that improve KPN’s offerings.
Contribute expertise to projects that directly influence customer experience and business processes.

Location
This position is based in Amsterdam.

Apr 14, 2026

Senior MLOps Engineer - Machine Learning Workflows

JetBrains s.r.o.

Full-time|Remote|Amsterdam, Netherlands; Belgrade, Serbia; Berlin, Germany; Limassol, Cyprus; Munich, Germany; Paphos, Cyprus; Prague, Czech Republic; Remote, Germany; Warsaw, Poland; Yerevan, Armenia

At JetBrains, we are passionate about code. Since our inception in 2000, our mission has been to develop the most powerful and effective developer tools available. By automating routine checks and corrections, our tools accelerate production, allowing developers to explore, innovate, and create. As AI-driven support becomes integral to our IDEs, the ML Workflows Engineering team focuses on eliminating infrastructure challenges, optimizing machine learning operations (MLOps), and empowering teams to concentrate on their most impactful work—developing groundbreaking ML models and intelligent agents. In this role, you will significantly contribute to designing tools, automation, and pipelines that facilitate a seamless and intuitive machine learning development experience. By embracing advanced MLOps practices and engineering excellence, we strive to enhance productivity and simplify ML infrastructure, enabling our teams to push the limits of AI innovation.

Feb 19, 2026

Senior Machine Learning Engineer at KPN | Amsterdam

KPN

Full-time|On-site|Amsterdam

About the Role
KPN is looking for a Senior Machine Learning Engineer in Amsterdam. This role focuses on designing and building advanced machine learning models that support and improve KPN’s telecommunications and IT services.

What You Will Do

Create and implement machine learning solutions tailored to business needs.
Work closely with teams across disciplines to analyze data and extract insights.
Help shape technology decisions and contribute to ongoing innovation at KPN.

Location
This position is based in Amsterdam.

Apr 14, 2026

Senior Site Reliability Engineer (SRE) - Compute Node Team

Nebius

Full-time|Remote|Amsterdam, Netherlands; Remote - Europe

Apr 23, 2026

Machine Learning Engineer at Altas Technologies | IMC

IMC Trading

Full-time|On-site|Amsterdam, Netherlands

At IMC, technology is the foundation of our operations. Our cutting-edge proprietary software drives millions of trading decisions every day, allowing us to stay ahead of the competition through rapid and efficient decision-making. In 2023, IMC expanded its capabilities by acquiring Altas Technologies, a dynamic algorithmic trading firm dedicated to developing the most advanced trading stack for the future. This strategic acquisition combines Altas's sophisticated trading strategies with IMC's robust execution and scaling capabilities, reinforcing our market position and paving the way for sustained stability and growth.

The Role
You will join a dedicated team of engineers collaborating closely with a larger group of researchers. This structure is intentional: it drives us to create high-quality, efficient systems. With limited personnel, we must ensure the reliability of our systems. You will take full ownership of what you build: from design, through implementation, to maintenance. The role encompasses a wide range of responsibilities: real-time ML inference powering trading strategies, the large-scale data infrastructure that supports it, and the research platform that enables quantitative researchers to discover alpha. If you prefer to specialize in just one area and delegate the rest, this may not be the position for you. We delve deeply into every aspect of our work, which presents exciting challenges and continuous learning opportunities.

How We Work
Competence is our standard, not our distinguishing factor. While individual performance is important, what truly differentiates us is the caliber of our ideas and our collaborative spirit: we value discussion and mutual respect, leaving egos at the door.

Responsibilities

Develop and refine the real-time inference system: delivering ML predictions with sub-millisecond latency as part of a live trading system.
Construct and sustain petabyte-scale data and ML infrastructure: ensuring high-throughput ingestion into our data lake, orchestrating end-to-end ML pipelines, and managing everything in between.
Facilitate alpha research and transition it to production: create the research platform that researchers rely on daily, ensuring rapid loading of TB-scale datasets, providing horizontally scalable compute for experiments, conducting feature engineering, and performing backtesting.
Collaborate with researchers to rapidly implement their concepts into production without compromising quality.

What We Expect From You
You should be proficient in a systems programming language (preferably Rust or C++) along with Python. Mastery of both is essential for navigating the entire stack.

Mar 25, 2026

Senior Network Site Reliability Engineer at Nebius | Amsterdam, Netherlands

Nebius

Full-time|Remote|Amsterdam, Netherlands; Remote - Europe

Why Join Nebius
Nebius is pioneering a transformative era in cloud computing, tailored to meet the demands of the global AI economy. We provide the essential tools and resources that empower our clients to address real-world challenges and revolutionize their industries without incurring substantial infrastructure costs or assembling large in-house AI/ML teams. Our workforce is engaged at the forefront of AI cloud infrastructure, collaborating with some of the most talented and innovative leaders and engineers in the industry.

Our Work Environment
Headquartered in Amsterdam and publicly traded on Nasdaq, Nebius boasts a worldwide presence with R&D centers across Europe, North America, and Israel. Our diverse team of over 1400 professionals includes more than 400 highly skilled engineers, well-versed in both hardware and software engineering, complemented by an in-house AI R&D team.

The Role
We are seeking a Network Site Reliability Engineer (NetSRE) to play a critical role in developing and maintaining the foundational infrastructure of Nebius: the Network, which is essential for all other services. This engineering-centric SRE position will involve defining clear reliability objectives and implementing the necessary tooling and automation to achieve them, while enhancing the operational safety of the network as we scale rapidly.

Your Responsibilities Will Include:

Establish and oversee reliability benchmarks for network services and critical pathways (including SLIs/SLOs, availability targets, and error budgets as applicable).
Enhance reliability across the entire network, focusing not just on services, but also on site readiness, inter-site connectivity (DCI), and operational protocols.
Lead incident response efforts in your areas, directing investigations/postmortems and transforming failures into sustainable solutions rather than recurring issues.
Develop and refine observability tools including actionable metrics, logs, traces, alerting systems, and expedited debugging processes.

Apr 30, 2026

Site Reliability Engineer (SRE) - AI Infrastructure (Entry Level)

Nebius

Internship|On-site|Amsterdam, Netherlands

Why Join Nebius?
Nebius is at the forefront of a transformative wave in cloud computing, dedicated to empowering the global AI economy. We provide essential tools and resources that enable our customers to tackle real-world challenges and revolutionize industries, all while avoiding exorbitant infrastructure expenses and the necessity of large in-house AI/ML teams. Our staff operates at the leading edge of AI cloud infrastructure, collaborating with some of the most innovative leaders and engineers in the field.

Our Work Environment
Based in the vibrant city of Amsterdam and publicly traded on Nasdaq, Nebius boasts a worldwide presence with research and development hubs across Europe, North America, and Israel. Our diverse team of over 1400 professionals includes more than 400 highly skilled engineers, bringing extensive expertise in both hardware and software engineering, complemented by a dedicated in-house AI R&D team.

Position Summary:

Location: Amsterdam
Duration: 3 months
Start Date: June 2026
Compensation: Paid
Eligibility: Current university student pursuing a degree in Computer Science or a related field, recent graduates, or early career professionals
Work Authorization: Authorized to work in the job's location

Apr 23, 2026

Director of Machine Learning

JetBrains s.r.o.

Full-time|Remote|Amsterdam, Netherlands; Belgrade, Serbia; Berlin, Germany; Limassol, Cyprus; London, United Kingdom; Munich, Germany; Paphos, Cyprus; Prague, Czech Republic; Remote, Germany; Warsaw, Poland; Yerevan, Armenia

At JetBrains, we live and breathe code. Since our inception in 2000, our mission has been to create the world's most powerful and efficient developer tools. By automating routine checks and corrections, our products accelerate development processes, empowering developers to innovate and create freely. JetBrains is transitioning from standalone developer tools to an integrated, AI-driven platform for software development. The role of AI has evolved from a simple assistant within the editor to a vital participant in the planning, building, reviewing, and operating of software across teams and organizations. This transformation presents new challenges that cannot be addressed at the individual tool level: governance, security, cost management, observability, and synchronized collaboration between humans and autonomous agents. Our ambition is to develop a platform that facilitates the adoption of AI in software development in a structured, scalable, and cost-effective manner, without confining companies to closed ecosystems. This platform will act as the execution and governance layer for AI-powered development, seamlessly integrated with developer tools while functioning across teams, products, and environments. We are in search of a seasoned ML leader who has a proven track record of developing products with an ML foundation, harmonizing research, technical excellence, and a strong focus on product.

Feb 19, 2026
