About the job
At Tensorwave Cloud, our goal is to develop a seamless, secure, and robust AI infrastructure that scales efficiently, breaking down obstacles and redefining norms to empower innovators and foster AI advancements.
About the Role
We are on the lookout for a dedicated Infrastructure Operations Engineer to become a vital part of our expanding infrastructure team. This position is perfect for individuals who excel in hardware-focused settings, take pleasure in hands-on data center and system administration tasks, and possess the ability to create reliable automation for large-scale infrastructure.
In this role, you will manage enterprise hardware, monitoring systems, network operations, and infrastructure automation, and support our compute clusters across multiple data centers. The position spans the full infrastructure stack, from bare-metal provisioning through OS and Kubernetes management to hardware monitoring and troubleshooting.
If you are meticulous, resourceful, and comfortable working with both low-level hardware systems and advanced DevOps tools, we would love to hear from you.
Responsibilities
Oversee and maintain enterprise-grade server hardware, including diagnostics and repairs for CPUs, memory, disks, PSUs, and NICs.
Use out-of-band management systems (iLO, iDRAC, IPMI, Redfish) for remote access and recovery.
Design, build, and maintain infrastructure monitoring and alerting systems using Prometheus, Grafana, SNMP, or similar tools.
Administer and troubleshoot Linux systems, including OS installation, boot issues, services, networking, filesystems, and access controls.
Manage bare-metal provisioning workflows, including PXE/UEFI boot and automated node bring-up using MAAS, Foreman, or equivalent systems.
Develop and maintain infrastructure automation using shell scripting and CLI tools to enhance reliability and scale operations.
Manage core networking components, including subnets, IP address management, VLANs, routing, NAT, and firewall configurations.
Configure and support secure connectivity solutions such as VPNs (IPsec, WireGuard, OpenVPN).
Support Kubernetes clusters at the infrastructure level, including node lifecycle management, access, basic troubleshooting, and scaling.
Collaborate with internal teams to ensure compute clusters remain reliable, secure, and scalable across multiple data centers.
