System Operations Engineer

  • Kulai, Johor
  • Permanent
  • Full-time
  • 8 days ago
Are you passionate about data center operations and cutting-edge AI infrastructure? We're looking for a DC System Operations Engineer to help us power the backbone of next-gen GPU clusters in our state-of-the-art AI Cloud facility. In this role, you'll be on the front line of maintaining the stability, performance, and security of our high-performance computing systems. From hands-on hardware replacement to system diagnostics and supporting GPU-based workloads, you'll be key in supporting the infrastructure behind advanced AI development.

Key Responsibilities

  • Oversee daily operations of GPU clusters and critical data center systems.
  • Perform preventative maintenance and hardware diagnostics for GPU, CPU, and storage components.
  • Monitor systems using tools like Prometheus and Grafana.
  • Collaborate with cross-functional teams to support scalable AI infrastructure.
  • Maintain documentation, enforce security standards, and troubleshoot issues.

Who We're Looking For

  • Minimum 2 years of experience in system operations, data centers, or cloud infrastructure.
  • Strong understanding of Linux fundamentals, Kubernetes environments, and server hardware.
  • Comfortable with hands-on IT hardware replacement and diagnostics.
  • Familiarity with monitoring tools and basic networking concepts.
  • Advantage: Experience with GPU servers, NVIDIA GPUs, high-performance computing, or bare metal infrastructure.

Preferred Background

  • Degree in Computer Science, Information Technology, Electrical Engineering, or equivalent experience.

Why Join Us

  • Be part of a high-growth AI infrastructure initiative under YTL.
  • Work in a fast-paced, forward-looking environment with state-of-the-art GPU clusters.
  • Opportunities for growth, upskilling, and cutting-edge tech exposure.

foundit