GPU and AI Software Stack System Lead
Advanced Micro Devices
- Pulau Pinang
- Permanent
- Full-time
- Linux systems administration background, with knowledge of modern best practices and a solid understanding of the software stack
- Strong problem-solving and troubleshooting skills
- Eagerness to learn, adapt to new technologies, and stay up-to-date with industry trends
- Customer service mindset for providing support to lab teams
- Detail-oriented: pays close attention to the finer details of systems and processes to identify potential issues and areas for improvement
- Excellent written and verbal communication skills
- Support in-house automation, infrastructure, and validation solutions that scale across multiple sites and geographies.
- Design and implement automated test frameworks, tools, and scripts using Python 3 and Bash for Linux-based environments.
- Develop comprehensive end-to-end, integration, and regression test suites that simulate real developer workflows.
- Respond to and troubleshoot incidents raised by internal users or triggered by infrastructure and CI system alerts.
- Analyze failures from tool regressions and CI job results, perform root-cause analysis, and drive long-term corrective actions.
- Perform postmortem reviews and enhance processes or implement new solutions to prevent recurring outages or failures.
- Support capacity planning, performance tuning, and optimization of automation and validation solutions.
- Strong understanding of Linux, virtualization, Windows, BIOS, OS, and driver interactions.
- Solid grounding in CPU/GPU architecture; exposure to AI/HPC platform design is highly valuable.
- Experience with silicon bring-up, system-level debug, and remote hardware triage.
- Knowledge of best practices for hardware and software validation.
- Background in technical support/operations, including incident response and postmortem analysis.
- Strong troubleshooting fundamentals across networking (OSI model), systems, and infrastructure.
- Familiarity with SRE principles and modern reliability practices.
- Experience designing validation infrastructure, test plans, and automated test cases.
- Skilled in analyzing test failures and producing clear, actionable reports.
- Working knowledge of CI/CD pipelines (Jenkins, GitLab CI, Azure DevOps) and Agile development.
- Proficiency in Python and Bash; familiarity with C/C++ is a plus.
- Experience with Ansible, Git, and Infrastructure-as-Code workflows.
- Basic understanding of Kubernetes, containers, and cloud environments (Azure).
- Hands-on experience with log analysis and monitoring tools (ELK stack, Splunk).
- Understanding of relational databases such as PostgreSQL or MySQL.
- Awareness of AI/ML-driven monitoring or reliability technologies.
- Proactive, able to operate effectively in ambiguous and fast-moving environments.
- Strong communicator with solid documentation skills.
- Highly organized, with the ability to multitask and prioritize effectively.
- Certifications in cloud, Kubernetes, or Agile methodologies are a plus.
- Advocates continuous improvement, better tooling, and innovation.
- Bachelor’s degree or higher in Electrical/Computer Engineering, Electronics, Computer Science, or a related field, with 8 years of experience in SoC validation and debug.