DevOps Engineer (SRE)
Hytech
- Kuala Lumpur
- Permanent
- Full-time
- Define, implement, and operate SRE practices, including SLA/SLO/SLI design, availability, connectivity, and disaster recovery strategies
- Lead architecture design and execution for high availability, high concurrency, and large-scale systems (e.g., microservices, service mesh, multi-active/multi-region)
- Drive system observability, security compliance, and cost optimization (e.g., cost allocation and governance)
- Design resilient architectures for mission-critical systems with high availability, elasticity, and fault tolerance
- Build observability platforms using tools such as Datadog, Prometheus, OpenTelemetry, logging systems, and alerting platforms (Flashcat/Nightingale)
- Implement full-stack monitoring across applications, infrastructure, and business metrics to enable precise issue detection
- Establish proactive monitoring systems with alerting, anomaly detection, and automated remediation capabilities
- Lead incident management (P1/P2), including rapid recovery, root cause analysis (RCA), and continuous improvement mechanisms
- Plan and implement platform engineering strategies to improve scalability, availability, and performance
- Build standardized platforms for system reliability, observability, and security while optimizing cost efficiency
- Design and optimize CI/CD pipelines (e.g., GitHub Actions, Jenkins, ArgoCD, Helm) to improve delivery speed and quality
- Establish standards for containerization, middleware, and deployment processes, ensuring scalability, reliability, and high availability
- Resolve system bottlenecks through capacity planning, performance tuning, and reliability improvements
- Collaborate closely with business and engineering teams to embed reliability, observability, scalability, and security into system design
- Lead the definition and implementation of technical standards, security baselines, and quality control mechanisms
- Drive adoption of best practices, tooling standardization, and engineering efficiency improvements
- 5+ years in SRE / DevOps / Platform Engineering or related roles
- Proven experience in designing and operating high-availability, large-scale systems
- Cloud platforms: AWS (EC2, EKS, IAM, S3, VPC, NLB/ALB, RDS, ElastiCache), or equivalent (Azure/GCP)
- Infrastructure as Code: Terraform / CloudFormation
- CI/CD & automation: Jenkins, GitHub Actions, ArgoCD, CodeBuild, Helm
- Containerization: Docker, Kubernetes (K8s)
- Observability: Metrics, Logs, Traces (e.g., Prometheus, OpenTelemetry, Datadog)
- Strong systems thinking and analytical problem-solving skills
- Excellent cross-functional collaboration and communication skills
- Self-driven, with a strong sense of ownership and a continuous-improvement mindset
- Experience in fintech, payments, or high-security environments
- Experience with high-concurrency, low-latency system design
- AI-driven operations (AIOps) or automation experience
- Certifications (e.g., AWS, CKA/CKS)
- Experience with large-scale systems or international project delivery
- Easy access to public transportation (LRT & KTM).
- Transportation allowance.
- Corporate insurance coverage, including dental, optical, and outpatient claims.
- Gym and fitness claims.
- Ongoing training and development opportunities.
- Exposure to exciting projects that support career growth and professional development.