Do you want to shape the future of AI infrastructure? Are you ready to define the reliability architecture for AI products, from GPU compute to globally distributed inference, ensuring performance and reliability at scale?
Join the Akamai AI Team
Akamai's Cloud Technology Group offers AI infrastructure globally.
The GPU compute platform provides dedicated resources, from single GPUs to full clusters.
These resources support training, simulation, inference, and various workloads.
Site Reliability Engineering is integrated early to guarantee production-grade reliability and performance.
Partner with the best
As Senior Principal SRE for AI, you will set the technical direction for building, operating, and scaling AI services.
You will also mentor team members, define technical standards, and promote engineering best practices.
Success depends on achieving influence with product engineering teams through exceptional technical expertise.
As a Principal Site Reliability Engineer, you will be responsible for:
Defining the reliability architecture for Akamai's AI compute and platform services, including SLO frameworks, fault tolerance patterns, and capacity planning models
Hands-on building of automation and tooling that reduces operational toil and scales the SRE team's impact
Designing observability strategy by leveraging Akamai's existing platform to build the telemetry, dashboards, alerts, and GPU-specific monitoring needed for AI workloads
Architecting deployment safety practices including progressive rollouts, canary analysis, rollback automation, and change safety processes
Influencing product engineering architecture and design decisions, embedding reliability into the development lifecycle at the system level
Mentoring and elevating other SREs through design reviews, code reviews, and hands-on problem-solving, setting the technical bar for the.