Site Reliability Engineer — ML Infrastructure

Anthropic · San Francisco, CA

DevOps/MLOps · Senior · Full-time · Remote
$190K-$310K · Posted 2 months ago

About the Role

Anthropic is hiring an SRE to ensure the reliability and scalability of infrastructure powering Claude. You will manage production systems, develop monitoring and alerting, handle incident response, and optimize the performance of large-scale ML serving systems.

Requirements

  • 5+ years of SRE or DevOps experience
  • Strong Linux and networking fundamentals
  • Experience with Kubernetes and cloud platforms (AWS/GCP)
  • Proficiency in Python, Go, or similar languages
  • Experience with monitoring and observability tools

Nice to Have

  • Experience with ML serving infrastructure
  • Background in GPU cluster management
  • Experience with traffic management at scale
  • Familiarity with model serving frameworks

Benefits

  • Meaningful equity
  • Premium healthcare
  • Remote flexibility
  • On-call compensation
  • Learning budget
  • Home office setup

Skills

SRE · Kubernetes · AWS · Python · Monitoring · Incident Response