Site Reliability Engineer — ML Infrastructure

Anthropic · San Francisco, CA

DevOps/MLOps · Senior · Full-time · Remote

$190K-$310K · Posted 2 months ago

About the Role

Anthropic is hiring an SRE to ensure the reliability and scalability of infrastructure powering Claude. You will manage production systems, develop monitoring and alerting, handle incident response, and optimize the performance of large-scale ML serving systems.

Requirements

5+ years of SRE or DevOps experience
Strong Linux and networking fundamentals
Experience with Kubernetes and cloud platforms (AWS/GCP)
Proficiency in Python, Go, or similar languages
Experience with monitoring and observability tools

Nice to Have

Experience with ML serving infrastructure
Background in GPU cluster management
Experience with traffic management at scale
Familiarity with model serving frameworks

Benefits

Meaningful equity
Premium healthcare
Remote flexibility
On-call compensation
Learning budget
Home office setup

Skills

SRE · Kubernetes · AWS · Python · Monitoring · Incident Response
