Infrastructure Engineer — Training Clusters

xAI · Memphis, TN

DevOps/MLOps · Senior · Full-time

$180K-$300K · Posted 1 month ago

About the Role

xAI is hiring an Infrastructure Engineer to manage and scale the massive GPU clusters powering Grok's training. You will work on hardware provisioning, network optimization, storage systems, and reliability engineering for one of the largest AI training facilities.

Requirements

5+ years of infrastructure or SRE experience
Experience managing large-scale GPU/HPC clusters
Strong Linux systems administration skills
Knowledge of high-speed networking (InfiniBand, RoCE)
Experience with configuration management and automation

Nice to Have

Experience with NVIDIA H100/B200 GPU clusters
Background in data center operations
Familiarity with Slurm or similar job schedulers
Experience with storage systems for ML (Lustre, GPFS)

Benefits

Significant equity
Health and wellness benefits
Relocation to Memphis supported
On-site amenities
Cutting-edge hardware access
Flexible schedule

Skills

Infrastructure · GPU Clusters · Linux · Networking · SRE · Automation
