Infrastructure Engineer — Training Clusters

xAI · Memphis, TN

DevOps/MLOps · Senior · Full-time
$180K-$300K · Posted 1 month ago

About the Role

xAI is hiring an Infrastructure Engineer to manage and scale the massive GPU clusters powering Grok's training. You will work on hardware provisioning, network optimization, storage systems, and reliability engineering for one of the largest AI training facilities.

Requirements

  • 5+ years of infrastructure or SRE experience
  • Experience managing large-scale GPU/HPC clusters
  • Strong Linux systems administration skills
  • Knowledge of high-speed networking (InfiniBand, RoCE)
  • Experience with configuration management and automation

Nice to Have

  • Experience with NVIDIA H100/B200 GPU clusters
  • Background in data center operations
  • Familiarity with Slurm or similar job schedulers
  • Experience with storage systems for ML (Lustre, GPFS)

Benefits

  • Significant equity
  • Health and wellness benefits
  • Relocation to Memphis supported
  • On-site amenities
  • Cutting-edge hardware access
  • Flexible schedule

Skills

Infrastructure · GPU Clusters · Linux · Networking · SRE · Automation
