Nuvepro - Task Intelligence for the Enterprise
xAI · Engineering · Palo Alto, CA

Member Of Technical Staff - Cloud Infrastructure

Comp: $180,000 – $440,000

Classified Tasks (11)

Automate 0% · Augment 82% · Human-Only 18%

Augment (9)

AI assists, human decides

Design, build, and operate secure, scalable infrastructure for critical US government projects across bare metal, classified cloud, and hybrid cloud environments

technical

Develop and manage training and inference clusters to support large-scale AI workloads

technical

Develop and optimize software to provision and manage infrastructure across on-premise, virtual machine, and classified cloud environments to enable efficient scaling for government initiatives

technical

Design, configure, and maintain Kubernetes stacks and GPU hardware (including CNI, CRI, CSI components) to support large-scale AI workloads and meet federal compliance

technical

Enhance infrastructure reliability, performance, and cost-effectiveness for large-scale AI and application workloads in secure, classified settings

operational

Implement and maintain observability and monitoring systems to ensure integrity and availability of critical systems in accordance with federal protocols

technical

Implement and maintain security practices to ensure confidentiality and compliance of critical systems according to federal protocols

operational

Manage storage infrastructure using Infrastructure-as-Code tools such as Pulumi, Terraform, or Ansible with a focus on secure data handling

technical

Operate and maintain highly reliable applications across bare metal, classified cloud, and hybrid cloud architectures

operational

Human-Only (2)

Requires human judgment

Collaborate with engineers to gather workload requirements and design tailored solutions that meet government-specific needs and compliance standards

communication

Execute incident management processes, conduct postmortems, and define and enforce SLAs and SLOs to drive system reliability while maintaining security and compliance

operational

Job description

ABOUT xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

ABOUT THE ROLE:

We are seeking a highly skilled Senior Infrastructure Engineer to join our US Government Team, focused on designing, building, and operating secure, scalable infrastructure for critical government projects. In this role, you will develop and manage training and inference clusters, as well as highly reliable applications, across bare metal, classified cloud, and hybrid cloud architectures. You will leverage your expertise in Kubernetes and GPU hardware to deliver robust, secure systems that support large-scale AI workloads while meeting stringent federal compliance requirements. This role demands a passion for automation, observability, and ensuring system integrity in a fast-paced, high-security environment.

RESPONSIBILITIES:

Develop and optimize software to provision and manage xAI’s infrastructure across on-premise, virtual machine, and classified cloud environments, enabling efficient scaling for US government initiatives.

Enhance the reliability, performance, and cost-effectiveness of infrastructure to support large-scale AI and application workloads in secure, classified settings.

Collaborate with xAI engineers to understand workload requirements and design tailored solutions that meet government-specific needs and compliance standards.

Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems, adhering to federal protocols.

Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible, with a focus on secure data handling.

Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs, while maintaining security and compliance.

This is an in-person role based in Palo Alto, CA or Washington, DC, with up to 50% travel required.

BASIC QUALIFICATIONS:

Active Top Secret (TS) security clearance.

5+ years of experience as an Infrastructure Engineer, Site Reliability Engineer, or similar role, with a focus on building and maintaining reliable, scalable systems, preferably in secure or government environments.

Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible.

Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components.

Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs.

Excellent communication and documentation skills, with the ability to handle sensitive information concisely and accurately.

PREFERRED SKILLS AND EXPERIENCE:
Source: xAI careers · scraped 2026-05-22