Nuvepro - Task Intelligence for the Enterprise
xAI · Engineering · Memphis, TN

Member of Technical Staff

Classified Tasks (11)

Automate 0% · Augment 82% · Human-Only 18%

Augment (9)

AI assists, human decides

Manage and enhance reliability across a multi-data center environment

operational

Design, develop, and deploy scalable code and services (primarily in Python and Rust) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning

technical

Implement and maintain observability tools and practices, including metrics collection, logging, tracing, and dashboards to provide real-time system health insights across multiple data centers

technical

Build and implement robust observability solutions for mission-critical AI infrastructure

technical

Collaborate with software development, network engineering, site operations, and facility operations to identify reliability bottlenecks and automate solutions for fault tolerance

communication

Automate operational processes to drive efficiency and reduce manual intervention

operational

Optimize system performance and scalability for AI workloads

technical

Ensure seamless operations and minimize downtime for mission-critical AI infrastructure

operational

Implement proactive monitoring and automated remediation to reduce mean time to recovery (MTTR)

technical

Human-Only (2)

Requires human judgment

Partner with critical facilities, mechanical/electrical teams, and data center infrastructure management to address physical infrastructure impacts on reliability

operational

Mitigate downtime and minimize impact to end-users from scheduled and unscheduled maintenance and onsite data center events

operational

Job description

ABOUT xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

ABOUT THE ROLE:

We are seeking a highly skilled Member of Technical Staff to join our team in managing and enhancing reliability across a multi-data center environment. This role focuses on automating processes, building and implementing robust observability solutions, and ensuring seamless operations for mission-critical AI infrastructure. The ideal candidate will combine strong coding abilities with hands-on data center experience to build scalable reliability services, optimize system performance, and minimize downtime, including close partnership with facility operations to address physical infrastructure impacts.

If you thrive in lightning-fast, distributed environments and are passionate about leveraging automation to drive efficiency, this is an opportunity to make a significant impact on our infrastructure's resilience and scalability. In an era where AI workloads demand near-zero downtime, this position plays a pivotal role in bridging software engineering principles with physical data center realities. By prioritizing automation and observability, team members in this role can reduce mean time to recovery (MTTR) by up to 50% through proactive monitoring and automated remediation, based on industry benchmarks from high-scale environments like those at hyperscale cloud providers.

The primary objective of this team is to mitigate downtime and minimize impact to end-users from both scheduled and unscheduled maintenance, as well as events affecting onsite data centers. This is achieved through proactive automation, robust observability, and integrated software-physical reliability strategies, ensuring our AI infrastructure remains resilient, scalable, and at the cutting edge of innovation.

RESPONSIBILITIES:

Design, develop, and deploy scalable code and services (primarily in Python and Rust, with flexibility for emerging languages) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning. We value adaptability to new tools and paradigms in the fast-evolving AI space.

Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers. We are open to innovative stacks beyond traditional ones like ELK.

Collaborate with cross-functional teams, including software development, network engineering, site operations, and facility operations (critical facilities, mechanical/electrical teams, and data center infrastructure management), to identify reliability bottlenecks, automate solutions for fault tolerance
Source: xAI careers · scraped 2026-05-22