Research Engineer, Data Infrastructure
Classified Tasks (19)
Augment (16)
AI assists, human decides
Build specialized compute fabrics to power model development and training workloads
technical
Build specialized data fabrics to support large-scale model training and fine-tuning
technical
Design and build data lakes and metadata systems architected for exabyte scale
technical
Develop a high-performance training platform for on-premises and cloud-native Kubernetes environments
technical
Migrate legacy scheduling systems to modern orchestration frameworks
technical
Architect and maintain multi-cluster orchestration layers to optimize workload placement across diverse hardware and regions
technical
Implement cloud-bursting capabilities to utilize global resources across clusters and regions
technical
Provision and configure distributed compute clusters to give researchers seamless access to compute resources
operational
Design and implement decoupled control and data plane architectures for scalable systems
technical
Scale distributed compute and storage systems to meet operational capacity and performance goals
operational
Develop and maintain the internal training platform to enable model training and fine-tuning across Kubernetes and SLURM environments
technical
Implement production-grade data and training pipelines for model development workflows
technical
Implement and manage metadata and lineage systems to provide visibility and traceability across data and model pipelines
technical
Design and operate modern cloud-native deployment workflows to ensure platform scalability, reliability, and efficiency
operational
Enforce secure and governed data access controls for MLOps and research use cases
operational
Operate and manage large distributed compute fleets in production
operational
Human-Only (3)
Requires human judgment
Architect the backbone of the model training and fine-tuning infrastructure for frontier AI development
technical
Architect the transition to modern storage formats to handle large fine-tuning datasets and anticipated exabyte growth
technical
Participate in on-call rotations to support and troubleshoot critical training jobs
operational