Research Scientist, Interpretability at Anthropic — task breakdown


Classified Tasks (9)

Automate 0% · Augment 89% · Human-Only 11%

Augment (8)

AI assists, human decides

Reverse-engineer trained language models to identify internal mechanisms and components

analytical

Discover mappings between neural network parameters and meaningful, interpretable algorithms

analytical

Develop and apply mechanistic interpretability techniques for transformer-based models

technical

Build and maintain analysis tools ("microscopes") to inspect model internals and behaviors

technical

Analyze neural networks as programs to reverse-engineer algorithmic components and workflows

analytical

Design and implement engineering solutions to scale interpretability methods to large models

technical

Conduct experiments and analyze results to validate mechanistic hypotheses about model behavior

analytical

Publish and communicate research findings through papers, blog posts, and presentations

communication

Human-Only (1)

Requires human judgment

Apply mechanistic insights to improve the safety, steerability, and reliability of AI systems

operational

Job description

About Anthropic

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the role

When you see what modern language models are capable of, do you wonder, "How do these things work? How can we trust them?" The Interpretability team at Anthropic is working to reverse-engineer how trained models work because we believe that a mechanistic understanding is the most robust way to make advanced systems safe. We’re looking for researchers and engineers to join our efforts.

People mean many different things by "interpretability". We're focused on mechanistic interpretability, which aims to discover how neural network parameters map to meaningful algorithms. Some useful analogies might be to think of us as trying to do "biology" or "neuroscience" of neural networks using "microscopes" we build, or as treating neural networks as binary computer programs we're trying to "reverse engineer".

A few places to learn more about our work and team at a high level are this introduction to Interpretability from our research lead, Chris Olah; a discussion of our work on the Hard Fork podcast produced by the New York Times; and this blog post (and accompanying video) sharing more about some of the engineering challenges we’d had to solve to get these results.

Some of our team's notable publications include A Mathematical Framework for Transformer Circuits, In-context Learning and Induction Heads, Toy Models of Superposition, Scaling Monosemanticity, and our Circuits’ Methods and Biology papers. This work builds on ideas from members' work prior to Anthropic such as the original circuits thread, Multimodal Neurons, Activation Atlases, and Building Blocks. We aim

Source: Anthropic careers · scraped 2026-05-22
