Anthropic · AI Research & Engineering · San Francisco, CA
Research Scientist, Interpretability
Classified Tasks (9)
Automate 0% · Augment 89% · Human-Only 11%
Augment (8)
AI assists, human decides
Reverse-engineer trained language models to identify internal mechanisms and components
analytical
Discover mappings between neural network parameters and meaningful, interpretable algorithms
analytical
Develop and apply mechanistic interpretability techniques for transformer-based models
technical
Build and maintain analysis tools ("microscopes") to inspect model internals and behaviors
technical
Analyze neural networks as programs to reverse-engineer algorithmic components and workflows
analytical
Design and implement engineering solutions to scale interpretability methods to large models
technical
Conduct experiments and analyze results to validate mechanistic hypotheses about model behavior
analytical
Publish and communicate research findings through papers, blog posts, and presentations
communication
Human-Only (1)
Requires human judgment
Apply mechanistic insights to improve the safety, steerability, and reliability of AI systems
operational
Job description
About Anthropic
Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.
About the role:
When you see what modern language models are capable of, do you wonder, "How do these things work? How can we trust them?" The Interpretability team at Anthropic is working to reverse-engineer how trained models work because we believe that a mechanistic understanding is the most robust way to make advanced systems safe. We’re looking for researchers and engineers to join our efforts.
People mean many different things by "interpretability". We're focused on mechanistic interpretability, which aims to discover how neural network parameters map to meaningful algorithms. Some useful analogies might be to think of us as trying to do "biology" or "neuroscience" of neural networks using “microscopes” we build, or as treating neural networks as binary computer programs we're trying to "reverse engineer".
A few places to learn more about our work and team at a high level are this introduction to Interpretability from our research lead, Chris Olah; a discussion of our work on the Hard Fork podcast produced by the New York Times; and this blog post (and accompanying video) sharing more about some of the engineering challenges we’d had to solve to get these results.
Some of our team's notable publications include A Mathematical Framework for Transformer Circuits, In-context Learning and Induction Heads, Toy Models of Superposition, Scaling Monosemanticity, and our Circuits’ Methods and Biology papers. This work builds on ideas from members' work prior to Anthropic such as the original circuits thread, Multimodal Neurons, Activation Atlases, and Building Blocks. We aim