Steerability Mapping in Persona-Vector Space: Single-Trait and Pairwise Analysis Across Open-Weight LLMs

Abstract

Persona vectors offer a direct way to intervene on language-model behavior, but not every trait behaves like a useful control. Some traits are already strongly expressed by default, others resist intervention, and others interact unpredictably when combined. We treat this variation as informative, asking which traits are actually controllable across open-weight models and what the resulting pattern reveals. We examine 53 cross-domain traits in two open-weight models, including all pairwise interactions among a subset of 19. Controllability is strongly shaped by domain and model: competence-oriented traits tend to be defaults already, while exaggerated or undesirable styles are the most easily amplified. A targeted case study on harmful behavior further shows that nearby stylistic traits are poor proxies for harmful intent, though a direction transferred from a related fine-tuned model can still influence the base model. Together, these results argue that persona vectors are useful not only as controls but as probes of which behaviors a model exposes, hides, or resists.

Term

Spring 2026

Date

April 17, 2026

Time

3:00 - 4:00 PM

Location

White Hall 100
