A Behavioural and Representational Evaluation of Goal-directedness in Language Model Agents
We develop cognitive map probes that recover an agent's approximate beliefs about its environment, and use them to prototype white-box evaluations that explain suboptimal actions in light of the agent's imperfect beliefs in simple 2D gridworlds.
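
To make the idea of a belief-conditioned evaluation concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method used in this work: the believed map is written by hand rather than decoded from model activations by a probe, and the names `distances_to`, `optimal_actions`, `explain_action`, and the three labels are hypothetical. The sketch labels an action as optimal under the true map, as suboptimal but optimal under the agent's believed map, or as unexplained by either.

```python
from collections import deque

# Four-connected moves as (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def distances_to(grid, goal):
    """Breadth-first distances from every reachable free cell to the goal.

    `grid` is a list of equal-length strings; '#' marks a wall.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES.values():
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist


def optimal_actions(grid, pos, goal):
    """Actions from `pos` that move one step closer to the goal on `grid`."""
    dist = distances_to(grid, goal)
    if pos not in dist:
        return set()
    best = set()
    for name, (dr, dc) in MOVES.items():
        nxt = (pos[0] + dr, pos[1] + dc)
        if dist.get(nxt) == dist[pos] - 1:
            best.add(name)
    return best


def explain_action(true_grid, believed_grid, pos, goal, action):
    """Hypothetical labelling of one action against the true and believed maps."""
    if action in optimal_actions(true_grid, pos, goal):
        return "optimal"
    if action in optimal_actions(believed_grid, pos, goal):
        return "belief-consistent"  # suboptimal in truth, optimal under the belief
    return "unexplained"


# Assumption: in practice the believed map would come from a probe on activations.
TRUE_GRID = ["...", "...", "..."]
BELIEVED_GRID = ["...", ".#.", "..."]  # the agent wrongly believes in a central wall

print(explain_action(TRUE_GRID, BELIEVED_GRID, (1, 0), (1, 2), "up"))
# prints "belief-consistent"
```

In this toy case the agent detours upward around a wall that does not exist. The move is suboptimal on the true map, but it is a shortest-path move on the believed map, so the sketch attributes it to an imperfect belief rather than to a lack of goal-directedness.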