Active Inference Demo: T-Maze Environment

Author: Conor Heins

This demo notebook provides a full walk-through of active inference using the Agent() class of pymdp. The canonical example used here is the ‘T-maze’ task, often used in the active inference literature in discussions of epistemic behavior (see, for example, “Active Inference and Epistemic Value”)

Note: When running this notebook in Google Colab, you may have to run !pip install inferactively-pymdp at the top of the notebook, before you can import pymdp. That cell is left commented out below, in case you are running this notebook from Google Colab.

# ! pip install inferactively-pymdp

Imports

First, import pymdp and the modules we’ll need.

import os
import sys
import pathlib
import numpy as np
import copy

from pymdp.agent import Agent
from pymdp.utils import plot_beliefs, plot_likelihood
from pymdp import utils
from pymdp.envs import TMazeEnv

Auxiliary Functions

Define some utility functions that will be helpful for plotting.

Environment

Here we consider an agent navigating a three-armed ‘T-maze,’ with the agent starting in a central location of the maze. The bottom arm of the maze contains an informative cue, which signals in which of the two top arms (‘Left’ or ‘Right’, the ends of the ‘T’) a reward is likely to be found.

At each timestep, the environment is described by the joint occurrence of two qualitatively-different ‘kinds’ of states (hereafter referred to as hidden state factors). These hidden state factors are independent of one another.

The first hidden state factor (Location) is a discrete random variable with 4 levels, that encodes the current position of the agent. Each of its four levels can be mapped to the following values: {CENTER, RIGHT ARM, LEFT ARM, or CUE LOCATION}. The random variable Location taking a particular value can be represented as a one-hot vector with a 1 at the appropriate level, and 0’s everywhere else. For example, if the agent is in the CUE LOCATION, the current state of this factor would be $s_1 = \begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix}$.

We represent the second hidden state factor (Reward Condition) is a discrete random variable with 2 levels, that encodes the reward condition of the trial: {Reward on Right, or Reward on Left}. A trial where the condition is reward is Reward on Left is thus encoded as the state $s_2 = \begin{bmatrix} 0 & 1\end{bmatrix}$.

The environment is designed such that when the agent is located in the RIGHT ARM and the reward condition is Reward on Right, the agent has a specified probability $a$ (where $a > 0.5$) of receiving a reward, and a low probability $b = 1 - a$ of receiving a ‘loss’ (we can think of this as an aversive or unpreferred stimulus). If the agent is in the LEFT ARM for the same reward condition, the reward probabilities are swapped, and the agent experiences loss with probability $a$, and reward with lower probability $b = 1 - a$. These reward contingencies are intuitively swapped for the Reward on Left condition.

For instance, we can encode the state of the environment at the first time step in a Reward on Right trial with the following pair of hidden state vectors: $s_1 = \begin{bmatrix} 1 & 0 & 0 & 0\end{bmatrix}$, $s_2 = \begin{bmatrix} 1 & 0\end{bmatrix}$, where we assume the agent starts sitting in the central location. If the agent moved to the right arm, then the corresponding hidden state vectors would now be $s_1 = \begin{bmatrix} 0 & 1 & 0 & 0\end{bmatrix}$, $s_2 = \begin{bmatrix} 1 & 0\end{bmatrix}$. This highlights the independence of the two hidden state factors. In other words, the location of the agent ($s_1$) can change without affecting the identity of the reward condition ($s_2$).

Initialize environment

Now we can initialize the T-maze environment using the built-in TMazeEnv class from the pymdp.envs module.

Choose reward probabilities $a$ and $b$, where $a$ and $b$ are the probabilities of reward / loss in the ‘correct’ arm, and the probabilities of loss / reward in the ‘incorrect’ arm. Which arm counts as ‘correct’ vs. ‘incorrect’ depends on the reward condition (state of the 2nd hidden state factor).

reward_probabilities = [0.98, 0.02] # probabilities used in the original SPM T-maze demo

Initialize an instance of the T-maze environment

env = TMazeEnv(reward_probs = reward_probabilities)

Structure of the state –> outcome mapping

We can ‘peer into’ the rules encoded by the environment (also known as the generative process ) by looking at the probability distributions that map from hidden states to observations. Following the SPM version of active inference, we refer to this collection of probabilistic relationships as the A array. In the case of the true rules of the environment, we refer to this array as A_gp (where the suffix _gp denotes the generative process).

It is worth outlining what constitute the agent’s observations in this task. In this T-maze demo, we have three sensory channels or observation modalities: Location, Reward, and Cue.

The Location observation values are identical to the Location hidden state values. In this case, the agent always unambiguously observes its own state - if the agent is in RIGHT ARM, it receives a RIGHT ARM observation in the corresponding modality. This might be analogized to a ‘proprioceptive’ sense of one’s own place.
The Reward observation modality assumes the values No Reward, Reward or Loss. The No Reward (index 0) observation is observed whenever the agent isn’t occupying one of the two T-maze arms (the right or left arms). The Reward (index 1) and Loss (index 2) observations are observed in the right and left arms of the T-maze, with associated probabilities that depend on the reward condition (i.e. on the value of the second hidden state factor).
The Cue observation modality assumes the values Cue Right, Cue Left. This observation unambiguously signals the reward condition of the trial, and therefore in which arm the Reward observation is more probable. When the agent occupies the other arms, the Cue observation will be Cue Right or Cue Left with equal probability. However (as we’ll see below when we intialise the agent), the agent’s beliefs about the likelihood mapping render these observations uninformative and irrelevant to state inference.

In pymdp, we store the set of probability distributions encoding the conditional probabilities of observations, under different configurations of hidden states, as a set of matrices referred to as the likelihood mapping or A array (this is a convention borrowed from SPM). The likelihood mapping for a single modality is stored as a single multidimensional array (a numpy.ndarray) A[m] with the larger likelihood array, where m is the index of the corresponding modality. Each modality-specific A array has num_obs[m] rows, and as many lagging dimensions (e.g. columns, ‘slices’ and higher-order dimensions) as there are hidden state factors. num_obs[m] tells you the number of observation values (also known as “levels”) for observation modality m.

# here, we can get the likelihood mapping directly from the environmental class. So this is the likelihood mapping that truly describes the relatinoship between the 
# environment's hidden state and the observations the agent will get

A_gp = env.get_likelihood_dist()

plot_likelihood(A_gp[1][:,:,0],'Reward Right')

../_images/2815e73c85253099317a1ae2b8ac86b7833c95a5902a4cba07154a00b7a0a1c7.png

plot_likelihood(A_gp[1][:,:,1],'Reward Left')

../_images/2e8a8b0725177da090a239afc7a7f971288b2e58bb0dc7ca00500df79d2d5e5d.png

plot_likelihood(A_gp[2][:,3,:],'Cue Mapping')

../_images/62b4921b4ff02c3a52cd27337c7576c24a3fd6a59b65787fc34ed5b46249e9df.png

Transition Dynamics

We represent the dynamics of the environment (e.g. changes in the location of the agent and changes to the reward condition) as conditional probability distributions that encode the likelihood of transitions between the states of a given hidden state factor. These distributions are collected into the so-called B array, also known as transition likelihoods or transition distribution . As with the A array, we denote the true probabilities describing the environmental dynamics as B_gp. Each sub-matrix B_gp[f] of the larger array encodes the transition probabilities between state-values of a given hidden state factor with index f. These matrices encode dynamics as Markovian transition probabilities, such that the entry $i,j$ of a given matrix encodes the probability of transition to state $i$ at time $t+1$, given state $j$ at $t$.

# here, we can get the transition mapping directly from the environmental class. So this is the transition mapping that truly describes the environment's dynamics

B_gp = env.get_transition_dist()

For example, we can inspect the ‘dynamics’ of the Reward Condition factor by indexing into the appropriate sub-matrix of B_gp

plot_likelihood(B_gp[1][:,:,0],'Reward Condition Transitions')

../_images/d6082d5b7d1a30ef91da64d283c6a369e06a9919df3579612891cc4fe802d0bb.png

The above transition array is the ‘trivial’ identity matrix, meaning that the reward condition doesn’t change over time (it’s mapped from whatever it’s current value is to the same value at the next timestep).

(Controllable-) Transition Dynamics

Importantly, some hidden state factors are controllable by the agent, meaning that the probability of being in state $i$ at $t+1$ isn’t merely a function of the state at $t$, but also of actions (or from the agent’s perspective, control states ). So now each transition likelihood encodes conditional probability distributions over states at $t+1$, where the conditioning variables are both the states at $t-1$ and the actions at $t-1$. This extra conditioning on actions is encoded via an optional third dimension to each factor-specific B array.

For example, in our case the first hidden state factor (Location) is under the control of the agent, which means the corresponding transition likelihoods B[0] are index-able by both previous state and action.

plot_likelihood(B_gp[0][:,:,0],'Transition likelihood for "Move to Center"')

../_images/9359f09bc569ebee645ecfef001799d257e914e7cd77a764a75a8d1b6442c8c5.png

plot_likelihood(B_gp[0][:,:,1],'Transition likelihood for "Move to Right Arm"')

../_images/9a3e79192012610b79cde20994c41f2d31babe2099d4d997334d6f7192ec9048.png

plot_likelihood(B_gp[0][:,:,2],'Transition likelihood for "Move to Left Arm"')

../_images/9d93ab81a9f781fb3f8362274c42f00540931bad324ba4288add2fd651f5af3c.png

plot_likelihood(B_gp[0][:,:,3],'Transition likelihood for "Move to Cue Location"')

../_images/2124f88072e9ca08126cc2cb7eca134405aeebc473fb2e6ec87111e1d345386d.png

The generative model

Now we can move onto setting up the generative model of the agent - namely, the agent’s beliefs or assumptions about how hidden states give rise to observations, and how hidden states transition among eachother.

In almost all MDPs, the critical building blocks of this generative model are the agent’s representation of the observation likelihood, which we’ll refer to as A_gm, and its representation of the transition likelihood, which we’ll call B_gm.

Here, we assume the agent has a veridical representation of the rules of the T-maze (namely, how hidden states cause observations) as well as its ability to control its own movements with certain consequences (i.e. ‘noiseless’ transitions). So in other words, the agent will have a veridical representation of the “rules” of the environment, as encoded in the arrays A_gp and B_gp of the generative process.

A_gm = copy.deepcopy(A_gp) # make a copy of the true observation likelihood to initialize the observation model
B_gm = copy.deepcopy(B_gp) # make a copy of the true transition likelihood to initialize the transition model

Important Concept to Note Here!!!

It is not necessary, or even in many cases important , that the generative model is a veridical representation of the generative process. This distinction between generative model (essentially, beliefs entertained by the agent and its interaction with the world) and the generative process (the actual dynamical system ‘out there’ generating sensations) is of crucial importance to the active inference formalism and (in our experience) often overlooked in code.

It is for notational and computational convenience that we encode the generative process using A and B matrices. By doing so, it simply puts the rules of the environment in a data structure that can easily be converted into the Markovian-style conditional distributions useful for encoding the agent’s generative model.

Strictly speaking, however, all the generative process needs to do is generate observations that are ‘digestible’ by the agent, and be ‘perturbable’ by actions issued by the agent. The way in which it does so can be arbitrarily complex, non-linear, and unaccessible by the agent. Namely, it doesn’t have to be encoded by A and B arrays (what amount to Markovian, conditional probability tables), but could be described by arbitrarily complex nonlinear or non-differentiable transformations of hidden states that generate observatons.

Introducing the `Agent()` class

In pymdp, we have abstracted much of the computations required for active inference into the Agent class, a flexible object that can be used to store necessary aspects of the generative model, the agent’s instantaneous observations and actions, and perform action / perception using functions like Agent.infer_states and Agent.infer_policies.

An instance of Agent is straightforwardly initialized with a call to the Agent() constructor with a list of optional arguments.

In our call to Agent(), we need to constrain the default behavior with some of our T-Maze-specific needs. For example, we want to make sure that the agent’s beliefs about transitions are constrained by the fact that it can only control the Location factor - not the Reward Condition (which we assumed stationary across an epoch of time). Therefore we specify this using a list of indices that will be passed as the control_fac_idx argument of the Agent() constructor.

Each element in the list specifies a hidden state factor (in terms of its index) that is controllable by the agent. Hidden state factors whose indices are not in this list are assumed to be uncontrollable.

controllable_indices = [0] # this is a list of the indices of the hidden state factors that are controllable

Now we can construct our agent…

agent = Agent(A=A_gm, B=B_gm, control_fac_idx=controllable_indices)

Now we can inspect properties (and change) of the agent as we see fit. Let’s look at the initial beliefs the agent has about its starting location and reward condition, encoded in the prior over hidden states $P(s)$, known in SPM-lingo as the D array.

plot_beliefs(agent.D[0],"Beliefs about initial location")

../_images/30b44a6469c98702b1d59290f16a6628c017728dcf160ec68a88d45a2bb11aae.png

plot_beliefs(agent.D[1],"Beliefs about reward condition")

../_images/91967d9e56646c5cda2a5a9baf46e70ccea4287a9c5b9c8cbcb89ea5f159a05c.png

Let’s make it so that agent starts with precise and accurate prior beliefs about its starting location.

agent.D[0] = utils.onehot(0, agent.num_states[0])

The onehot(value, dimension) function within pymdp.utils is a nice function for quickly generating a one-hot vector of dimensions dimension with a 1 at the index value.

And now confirm that our agent knows (i.e. has accurate beliefs about) its initial state by visualizing its priors again.

plot_beliefs(agent.D[0],"Beliefs about initial location")

../_images/aa8e9d67b7d95460d7e40d838dc74eab701377b538c32c2880dc294bc2b2a216.png

Another thing we want to do in this case is make sure the agent has a ‘sense’ of reward / loss and thus a motivation to be in the ‘correct’ arm (the arm that maximizes the probability of getting the reward outcome).

We can do this by changing the prior beliefs about observations, the C array (also known as the prior preferences ). This is represented as a collection of distributions over observations for each modality. It is initialized by default to be all 0s. This means agent has no preference for particular outcomes. Since the second modality (index 1 of the C array) is the Reward modality, with the index of the Reward outcome being 1, and that of the Loss outcome being 2, we populate the corresponding entries with values whose relative magnitudes encode the preference for one outcome over another (technically, this is encoded directly in terms of relative log-probabilities).

Our ability to make the agent’s prior beliefs that it tends to observe the outcome with index 1 in the Reward modality, more often than the outcome with index 2, is what makes this modality a Reward modality in the first place – otherwise, it would just be an arbitrary observation with no extrinsic value per se.

agent.C[1][1] = 3.0
agent.C[1][2] = -3.0

plot_beliefs(agent.C[1],"Prior beliefs about observations")

../_images/16bf737ee87338ac9f92cfe0eb841730b51b642bd6f2779458e82e5fababfc4b.png

Active Inference

Now we can start off the T-maze with an initial observation and run active inference via a loop over a desired time interval.

T = 5 # number of timesteps

obs = env.reset() # reset the environment and get an initial observation

# these are useful for displaying read-outs during the loop over time
reward_conditions = ["Right", "Left"]
location_observations = ['CENTER','RIGHT ARM','LEFT ARM','CUE LOCATION']
reward_observations = ['No reward','Reward!','Loss!']
cue_observations = ['Cue Right','Cue Left']
msg = """ === Starting experiment === \n Reward condition: {}, Observation: [{}, {}, {}]"""
print(msg.format(reward_conditions[env.reward_condition], location_observations[obs[0]], reward_observations[obs[1]], cue_observations[obs[2]]))

for t in range(T):
    qx = agent.infer_states(obs)

    q_pi, efe = agent.infer_policies()

    action = agent.sample_action()

    msg = """[Step {}] Action: [Move to {}]"""
    print(msg.format(t, location_observations[int(action[0])]))

    obs = env.step(action)

    msg = """[Step {}] Observation: [{},  {}, {}]"""
    print(msg.format(t, location_observations[obs[0]], reward_observations[obs[1]], cue_observations[obs[2]]))

 === Starting experiment === 
 Reward condition: Right, Observation: [CENTER, No reward, Cue Left]
[Step 0] Action: [Move to CUE LOCATION]
[Step 0] Observation: [CUE LOCATION,  No reward, Cue Right]
[Step 1] Action: [Move to RIGHT ARM]
[Step 1] Observation: [RIGHT ARM,  Reward!, Cue Right]
[Step 2] Action: [Move to RIGHT ARM]
[Step 2] Observation: [RIGHT ARM,  Reward!, Cue Left]
[Step 3] Action: [Move to RIGHT ARM]
[Step 3] Observation: [RIGHT ARM,  Reward!, Cue Right]
[Step 4] Action: [Move to RIGHT ARM]
[Step 4] Observation: [RIGHT ARM,  Reward!, Cue Left]

The agent begins by moving to the CUE LOCATION to resolve its uncertainty about the reward condition - this is because it knows it will get an informative cue in this location, which will signal the true reward condition unambiguously. At the beginning of the next timestep, the agent then uses this observaiton to update its posterior beliefs about states qx[1] to reflect the true reward condition. Having resolved its uncertainty about the reward condition, the agent then moves to RIGHT ARM to maximize utility and continues to do so, given its (correct) beliefs about the reward condition and the mapping between hidden states and reward observations.

Notice, perhaps confusingly, that the agent continues to receive observations in the 3rd modality (i.e. samples from A_gp[2]). These are observations of the form Cue Right or Cue Left. However, these ‘cue’ observations are random and totally umambiguous unless the agent is in the CUE LOCATION - this is reflected by totally entropic distributions in the corresponding columns of A_gp[2] (and the agents beliefs about this ambiguity, reflected in A_gm[2]. See below.

plot_likelihood(A_gp[2][:,:,0],'Cue Observations when condition is Reward on Right, for Different Locations')

../_images/29c26fd88d8634f74148e4fb59512ac81d42a2db5aece3d68b47043be9c4667f.png

plot_likelihood(A_gp[2][:,:,1],'Cue Observations when condition is Reward on Left, for Different Locations')

../_images/2712de0922e5dedbf63ea1ba5e6d4e4b6c040ce756233d4d72a1ca511b5becaf.png

The final column on the right side of these matrices represents the distribution over cue observations, conditioned on the agent being in CUE LOCATION and the appropriate Reward Condition. This demonstrates that cue observations are uninformative / lacking epistemic value for the agent, unless they are in CUE LOCATION.

Now we can inspect the agent’s final beliefs about the reward condition characterizing the ‘trial,’ having undergone 10 timesteps of active inference.

plot_beliefs(qx[1],"Final posterior beliefs about reward condition")

../_images/fdab78064e1887dff6e3ce3f488ce982ae4fedef6865c158a51c899436a6947d.png