Primary value learned value (PVLV)
The primary value learned value (PVLV) model is a possible explanation for the reward-predictive firing properties of dopamine (DA) neurons.[1] It simulates behavioral and neural data on Pavlovian conditioning and the midbrain dopaminergic neurons that fire in proportion to unexpected rewards. It is an alternative to the temporal-differences (TD) algorithm.

PVLV was first described in O'Reilly, Frank, Hazy & Watz, 2007. The current version (described here) differs somewhat from the originally published version, in order to fit better with the known biology, as described in Hazy, Frank & O'Reilly, 2010.

A PVLV model can be created from any existing network using the LeabraWizard -- under the Networks menu.

The key idea is that there are three separate (but interacting) systems that together drive the dopaminergic system to exhibit transient changes in firing that reflect when to learn about important events. An important event here is one that is novel, or reliably associated with either a positive or negative outcome.

Figure: Current PVLV system (as of 8/2007) with biological areas. CNA = central nucleus of the amygdala; VTA = ventral tegmental area; SNc = substantia nigra pars compacta (dopamine areas); LHA = lateral hypothalamus; VS = ventral striatum (patch-like neurons); SC = superior colliculus; CB = cerebellum; CTX = neocortex (sensory areas).
  • PV (Primary Value) -- signals unexpected primary rewards and punishments. The PVe (excitatory) system communicates primary values (actual delivered rewards and punishments), while the PVi (inhibitory) system learns to expect these PVe values and cancels them out (using simple delta-rule or Rescorla-Wagner learning; a sketch of all three update rules follows this list).
    • PV input to DA = PV_{\delta} = PV_e - PV_i
    • PVi learning: \Delta w_i = (PV_e - PV_i) x_i for sending unit x_i
  • LV (Learned Value) -- signals reward/punishment associations of Conditioned Stimuli (CS) when they are first activated, typically in advance of actual primary values. Two subsystems with different learning rates, excitatory (LVe, fast) and inhibitory (LVi, slow), are combined to produce a delta signal that reflects recent changes in reward association against the slowly adapting LVi baseline. Learning is also delta-rule, but occurs only when primary rewards are present or expected (see below).
    • LV input to DA = LV_{\delta} = LV_e - LV_i
    • LV learning: \Delta w_i = (PV_e - LV) x_i only when PV_e is active or expected (see PV_{filter} below)
  • NV (Novelty Value) -- signals stimulus novelty and produces positive dopamine bursts for novel stimuli, which slowly decay in magnitude as a stimulus becomes familiar.
    • NV input to DA = NV_{\delta} = NV
    • NV learning: \Delta w_i = - NV x_i
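
A minimal sketch of these three update rules in Python (the function names, learning-rate values, and use of plain numpy weight vectors are illustrative assumptions, not the actual Leabra implementation):

    import numpy as np

    def pvi_update(w_pvi, x, pv_e, lrate=0.1):
        # PVi learns to cancel PVe: dw_i = lrate * (PV_e - PV_i) * x_i
        pv_i = float(np.dot(w_pvi, x))       # current PVi expectation
        w_pvi += lrate * (pv_e - pv_i) * x   # simple delta rule / Rescorla-Wagner
        return pv_i

    def lv_update(w_lv, x, pv_e, pv_filter, lrate):
        # LVe (fast lrate) and LVi (slow lrate) both learn toward PV_e,
        # but only on trials where primary reward is present or expected.
        lv = float(np.dot(w_lv, x))
        if pv_filter:
            w_lv += lrate * (pv_e - lv) * x
        return lv

    def nv_update(w_nv, x, lrate=0.05):
        # Novelty value decays toward 0 with repeated exposure: dw_i = -lrate * NV * x_i
        nv = float(np.dot(w_nv, x))
        w_nv += -lrate * nv * x
        return nv

Here x is the vector of sending-unit activations, the w_* arrays are the corresponding weight vectors (modified in place), and pv_e is the scalar primary value delivered on the current trial.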

The DA (dopamine) system (VTA and SNc) integrates each of these inputs, using a temporal derivative computation to produce only brief bursts or dips relative to a baseline level of activation (this temporal derivative is the major difference in PVLV from the synaptic depression mechanism used in the earlier published version). The key issue is when to use each of the above values: if primary rewards are present, or expected but absent, then the PV system dominates; otherwise LV + NV drive the dopamine signal.

The challenge is to compute when primary rewards are expected but not present. In the current version, there is a special PVr (reward detection) system that learns just like PVi, but with a slower learning rate for weight decreases than for increases. This allows it to remember when external rewards were presented in the past, even if they are no longer being presented now, which is critical for extinction learning (PVi may go down in value but PVr still represents that a reward is expected, allowing the PV and LV systems to continue to extinguish).

There is a simple threshold applied to the PVr activity to determine if there is a significant reward expectation:

  • PV_{filter} = PVr > .8 or PVr < .2

This condition is used to determine when to train the LV system as well.
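
A sketch of the PVr update and the PV_{filter} test in the same style (the asymmetric learning-rate values are assumptions; only the .8/.2 thresholds come from the condition above):

    import numpy as np

    def pvr_update(w_pvr, x, pv_e, lrate_up=0.1, lrate_down=0.01):
        # PVr learns like PVi, but uses a slower rate for weight decreases,
        # so it keeps signaling "reward expected here" during extinction.
        pv_r = float(np.dot(w_pvr, x))
        err = pv_e - pv_r
        lrate = lrate_up if err > 0 else lrate_down
        w_pvr += lrate * err * x
        return pv_r

    def pv_filter(pv_r):
        # Reward (or punishment) is present or expected when PVr departs
        # clearly from its neutral .5 baseline.
        return pv_r > 0.8 or pv_r < 0.2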

The full DA equation with temporal derivatives (t = current trial, t-1 is previous trial) is:

  • \delta = \left\{ \begin{array}{ll} (PV_{\delta}^t - PV_{\delta}^{t-1}) & \mbox{if} \; PV_{filter} \\ (LV_{\delta}^t - LV_{\delta}^{t-1}) + (NV_{\delta}^t - NV_{\delta}^{t-1}) & \mbox{otherwise} \end{array} \right.
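
The same computation as a small Python function (the argument names are illustrative; pv_filter is the boolean condition defined above):

    def da_delta(pv_d, pv_d_prev, lv_d, lv_d_prev, nv_d, nv_d_prev, pv_filter):
        # Temporal-derivative DA signal: PV deltas dominate whenever primary
        # reward is present or expected; otherwise LV + NV deltas drive DA.
        if pv_filter:
            return pv_d - pv_d_prev
        return (lv_d - lv_d_prev) + (nv_d - nv_d_prev)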

Implementational Details

The PV, LV, NV, etc. values are all represented by ScalarValLayerSpec layers, where the first unit purely reports the scalar value, computed as a weighted average of the remaining units. The other three units represent 0, .5, and 1. The initial bias for all of the values is to represent .5, except for NV, which has a 1.0 bias (and then learns down to 0). Thus, reward is typically represented by 1 (or any value greater than .5 if graded), and punishment by 0.
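
For concreteness, a simplified Python sketch of reading the scalar value off such a layer (this just takes the activation-weighted average of the three value-coding units and is a simplification of the actual ScalarValLayerSpec encoding):

    import numpy as np

    UNIT_VALS = np.array([0.0, 0.5, 1.0])  # values coded by the three value units

    def decode_scalar(acts):
        # acts: activations of the three value-coding units (unit 0, the
        # reporting unit, is excluded). Returns their activation-weighted
        # average, defaulting to the .5 baseline if nothing is active.
        acts = np.asarray(acts, dtype=float)
        total = acts.sum()
        return float(np.dot(acts, UNIT_VALS) / total) if total > 0 else 0.5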

The equations shown above are computed directly on the scalar values in code -- the connections among the PVLV units just serve as pointers that this code uses to track down the appropriate layers to read values from. Although all the computations are simple and could be computed neurally, this more abstract model keeps everything much simpler and more exact.
