Computer Science MS Defense - Caleb Carr

Name: Caleb Carr 

Advisor: Dr. Chris Sims 

Disentangling Representation and Policy Complexity in Reinforcement Learning via Dual Mutual Information Penalties

Abstract: 

Mutual information regularization offers a principled framework for trading task performance against representational complexity in reinforcement learning, yet most treatments apply a single penalty to the full state-action channel, leaving the contributions of perception and action selection entangled. We develop and evaluate a dual mutual-information-regularized actor-critic algorithm in which an encoder penalty and a policy penalty are controlled independently. The approach is motivated by evidence from cognitive science and neuroscience that biological agents operate near optimal reward-complexity frontiers and that dopaminergic signals encode information-processing costs.
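Read literally, this amounts to a return objective carrying two independently weighted information costs. A minimal formalization, in notation introduced here for illustration (the symbols Z, β_enc, and β_pol are ours, not necessarily the thesis'):

```latex
% Sketch of a dual-penalty objective: expected return minus two separately
% weighted information costs, one on the encoder channel S -> Z and one on
% the policy channel Z -> A. Coefficient names are illustrative.
\max_{\theta,\phi}\;
  \mathbb{E}\!\left[\sum_{t} \gamma^{t} r_{t}\right]
  \;-\; \beta_{\mathrm{enc}}\, I(S; Z)
  \;-\; \beta_{\mathrm{pol}}\, I(Z; A)
```

Here Z is the encoder's latent code; the single-penalty baseline described next instead places one penalty on the combined state-action channel I(S; A).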

We first establish a single-penalty baseline that penalizes the full state-action channel in two gridworld environments: an open room and a structurally demanding zoned corridor. Moderate regularization sustains near-optimal reward and full task success in both environments, while extreme regularization degrades performance. Estimated mutual information resists suppression throughout the moderate regime, consistent with the reward-complexity frontier structure predicted by the bounded-rationality literature.
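For concreteness, a minimal sketch of how a single state-action information penalty can be folded into an actor-critic update is shown below. It is written in PyTorch under assumptions made here (discrete actions, a running action marginal, and the standard KL-to-marginal upper bound on I(S; A)); the names and structure are illustrative, not the thesis implementation.

```python
# Minimal single-penalty actor loss sketch (assumed names, not the thesis code).
# The penalty term is a variational upper bound on I(S; A):
# E_s[ KL( pi(.|s) || m(.) ) ], with m a running marginal over actions.
import torch.nn.functional as F

def single_penalty_actor_loss(logits, actions, advantages, action_marginal, beta):
    # logits: [B, A] policy logits; actions: [B] sampled action indices
    # advantages: [B] critic advantages; action_marginal: [A] running marginal m(a)
    log_pi = F.log_softmax(logits, dim=-1)
    chosen = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(chosen * advantages.detach()).mean()          # policy-gradient term
    kl_to_marginal = (log_pi.exp() * (log_pi - action_marginal.log())).sum(-1).mean()
    return pg_loss + beta * kl_to_marginal                    # reward vs. complexity
```

Sweeping the coefficient beta from moderate to extreme values is what traces out the reward-complexity behavior summarized above.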

We then introduce an explicit categorical encoder that maps raw observations to discrete latent codes, and we sweep a 9×9 grid of encoder- and policy-penalty configurations per environment. In both environments, the unregularized baseline spontaneously produces functional encoder representations, and performance degrades primarily at higher mutual information penalties. The decomposition also exposes a cross-penalty coupling invisible to the single-penalty framework: strong policy compression can indirectly elevate encoder information use, while strong encoder compression can suppress policy information by destroying the state signal the policy exploits.
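A rough sketch of the setup this paragraph describes, with a categorical encoder and a grid of penalty coefficients, might look as follows. Everything here (class and function names, the surrogate bounds on the two information terms, and the specific coefficient values) is an assumption for illustration rather than the thesis code.

```python
# Illustrative sketch (assumed names and values, not the thesis code):
# a categorical encoder with discrete latent codes, KL-to-marginal surrogates
# for the two information terms, and a 9x9 sweep over penalty coefficients.
import torch.nn as nn
import torch.nn.functional as F

class CategoricalEncoder(nn.Module):
    def __init__(self, obs_dim: int, n_codes: int):
        super().__init__()
        self.net = nn.Linear(obs_dim, n_codes)

    def forward(self, obs):
        logits = self.net(obs)                                # q(z | s)
        z = F.gumbel_softmax(logits, tau=1.0, hard=True)      # discrete code, straight-through gradient
        return z, F.log_softmax(logits, dim=-1)

def penalty_terms(log_qz, log_pi, z_marginal, a_marginal):
    # Variational upper bounds: I(S;Z) <= E[KL(q(z|s) || m_Z)], I(Z;A) <= E[KL(pi(a|z) || m_A)]
    enc_kl = (log_qz.exp() * (log_qz - z_marginal.log())).sum(-1).mean()
    pol_kl = (log_pi.exp() * (log_pi - a_marginal.log())).sum(-1).mean()
    return enc_kl, pol_kl

# 9x9 grid over (encoder penalty, policy penalty); the values are placeholders.
betas = [0.0, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, 3.0]
configs = [(b_enc, b_pol) for b_enc in betas for b_pol in betas]   # 81 configurations
```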

These results support the theory that dual mutual information penalties provide independent control over two qualitatively distinct stages of the perception-action pipeline. The encoder penalty's primary practical role is to bound encoder informativeness, while the policy penalty governs behavioral stochasticity in a manner consistent with the single-penalty baseline. Together, the two penalties expose structure in the reward-complexity tradeoff that a single-channel penalty cannot resolve.

 

Date
Location: Sage 4112