Q-learning for MDPsMedium

Q-learning for MDPs

Background

Q-learning is the classic off-policy, value-based RL algorithm. It learns the optimal action-value function Q(s,a)Q(s,a) — the expected return of taking action aa in state ss and acting optimally afterward — without a model of the environment. It explores with an ϵ\epsilon-greedy policy and bootstraps each update from the current best estimate of the next state's value.

Problem statement

Implement q_learning(num_states, num_actions, P, R, terminal_states, alpha, gamma, epsilon, num_episodes). Initialize Q to zeros and, for each episode, start from a random non-terminal state and step until reaching a terminal state, applying:

Q(s,a)+=α(r+γmaxaQ(s,a)[sterminal]Q(s,a))Q(s,a) \mathrel{+}= \alpha\big(r + \gamma\max_{a'}Q(s',a')\cdot[\,s'\notin\text{terminal}\,] - Q(s,a)\big)

Use ϵ\epsilon-greedy action selection (np.random.rand(), np.random.randint) and sample the next state from P[s, a] via np.random.choice.

Input

  • num_states, num_actionsint sizes.
  • P(num_states, num_actions, num_states) transition probabilities.
  • R(num_states, num_actions) rewards.
  • terminal_states — list of terminal state indices.
  • alpha, gamma, epsilon — learning rate, discount, exploration.
  • num_episodesint, training episodes.

Output

An np.ndarray (num_states, num_actions): the learned Q-table.

Examples

np.random.seed(42)
P = [[[0,1],[1,0]], [[1,0],[1,0]]]; R = [[1,0],[0,0]]; terminal = [1]
q_learning(2, 2, P, R, terminal, alpha=0.1, gamma=0.9, epsilon=0.1, num_episodes=10)
-> [[0.651322, 0.052902], [0.0, 0.0]]

Explanation: state 1 is terminal, so its row stays 0. State 0's values grow as updates back up the reward via the transition/reward tables over 10 episodes.

Constraints

  • Terminal-state rows must remain 0 (no actions taken from them).
  • The bootstrap target drops the γmaxQ(s)\gamma\max Q(s') term when ss' is terminal.
  • Use the standard library RNG calls (np.random.*) so a fixed seed reproduces the table.

Notes

  • Q-learning is off-policy: it learns the greedy (optimal) value while behaving with exploration, because the target uses maxaQ\max_{a'}Q, not the action actually taken next.
  • α\alpha controls how fast estimates move; with enough exploration and decaying α\alpha, QQ converges to the optimal QQ^*.
Python
Loading...

This problem ships 4 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Seeded reference matches the expected Q-table
  • Terminal state row stays zero
  • Q-table has the right shape
  • Reward-1 transition to terminal learns toward that reward