Q-learning for MDPs
Background
Q-learning is the classic off-policy, value-based RL algorithm. It learns the optimal action-value function — the expected return of taking action in state and acting optimally afterward — without a model of the environment. It explores with an -greedy policy and bootstraps each update from the current best estimate of the next state's value.
Problem statement
Implement q_learning(num_states, num_actions, P, R, terminal_states, alpha, gamma, epsilon, num_episodes). Initialize Q to zeros and, for each episode, start from a random non-terminal state and step until reaching a terminal state, applying:
Use -greedy action selection (np.random.rand(), np.random.randint) and sample the next state from P[s, a] via np.random.choice.
Input
num_states,num_actions—intsizes.P—(num_states, num_actions, num_states)transition probabilities.R—(num_states, num_actions)rewards.terminal_states— list of terminal state indices.alpha,gamma,epsilon— learning rate, discount, exploration.num_episodes—int, training episodes.
Output
An np.ndarray (num_states, num_actions): the learned Q-table.
Examples
np.random.seed(42)
P = [[[0,1],[1,0]], [[1,0],[1,0]]]; R = [[1,0],[0,0]]; terminal = [1]
q_learning(2, 2, P, R, terminal, alpha=0.1, gamma=0.9, epsilon=0.1, num_episodes=10)
-> [[0.651322, 0.052902], [0.0, 0.0]]
Explanation: state 1 is terminal, so its row stays 0. State 0's values grow as updates back up the reward via the transition/reward tables over 10 episodes.
Constraints
- Terminal-state rows must remain 0 (no actions taken from them).
- The bootstrap target drops the term when is terminal.
- Use the standard library RNG calls (
np.random.*) so a fixed seed reproduces the table.
Notes
- Q-learning is off-policy: it learns the greedy (optimal) value while behaving with exploration, because the target uses , not the action actually taken next.
- controls how fast estimates move; with enough exploration and decaying , converges to the optimal .
This problem ships 4 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Seeded reference matches the expected Q-table
- •Terminal state row stays zero
- •Q-table has the right shape
- •Reward-1 transition to terminal learns toward that reward