Sequence padding & masking (Easy)

Background

Models that train in batches need every sequence in a batch to have the same length, but real text and event sequences vary in length. The fix is padding: extend short sequences with a dummy value up to a common length, and truncate the ones that are too long. A parallel mask marks which positions are real (1) and which are padding (0), so that downstream layers (losses, RNN steps, attention) can ignore the filler.

Problem statement

Implement pad_and_mask(sequences, max_len=None, pad_value=0) that post-pads/truncates a list of integer sequences to a common length and returns (padded, mask):

  • If max_len is None, use the length of the longest sequence.
  • Sequences longer than max_len are truncated from the end (keep the first max_len tokens).
  • Padding is appended at the end (post-padding) with pad_value.
  • mask[i, t] = 1 if position t is a real token of sequence i, else 0.
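
The four rules above can be sketched as follows. This is one possible implementation, not the official reference solution. The int64 dtype for both arrays and the choice to treat an empty input list as max_len = 0 are assumptions, since the problem statement does not specify either.

```python
import numpy as np

def pad_and_mask(sequences, max_len=None, pad_value=0):
    """Post-pad/truncate integer sequences; return (padded, mask)."""
    # Default target length: the longest sequence.
    # Assumption: an empty list of sequences yields max_len = 0.
    if max_len is None:
        max_len = max((len(seq) for seq in sequences), default=0)

    n = len(sequences)
    # Start from all-padding rows and an all-zero mask, then fill in real tokens.
    # Assumption: int64 for both outputs (the problem only says "int" and {0, 1}).
    padded = np.full((n, max_len), pad_value, dtype=np.int64)
    mask = np.zeros((n, max_len), dtype=np.int64)

    for i, seq in enumerate(sequences):
        k = min(len(seq), max_len)   # number of real tokens kept in row i
        padded[i, :k] = seq[:k]      # keep the first k tokens (post-truncation)
        mask[i, :k] = 1              # mark those k leading positions as real
    return padded, mask
```

With this sketch, pad_and_mask([[1, 2, 3], [4]], max_len=2, pad_value=-1) gives padded = [[1, 2], [4, -1]] and mask = [[1, 1], [1, 0]]: the first row is truncated to its first two tokens and the second row is filled with -1.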

Input

  • sequences: list of lists/arrays of int tokens (variable length).
  • max_len: int or None; the target length.
  • pad_value: int used to fill padded positions.

Output

A tuple (padded, mask):

  • padded: np.ndarray of shape (num_sequences, max_len).
  • mask: np.ndarray of shape (num_sequences, max_len) with values in {0, 1}.

Examples

Example 1

Input:  sequences = [[1, 2, 3], [4, 5], [6]], max_len = None
Output: padded = [[1, 2, 3],
                  [4, 5, 0],
                  [6, 0, 0]]
        mask   = [[1, 1, 1],
                  [1, 1, 0],
                  [1, 0, 0]]

Explanation: the longest sequence has length 3, so all rows are padded to length 3 with pad_value=0. The mask is 0 at the three padded cells: one in row 2 and two in row 3.

Constraints

  • Post-padding and post-truncating (operate on the tail).
  • The mask must be 1 exactly at the min(len(seq), max_len) leading positions of each row.
  • Output arrays both have shape (num_sequences, max_len).
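
These constraints can be checked mechanically. The helper below is a hypothetical self-check (the name check_pad_and_mask and its signature are not part of the problem); it takes an already computed (padded, mask) pair together with the resolved max_len and asserts each constraint row by row.

```python
import numpy as np

def check_pad_and_mask(sequences, padded, mask, max_len, pad_value=0):
    """Hypothetical helper: assert the constraints for a given output."""
    padded, mask = np.asarray(padded), np.asarray(mask)
    n = len(sequences)

    # Both outputs must have shape (num_sequences, max_len).
    assert padded.shape == (n, max_len) and mask.shape == (n, max_len)

    for i, seq in enumerate(sequences):
        k = min(len(seq), max_len)  # clamped original length
        # Mask is 1 exactly at the k leading positions and 0 afterwards.
        assert mask[i, :k].all() and not mask[i, k:].any()
        # The kept tokens are the first k of the sequence (post-truncation).
        assert list(padded[i, :k]) == list(seq[:k])
        # Everything after them is filler (post-padding).
        assert (padded[i, k:] == pad_value).all()
    return True
```

Because the mask holds only 0s and 1s in a leading block, the first check also implies that each row of the mask sums to min(len(seq), max_len).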

Notes

  • The mask later lets attention set the scores at padded positions to -∞ before the softmax (so they receive zero weight), and lets losses skip padded targets.
  • Some pipelines pre-pad instead (filler at the front). This is common for RNNs, so that the last real token sits at the final time step.
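
To illustrate the first note, here is a minimal NumPy sketch of how a mask might be consumed downstream. The function names masked_softmax and masked_mean are illustrative, not taken from any library, and the shapes are simplified to (num_sequences, max_len); real attention layers typically carry extra head and query dimensions that the mask is broadcast over.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis, with zero weight wherever mask == 0."""
    # Replace scores at padded positions with -inf so that exp(-inf) = 0.
    masked = np.where(np.asarray(mask).astype(bool), scores, -np.inf)
    # Subtract the row max for numerical stability.
    # Assumption: every row has at least one real token; a fully masked
    # row would produce NaN here.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

def masked_mean(per_token_loss, mask):
    """Average a per-token loss over real positions only.

    Assumption: the mask contains at least one 1, so the divisor is nonzero.
    """
    mask = np.asarray(mask)
    return (per_token_loss * mask).sum() / mask.sum()
```

For the mask from Example 1 and all-zero scores, the second row receives weights [0.5, 0.5, 0] and the third row [1, 0, 0], so the padded cells contribute nothing.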

This problem ships 5 hidden tests, which run in the browser via Pyodide:

  • Reference example
  • Explicit max_len truncates the tail
  • Custom pad_value is used for filler
  • Mask row sums equal clamped original lengths
  • Output shapes are (num_sequences, max_len)