Datasets

ExactStep provides synthetic long-range datasets:

  • generate_long_copy_dataset: copy task where targets equal inputs.
  • generate_long_bracket_depth_dataset: bracket nesting depth with distractors.
  • generate_long_reverse_dataset: reverse the token sequence along time.
  • generate_running_sum_dataset: running sum of digit tokens (optionally modulo m).
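
The target rules listed above can be sketched on a toy sequence in plain Python. This is only an illustration of what each task asks a model to predict, not ExactStep's implementation: the helper names are hypothetical, and the bracket-depth rule (depth after reading each token, with non-bracket tokens treated as distractors) is an assumption about how that dataset labels positions.

```python
# Illustrative only: plain-Python versions of the four target rules.
# These helper names are hypothetical and are not part of ExactStep.

def copy_target(tokens):
    # Copy task: the target sequence equals the input sequence.
    return list(tokens)

def reverse_target(tokens):
    # Reverse task: the target is the input read backwards along time.
    return list(reversed(tokens))

def running_sum_target(digits, modulo=None):
    # Running sum: the target at step t is the sum of digits[0..t],
    # optionally reduced modulo `modulo`.
    total, out = 0, []
    for d in digits:
        total += d
        out.append(total % modulo if modulo is not None else total)
    return out

def bracket_depth_target(tokens, open_tok="(", close_tok=")"):
    # Assumption: the target at each position is the nesting depth after
    # reading that token; any other token is a distractor and leaves the
    # depth unchanged.
    depth, out = 0, []
    for t in tokens:
        if t == open_tok:
            depth += 1
        elif t == close_tok:
            depth -= 1
        out.append(depth)
    return out

seq = [3, 1, 4, 1, 5]
copied = copy_target(seq)                       # [3, 1, 4, 1, 5]
reversed_seq = reverse_target(seq)              # [5, 1, 4, 1, 3]
sums = running_sum_target(seq)                  # [3, 4, 8, 9, 14]
sums_mod7 = running_sum_target(seq, modulo=7)   # [3, 4, 1, 2, 0]
depths = bracket_depth_target(["(", "a", "(", ")", "b", ")"])  # [1, 1, 2, 1, 1, 0]
```

The running sum and the bracket depth both depend on every earlier token, which is what makes these tasks long-range.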

All four return (X, Y, meta), with sequences padded to a common length.
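
As a rough sketch of what padding to a common length means, the snippet below right-pads variable-length token lists with a pad id. The pad id of 0, the right-padding direction, and the idea of keeping the original lengths alongside the padded batch are assumptions made for illustration; the page does not state which pad token ExactStep uses or what meta contains.

```python
# Illustrative only: a hypothetical helper, not part of ExactStep.

def pad_to_common_length(sequences, pad_id=0):
    # Find the longest sequence, then right-pad every sequence to that length.
    max_len = max((len(s) for s in sequences), default=0)
    padded = [list(s) + [pad_id] * (max_len - len(s)) for s in sequences]
    # Keep the true lengths so padded positions can be masked out later.
    lengths = [len(s) for s in sequences]
    return padded, lengths

batch = [[5, 2, 9], [7], [1, 4]]
padded, lengths = pad_to_common_length(batch)
# padded  -> [[5, 2, 9], [7, 0, 0], [1, 4, 0]]
# lengths -> [3, 1, 2]
```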

Copy dataset example:

from transformer_instant import generate_long_copy_dataset
X, Y, meta = generate_long_copy_dataset(num_sequences=256, max_len=1024, vocab_size=64, seed=0)

Reverse dataset example:

from transformer_instant import generate_long_reverse_dataset
X, Y, meta = generate_long_reverse_dataset(num_sequences=128, max_len=256, vocab_size=32, seed=2)

Running-sum dataset example (with modulo):

from transformer_instant import generate_running_sum_dataset
X, Y, meta = generate_running_sum_dataset(num_sequences=128, max_len=256, vocab_size=10, modulo=7, seed=3)