API: transformer_instant¶
ExactStep Transformer Toolkit (CPU)¶
Hook-based, model-agnostic utilities to integrate closed-form solves and one-shot updates with Transformer models on CPU. Includes:
- Frozen features → closed-form head (ridge / KRR / ELM)
- One-shot block collapse via linear solves + balanced SVD factorization
- One-shot LoRA via SVD (rank-r update)
- Explicit attention path wiring Q/K/V/O and solving W_O from Z = (A V)
This package reuses SPD Cholesky-backed solves with stable whitening by default.
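A minimal sketch of the first item (frozen features → closed-form head), assuming NumPy inputs are accepted and that `ridge_fit_closed_form` is importable from `transformer_instant.linalg` (the source path listed in its entry below):

```python
import numpy as np
from transformer_instant.linalg import ridge_fit_closed_form  # import path assumed from the source listing

# Stand-ins for hook-captured frozen features and regression targets.
X = np.random.randn(256, 64)   # (n, d) frozen features
Y = np.random.randn(256, 10)   # (n, k) targets

# Closed-form ridge head: W = (X^T X + lambda I)^{-1} X^T Y
W = ridge_fit_closed_form(X, Y, lambda_reg=1e-3)
print(W.shape)  # (64, 10)
```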
BlockCollapseTrainer dataclass¶
Source code in transformer_instant/pipeline.py
solve_linear_block(X, Y)¶
Solve W from X W ≈ Y via ridge.
Source code in transformer_instant/pipeline.py
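A hedged usage sketch, assuming the function is importable from `transformer_instant.pipeline` (per the source listing above), accepts NumPy arrays, and that X and Y hold hook-captured block inputs and outputs:

```python
import numpy as np
from transformer_instant.pipeline import solve_linear_block  # assumed import path

X = np.random.randn(512, 128)  # block input activations (n, d_in)
Y = np.random.randn(512, 128)  # block output activations (n, d_out)

# One-shot collapse of the block into a single linear map with X W ≈ Y.
W = solve_linear_block(X, Y)
print(W.shape)  # expected (128, 128)
```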
apply_lora_update_inplace(W, A, B)¶
Apply W ← W + A @ B in-place on W if shapes are compatible.
Source code in transformer_instant/lora_svd.py
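A small sketch of the in-place contract, assuming NumPy arrays are accepted and the import path follows the source listing above:

```python
import numpy as np
from transformer_instant.lora_svd import apply_lora_update_inplace  # assumed import path

W = np.random.randn(64, 32)
A = np.random.randn(64, 4)   # rank-4 left factor
B = np.random.randn(4, 32)   # rank-4 right factor

W_expected = W + A @ B
apply_lora_update_inplace(W, A, B)   # mutates W: W <- W + A @ B
print(np.allclose(W, W_expected))    # True
```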
compute_qkv_attn(X, W_Q, W_K, W_V, *, mask=None)¶
Compute Q, K, V, A = softmax(QK^T / √d_k), and Z = A V for single-head attention.
Shapes
- X: (n, d_model)
- W_Q: (d_model, d_k), W_K: (d_model, d_k), W_V: (d_model, d_v)
- Returns: Q(n,d_k), K(n,d_k), V(n,d_v), A(n,n), Z(n,d_v)
Source code in transformer_instant/attention_solver.py
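A shape-oriented sketch, assuming NumPy inputs are accepted and the function is importable from `transformer_instant.attention_solver`:

```python
import numpy as np
from transformer_instant.attention_solver import compute_qkv_attn  # assumed import path

n, d_model, d_k, d_v = 16, 32, 8, 8
X = np.random.randn(n, d_model)
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_v)

Q, K, V, A, Z = compute_qkv_attn(X, W_Q, W_K, W_V)
print(A.shape, Z.shape)       # (16, 16) (16, 8)
print(A.sum(axis=1)[:4])      # each row of A should sum to ~1 (softmax over keys)
```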
generate_long_bracket_depth_dataset(num_sequences=512, min_len=64, max_len=512, max_depth=16, seed=1)¶
Bracket-depth task: a stream of brackets and distractor tokens; the target is the nesting depth.
Vocabulary
- 0: EOS / padding
- 1: '('
- 2: ')'
- 3..K: distractors (letters)
We generate sequences with balanced parentheses possibly interleaved with distractors. The target Y[t] is the nesting depth after processing token t.
Source code in transformer_instant/datasets/long_datasets.py
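A quick inspection sketch; it assumes the generator returns an (X, Y) pair like the other generators in this module:

```python
from transformer_instant.datasets.long_datasets import generate_long_bracket_depth_dataset  # assumed import

X, Y = generate_long_bracket_depth_dataset(num_sequences=4, min_len=16, max_len=32, max_depth=4, seed=1)
print(X.shape, Y.shape)   # (4, L) token ids and (4, L) depth targets

# Per the docstring: token 1 = '(', token 2 = ')', and Y[t] is the depth after token t.
print(X[0][:12])
print(Y[0][:12])
```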
generate_long_copy_dataset(num_sequences=512, min_len=64, max_len=512, vocab_size=32, seed=0)¶
Copy task: each target matches the token at the same position.
Returns X (N, L) int64 tokens and Y (N, L) int64 targets where Y[t] = X[t]. An EOS token (0) is added, and tokens are kept >= 1 when vocab_size > 1.
Source code in transformer_instant/datasets/long_datasets.py
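A minimal call, assuming the (X, Y) return described in the docstring:

```python
from transformer_instant.datasets.long_datasets import generate_long_copy_dataset  # assumed import

X, Y = generate_long_copy_dataset(num_sequences=4, min_len=16, max_len=32, vocab_size=8, seed=0)
print(X.shape, X.dtype)   # (4, L), int64

# Documented contract: Y[t] = X[t] for tokens within each sequence.
print(X[0][:10])
print(Y[0][:10])
```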
generate_long_reverse_dataset(num_sequences=512, min_len=64, max_len=512, vocab_size=32, seed=2)¶
Reverse task: targets are the input sequence reversed along time.
Returns X (N, L) int64 tokens and Y (N, L) int64 targets where Y[i, :L] = reverse(X[i, :L]). Positions beyond the sequence length are padded with EOS=0.
Source code in transformer_instant/datasets/long_datasets.py
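A minimal call, again assuming an (X, Y) return:

```python
from transformer_instant.datasets.long_datasets import generate_long_reverse_dataset  # assumed import

X, Y = generate_long_reverse_dataset(num_sequences=4, min_len=16, max_len=32, vocab_size=8, seed=2)
print(X.shape, Y.shape)

# Documented contract: within each sequence Y is X reversed along time;
# positions past the sequence length are padded with EOS = 0.
print(X[0][:8])
print(Y[0][:8])
```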
generate_running_sum_dataset(num_sequences=512, min_len=64, max_len=512, vocab_size=10, modulo=None, seed=3)¶
Running-sum task: tokens are digits in [0, vocab_size-1]; targets are the running sum of the tokens.
If modulo is provided, targets are the running sum reduced modulo that value.
Source code in transformer_instant/datasets/long_datasets.py
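A minimal call with a modulus, assuming an (X, Y) return:

```python
from transformer_instant.datasets.long_datasets import generate_running_sum_dataset  # assumed import

X, Y = generate_running_sum_dataset(num_sequences=4, min_len=16, max_len=32, vocab_size=10, modulo=7, seed=3)
print(X.shape, Y.shape)

# Documented contract: Y[t] is the running sum of the digit tokens,
# here reduced modulo 7 because modulo=7 was passed.
print(X[0][:8])
print(Y[0][:8])
```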
krr_fit(X, Y, lambda_reg=0.001, kernel='rbf', length_scale=1.0, variance=1.0)¶
Fit kernel ridge regression on X, Y. Returns a dict with alpha and the kernel config.
For multi-output Y (n, k), alpha has shape (n, k).
Source code in transformer_instant/utils.py
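A small fitting sketch; the import path and the "alpha" key name are assumptions based on the source listing and docstring above, and NumPy inputs are assumed to be accepted:

```python
import numpy as np
from transformer_instant.utils import krr_fit  # assumed import path

X = np.random.randn(200, 16)
Y = np.random.randn(200, 3)

model = krr_fit(X, Y, lambda_reg=1e-3, kernel="rbf", length_scale=1.0, variance=1.0)
print(model["alpha"].shape)   # (200, 3) for multi-output Y, per the docstring
```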
lora_from_residual_svd(residual, rank, *, scaling=1.0)¶
Return LoRA factors (A, B) s.t. A @ B ≈ residual (rank-r SVD approx).
Uses the truncated SVD residual ≈ U_r Σ_r V_r^T and sets A = U_r Σ_r^{1/2}, B = Σ_r^{1/2} V_r^T, so that A @ B = U_r Σ_r V_r^T. The optional scalar scaling is applied to both factors.
Source code in transformer_instant/lora_svd.py
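A sketch of the stated construction, assuming NumPy inputs are accepted and the factors come back as an (A, B) pair:

```python
import numpy as np
from transformer_instant.lora_svd import lora_from_residual_svd  # assumed import path

# Residual between a target weight and a base weight.
residual = np.random.randn(64, 32)

A, B = lora_from_residual_svd(residual, rank=8)
print(A.shape, B.shape)   # expected (64, 8) and (8, 32)

# A @ B should match the rank-8 truncated SVD U_r Σ_r V_r^T of the residual.
U, s, Vt = np.linalg.svd(residual, full_matrices=False)
print(np.allclose(A @ B, (U[:, :8] * s[:8]) @ Vt[:8]))
```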
ridge_fit_closed_form(X, Y, lambda_reg=0.001)¶
Closed-form ridge solution W = (X^T X + λI)^{-1} X^T Y.
X: (n, d), Y: (n, k) or (n,). Returns W: (d, k) or (d,).
Source code in transformer_instant/linalg.py
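For reference, a plain NumPy transcription of the documented formula (not the library's implementation, which uses Cholesky-backed solves with whitening):

```python
import numpy as np

def ridge_closed_form_reference(X, Y, lambda_reg=1e-3):
    # W = (X^T X + lambda I)^{-1} X^T Y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lambda_reg * np.eye(d), X.T @ Y)

X = np.random.randn(100, 8)
Y = np.random.randn(100, 2)
print(ridge_closed_form_reference(X, Y).shape)   # (8, 2)
```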
solve_attention_output_projection(X, W_Q, W_K, W_V, H_attn_target, *, lambda_reg=0.001, mask=None)¶
Return (Q,K,V,A,W_O) with W_O solving Z W_O ≈ H_attn_target.
This keeps the scaffold hook-friendly and model-agnostic.
Source code in transformer_instant/pipeline.py
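A hedged end-to-end sketch, assuming NumPy inputs are accepted and the import path follows the source listing above:

```python
import numpy as np
from transformer_instant.pipeline import solve_attention_output_projection  # assumed import path

n, d_model, d_k, d_v = 16, 32, 8, 8
X = np.random.randn(n, d_model)
W_Q, W_K = np.random.randn(d_model, d_k), np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_v)
H_target = np.random.randn(n, d_model)   # desired attention-block output

Q, K, V, A, W_O = solve_attention_output_projection(X, W_Q, W_K, W_V, H_target, lambda_reg=1e-3)
print(W_O.shape)   # expected (8, 32), i.e. (d_v, d_model)
```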
solve_output_projection_from_attn(Z, H_attn, *, lambda_reg=0.001)¶
Solve W_O from Z W_O ≈ H_attn via ridge in closed form.
Z: (n, d_v), H_attn: (n, d_model), returns W_O: (d_v, d_model)
Source code in transformer_instant/attention_solver.py
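A two-step sketch pairing this solver with compute_qkv_attn, under the same NumPy and import-path assumptions as above:

```python
import numpy as np
from transformer_instant.attention_solver import compute_qkv_attn, solve_output_projection_from_attn  # assumed imports

n, d_model, d_k, d_v = 16, 32, 8, 8
X = np.random.randn(n, d_model)
Q, K, V, A, Z = compute_qkv_attn(X, np.random.randn(d_model, d_k),
                                 np.random.randn(d_model, d_k),
                                 np.random.randn(d_model, d_v))
H_attn = np.random.randn(n, d_model)   # target output of the attention sublayer

W_O = solve_output_projection_from_attn(Z, H_attn, lambda_reg=1e-3)
print(W_O.shape)   # (8, 32), i.e. (d_v, d_model)
```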
CLI additions¶
The exactstep CLI includes a deep-elm-bench subcommand for sweeping depth/hidden/lambda and saving plots.