1712. v0.12.4 performance parity

Ground rule

Port full subsystems / files one by one. No partial slices, no name-only shims, no "patch the gate and move on". When a phase here touches a CPython source file, every function in that file lands in the corresponding gopy package with a // CPython: citation before the phase flips to DONE. The cost of revisiting a half-ported subsystem is always higher than the cost of finishing it the first time. This rule overrides any pressure to ship a row green early.

Why this spec exists

A 10-line pyperformance smoke test run on the v0.12.4 branch shows gopy between 8x and 40x slower than python3.14 on the same .py source. The first warm-up run (see "Current benchmark results" below) puts geomean at ~283x cpython, with three benchmarks failing outright.

That gap is not Go vs C cost. The gap is structural: gopy has shipped most of the performance machinery (specializer at ~3500 LOC under specialize/, tier-2 uops at ~23k LOC under optimizer/, small-int cache, dict split-keys, generator, float, slot tables) but the machinery is either not wired into the eval loop, gated behind a flag nothing flips, or stops short of the dispatch paths the benchmarks actually take.

This spec is the umbrella that drives the audit + wire-up + the remaining ports to the point where gopy clears pyperformance within 1.5x of cpython on geomean, and within 5x on every individual benchmark in the small-subset gate.

2026-05-19 reality-check audit update. Five parallel CPython 3.14-vs-gopy audits (P1, P2, P3/P5/P7, P6/P8/P9/P10/P11, P4/P12/P13/P14/P15) corrected several claims in the original draft of this spec. Highlights:

  • P1 (specializer) is no longer the smoking gun. Cache-cell emission + specialize.Enable wiring + deopt + adaptive tick all landed in commit 67abc0a. The remaining P1 work is closing the per-family emission/dispatch tables (LOAD_ATTR WITH_HINT/METHOD_WITH_VALUES, STORE_ATTR INSTANCE_VALUE/WITH_HINT, CALL BUILTIN_*, FOR_ITER, SEND, LOAD_SUPER_ATTR), plus persisting Code.Quickened through marshal.
  • P2 (tier-2) is gated off, not partially built. The projection/analysis/executor scaffolding is mostly ported (~13.5k LOC under optimizer/, not the ~23k earlier estimate), but interp.JIT is hardcoded false, so no executor ever runs. Of 14 hand-ported uops, only 3 (_LOAD_FAST, _STORE_FAST, _CHECK_VALIDITY) are actual hot-path targets; the remaining 11 are scaffolding (_NOP, _EXIT_TRACE, _JUMP_TO_TOP, etc.). Python/optimizer_bytecodes.c (1107 LOC) is entirely unported, so optimize_uops() is stubbed.
  • P5 (dict) is misdiagnosed. objects/dict.go is already an open-addressed table (entries []dictEntry, order []int), not map[any]any + order slice as the draft claimed. The real gaps are: split-keys saves zero memory, no PyDict_Watch subscription API, no _PyDict_SetItem_KnownHash skip-rehash path.
  • P6.2 (LOAD_FAST_CHECK) is DONE. Shipped via spec 1716 (compile/flowgraph_cfg_locals.go:320-358 rewrites LOAD_FAST → LOAD_FAST_CHECK; vm/eval_dispatch_handwritten.go:63-72 dispatches). Frame pool, LOAD_FAST_BORROW, STORE_FAST_STORE_FAST, args-tuple bypass remain.
  • P11 (CFG optimizer + peephole) is FULLY CLOSED. Shipped via spec 1716 (commits 9d7d9f0 + 37563f5). Jump threading, unreachable-block elimination, redundant-jump removal, constant folding, peephole rewrites all in compile/flowgraph_cfg_passes.go.
  • P12 (generator) is already complete. gopy uses a goroutine + channel model that avoids frame copies entirely. The draft's "per-send frame copy cost" diagnosis was incorrect.
  • P13 (GC) is ~90% done. Tracking machinery, gc.get_objects, gc.get_referrers, gc.get_referents, gc.is_tracked all ported. Gap: gc.set_threshold() doesn't drive collections, and gc.collect() delegates to runtime.GC() rather than driving CPython's gen-0/1/2 logic.

The remaining structural blockers are now:

  1. P2 trace gate. interp.JIT hardcoded false. Until that flips, tier-2 is dead code.
  2. P5 ↔ P1 coherency. Dict watcher hook plumbing exists (DictMutationHook in objects/dict_specialize.go:98-108) but no public subscription API, so the specializer cannot safely invalidate inline caches on dict mutation.
  3. P7 ↔ P1 coherency. Type versionTag exists (objects/type.go:197) but is never automatically invalidated on MRO mutation, __setattr__ on a class, or __bases__ reassignment. Slot tables in objects/slots.go are defined but never pre-populated at type creation; every LookupDescriptor walks the MRO from scratch.
  4. P14 native modules absent. _pickle, _elementtree, _sqlite3 modules are missing; pickle, xml_etree_*, sqlite_synth benches cannot run.
  5. P15 unicode writer absent. Zero of CPython's 13 _PyUnicodeWriter_* functions ported; every f-string, str.format, % formatting allocates intermediate strings.

Goal

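| Bench | cpython 3.14 | gopy target | gopy 2026-05-16 |
| --- | --- | --- | --- |
| pyperformance geomean | 1.0x | <=1.5x | 283x |
| nbody | 1.0x | <=2.0x | N/A (P8) |
| fannkuch | 1.0x | <=2.0x | N/A (P8) |
| richards | 1.0x | <=2.0x | 1899x |
| unpack_sequence | 1.0x | <=2.0x | 254x |
| call_method | 1.0x | <=1.5x | 2407x |
| regex_compile | 1.0x | <=2.0x | 1952x |
| pidigits | 1.0x | <=2.0x | 7.83x |
| json_dumps | 1.0x | <=2.0x | N/A (P9) |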
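
The geomean row and the per-bench ceiling can be checked mechanically. The following is a minimal sketch, assuming the ratios are gopy wall time divided by cpython wall time; the function names geomean and meetsGoal are hypothetical and are not part of bench/. The sample ratios are only the benches in the table that produced a number, so the printed geomean will not equal the full-run 283x figure.

```go
package main

import (
	"fmt"
	"math"
)

// geomean returns the geometric mean of per-benchmark slowdown ratios.
// Non-positive ratios are skipped, on the assumption that a failed run
// has no meaningful ratio.
func geomean(ratios []float64) float64 {
	sum, n := 0.0, 0
	for _, r := range ratios {
		if r <= 0 {
			continue
		}
		sum += math.Log(r)
		n++
	}
	if n == 0 {
		return math.NaN()
	}
	return math.Exp(sum / float64(n))
}

// meetsGoal applies the two umbrella targets stated earlier in the spec:
// geomean within 1.5x and every individual benchmark within 5x.
func meetsGoal(ratios []float64) bool {
	for _, r := range ratios {
		if r > 5.0 {
			return false
		}
	}
	return geomean(ratios) <= 1.5
}

func main() {
	// Subset of the 2026-05-16 column: richards, unpack_sequence,
	// call_method, regex_compile, pidigits.
	ratios := []float64{1899, 254, 2407, 1952, 7.83}
	fmt.Printf("geomean %.0fx, goal met: %v\n", geomean(ratios), meetsGoal(ratios))
}
```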

Benchmark coverage matrix

Each benchmark is unlocked by one or more subsystems below. A bench "unlocked" by P_n means P_n is the principal contributor to closing the gap on that bench; PRs targeting P_n must show the corresponding column in "Current benchmark results" moves.

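| Benchmark | Primary | Secondary | Tertiary |
| --- | --- | --- | --- |
| nbody | P8 (fix) | P10 (float) | P1, P2 |
| fannkuch | P8 (fix) | P1 | P5 |
| richards | P1 (specializer) | P7 (slot cache) | P6 |
| call_method | P1 | P7 | P6 |
| unpack_sequence | P2 (tier-2 uops) | P6 (frame) | P1 |
| regex_compile | P1 | P4 (kind strings) | P15 (str builder) |
| json_dumps | P9 (fix) | P15 (str builder) | P3 |
| pidigits | P3 (long fast path) | P1 | - |
| pyflate | P3 | P10 | P1 |
| raytrace | P10 (float fast) | P1 | P7 |
| scimark_* | P10 | P1 | P2 |
| spectral_norm | P10 | P1 | - |
| float | P10 | - | - |
| generators | P12 (gen fast path) | P6 | - |
| async_tree_* | P12 | P6 | - |
| gc_collect | P13 (GC) | P6 | - |
| pickle | P14 (_pickle) | P3 | P5 |
| unpickle | P14 | P3 | P5 |
| xml_etree | P14 (_elementtree) | P4 | P15 |
| tomli_loads | P15 | P4 | - |
| logging | P15 (str builder) | P1 | - |
| django_template | P15 | P1 | P7 |
| mako | P15 | P1 | P7 |
| chaos | P10 | P1 | P2 |
| deltablue | P1 | P7 | P6 |
| go | P1 | P5 (dict) | P3 |
| hexiom | P1 | P5 | P3 |
| nqueens | P2 | P1 | P5 |
| meteor_contest | P5 | P4 | P1 |
| comprehensions | P2 | P6 | P5 |
| deepcopy | P13 | P5 | P7 |
| pprint | P15 | P1 | P5 |
| sqlite_synth | P14 (_sqlite) | - | - |
| tornado_http | P12 | P15 | P5 |
| typing_runtime | P7 | P5 | P1 |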

Subsystems (audit + ports)

Each subsystem below lists, in order:

  1. Audit — what's already in tree (files + LOC) and what's idle
  2. Gap — concrete missing piece(s)
  3. Phases — shippable chunks, in PR-sized increments
  4. Gate — the test/bench signal that proves the phase landed
  5. Estimated win — geomean impact when the phase ships

P0. pyperformance harness — three-way baseline gate

Audit. bench/ shipped 2026-05-16. install_cpython.sh, install_pypy.sh (pinned to PyPy 3.11 v7.3.22 outside the working tree at $HOME/pypy3.11/), run_one.sh, run_small.sh, run_full.sh, cmd/compare/main.go. Eight standalone benches under bench_sources/. First end-to-end run on M4 + macOS 15.7.7 produced the table in "Current benchmark results" below.

Gap.

  • run_full.sh is a placeholder; pyperformance's full corpus has not been driven through run_one.sh against gopy yet.
  • No CI gate. baseline_v0124.json not frozen.
  • Bench-source iteration counts are tuned for cpython ~30-300 ms; PyPy is now warm (geomean 0.80x cpython, matching published 7.3 numbers) but gopy times balloon to 80 s on the dispatch benches. Need an automatic gopy-only iteration shrink for benches where gopy is >100x cpython, so the small subset stays under 10 min.

Phases.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P0.1 | Automatic iteration scaler in run_one.sh: probe cpython wall time, then scale bench iter_count for gopy via GOPY_BENCH_SCALE env var so wall time stays under 30 s. Shipped: BASELINE_JSON + TARGET_WALL_MS + EST_SLOWDOWN drive bench_scale(), which sets GOPY_BENCH_SCALE per bench and scales measured wall time back up. | DONE | ca0bef1 |
| P0.2 | Freeze bench/baseline_v0124.json. Add bench/compare-baseline subcommand: a >10% regression on the same host fails CI. Shipped: bench/baseline_v0124.json + bench/cmd/compare-baseline/main.go (tolerance flag, status-drop + regression gates, exits non-zero on either). | DONE | ca0bef1 |
| P0.3 | Wire bench/run_small.sh into .github/workflows/. Run nightly + on every PR that touches compile/, vm/, specialize/, optimizer/, objects/. Shipped: .github/workflows/bench.yml (schedule + path-filtered pull_request + workflow_dispatch), uploads results_small.md and the raw JSONs as artifacts. | DONE | ca0bef1 |
| P0.4 | Extend bench_sources/ to cover every primary-column bench in the coverage matrix that gopy can currently run. Target: 20 benches. Shipped: 20 standalone scripts under bench/bench_sources/ (call_method, chaos, comprehensions, deepcopy, fannkuch, float, go_bench, hexiom, json_dumps, logging_bench, nbody, nqueens, pidigits, pprint_bench, raytrace, regex_compile, richards, spectral_norm, typing_runtime, unpack_sequence). | DONE | ca0bef1 |
| P0.5 | run_full.sh against pyperformance's vendored sources via the existing shim; mark unsupported benches as module_missing rather than N/A. Current run_full.sh walks bench_sources/ only; vendored pyperformance corpus + module_missing classification still pending. | WIP | - |
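
The arithmetic behind the P0.1 scaler can be sketched as below. This is an illustration only: the shipped logic lives in run_one.sh, and the Go function names and parameters here are hypothetical stand-ins for the values the row says come from BASELINE_JSON, EST_SLOWDOWN and TARGET_WALL_MS. The sketch also assumes wall time is linear in iteration count, which is what makes scaling the measured time back up meaningful.

```go
package main

import "fmt"

// benchScale returns the fraction of the full iteration count gopy should
// run so that the projected wall time stays at or under targetWallMs.
func benchScale(cpythonWallMs, estSlowdown, targetWallMs float64) float64 {
	projected := cpythonWallMs * estSlowdown
	if projected <= 0 || projected <= targetWallMs {
		return 1.0 // already within budget: run the full iteration count
	}
	return targetWallMs / projected
}

// fullRunEstimate scales a wall time measured on the shrunken run back up
// to what the full iteration count would have taken (linearity assumed).
func fullRunEstimate(measuredMs, scale float64) float64 {
	return measuredMs / scale
}

func main() {
	// 100 ms on cpython, estimated 2000x slowdown, 30 s budget.
	scale := benchScale(100, 2000, 30000)
	fmt.Printf("GOPY_BENCH_SCALE=%.3f\n", scale)
	fmt.Printf("estimated full-run wall: %.0f ms\n", fullRunEstimate(28500, scale))
}
```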

Gate. bench/run_small.sh exit 0 + table written to bench/results_small.md; CI re-runs and the regression check passes.

Estimated win. n/a (tooling).

P1. Specializer + inline caches — Python/specialize.c

Audit. Already in tree at ~3500 LOC under specialize/:

| File | Role |
| --- | --- |
| backoff.go | 16-bit warmup/cooldown counter machinery |
| cache.go | Per-op cache cell layouts |
| core.go + quicken.go | Specialize() rewriter + Quicken() seeder |
| load_attr.go | 12 LOAD_ATTR specialized variants |
| binary_op.go | 9 BINARY_OP variants (INT/FLOAT/STR x +,-,*) |
| call.go + call_kw.go | CALL_PY_EXACT_ARGS, BUILTIN_O/FAST, BOUND_METHOD |
| compare_op.go | COMPARE_OP_INT/FLOAT/STR |
| contains_op.go | CONTAINS_OP_DICT/SET |
| for_iter.go | FOR_ITER_LIST/TUPLE/RANGE/GEN |
| load_global.go | LOAD_GLOBAL_MODULE/BUILTIN |
| load_super_attr.go | LOAD_SUPER_ATTR_ATTR/METHOD |
| send.go | SEND_GEN |
| store_attr.go | STORE_ATTR_INSTANCE_VALUE/SLOT/WITH_HINT |
| store_subscr.go | STORE_SUBSCR_LIST_INT/DICT |
| to_bool.go | TO_BOOL_INT/FLOAT/STR/NONE/BOOL/LIST |
| unpack_sequence.go | UNPACK_SEQUENCE_TUPLE/LIST/TWO_TUPLE |
| deopt.go | Specialized → adaptive parent table |

Tests cover the table extensively.

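Gap (the original smoking gun, two-part, as diagnosed on 2026-05-16; both parts have since been closed by P1.0 and P1.1 in commit 67abc0a).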

  1. Code.Quickened is never set true at runtime:

    $ rg "Quickened\s*=\s*true" --type go # zero hits in runtime
    $ rg "Quickened" --type go | rg -v _test # all reads, no writes
    objects/code.go:76 Quickened bool (declaration)
    vm/adaptive.go:41,54,73 if !e.f.Code.Quickened { return }
    monitor/install.go:126,177 same gate
  2. The compiler emits no inline CACHE cells. Confirmed experimentally on 2026-05-16: setting Quickened = true from liftCode / liftNestedCode / unmarshalCode corrupts every non-trivial program (the IP walks off the end at len=8 for 1 == 1) because specialize.Quicken writes seed counters into what it expects to be CACHE codeunits but are actually real opcodes. CPython's Python/compile.c:write_instr emits a CACHE pseudo-op block sized by _PyOpcode_Caches[op] after every adaptive instruction; the assembler serializes them as zero codeunits; _PyCode_Quicken is what fills them in.

Until both gaps are closed, every adaptive opcode's "attempt to specialize" path is short-circuited. call_method (2407x cpython) is the most visible victim — every method call rebuilds the bound method, walks the MRO, allocates a tuple of args, even though LOAD_ATTR_METHOD_WITH_VALUES and CALL_PY_EXACT_ARGS are both written and tested.
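
The layout requirement in part 2 can be illustrated with a toy assembler. This is a sketch under stated assumptions, not gopy's compile/assemble.go: the opcode numbers, the cacheCount table and the helper names are all hypothetical, and the cache sizes are illustrative values. It shows only the invariant that matters: every adaptive instruction is followed by a fixed number of zeroed code units, and instruction stepping must skip them.

```go
package main

import "fmt"

// instr is one logical instruction before cache cells are laid out.
type instr struct{ op, arg byte }

// Hypothetical opcode numbers for this sketch only.
const (
	opCache    byte = 0
	opLoadFast byte = 1
	opLoadAttr byte = 2
	opCall     byte = 3
)

// cacheCount plays the role of a per-opcode cache-size table.
// The sizes are illustrative values for this sketch.
var cacheCount = map[byte]int{opLoadAttr: 9, opCall: 3}

// assemble emits each instruction as one 2-byte code unit followed by
// cacheCount[op] zeroed code units that a quickener can later fill in.
func assemble(prog []instr) []byte {
	var out []byte
	for _, in := range prog {
		out = append(out, in.op, in.arg)
		for i := 0; i < cacheCount[in.op]; i++ {
			out = append(out, opCache, 0)
		}
	}
	return out
}

// next returns the offset of the instruction after the one at ip,
// skipping that instruction's cache cells.
func next(code []byte, ip int) int {
	return ip + 2*(1+cacheCount[code[ip]])
}

func main() {
	code := assemble([]instr{{opLoadFast, 0}, {opLoadAttr, 1}, {opCall, 0}})
	for ip := 0; ip < len(code); ip = next(code, ip) {
		fmt.Printf("offset %2d: op %d\n", ip, code[ip])
	}
}
```

If the cells are not emitted, a seeder that writes counters at ip+2 overwrites the next real opcode, which matches the corruption described in part 2.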

Adjacent gaps surface once the above are closed:

  • The eval loop's LOAD_ATTR_* dispatch table needs an entry point for every specialized variant declared in specialize/load_attr.go. Spot-check vm/eval.go and vm/eval_call.go for missing case arms.
  • monitor/install.go:177 only Quickens when monitoring is off; the default path on import skips it. Wiring belongs in pythonrun/run.go (after parse → compile → marshal load) and imp/ (after marshal.loads(.pyc body)).

Phases.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P1.0 | Port Python/compile.c:write_instr and Python/instruction_sequence.c cache-cell emission. After every adaptive opcode, the assembler emits _PyOpcode_Caches[op] zero codeunits so the bytecode layout matches CPython. instr_size, dis CACHE-skipping, vm advance() / jumpBy() all updated. Goldens and the marshal roundtrip test refreshed. Shipped: compile/opcode_caches.go is the single source of truth (CacheCount(op)); compile/assemble.go, assemble_locations.go, dis.go, marshal/code.go, vm/eval.go all consult it; v05test goldens (class_pass, def_add_one, if_pass, while_pass) refreshed for the wider bytecode. | DONE | 67abc0a |
| P1.1 | Wire specialize.Enable into pythonrun.liftCode, vm.liftNestedCode, and marshal.unmarshalCode. Shipped: pythonrun/runstring.go:122, vm/eval_simple.go:52, marshal/code.go:239 all call specialize.Enable(out). Quickened = true + CacheObjects []Object slab (gopy's stand-in for CPython's pointer cache cells; Go can't pack GC pointers in []byte). Full go test ./... green. | DONE | 67abc0a |
| P1.2 | Audit vm/eval.go for missing specialized-opcode dispatch arms. Coverage achieved via vm/adaptive.go:maybeDeopt: every specialized variant rewrites back to its adaptive parent before dispatch, and the parent body runs. The full deopt table in specialize/deopt.go enumerates every CPython 3.14 specialized opcode. Correctness complete; per-variant fast paths land under P1.4. | DONE | 67abc0a |
| P1.3 | Wire de-opt. vm/adaptive.go:53 maybeDeopt calls specialize.Deopt + specialize.Unspecialize, and vm/adaptive.go:72 adaptiveTick drives the counter and routes triggers into the per-family specializers. No panics, no re-walks. | DONE | 67abc0a |
| P1.4a | Extend specializer emission coverage. CPython 3.14 ships specialized opcode variants across 13 families; gopy's emission state per family is broken out in the P1.4a sub-table below. Faithful port of classify_descriptor lives at specialize/descr_classify.go. | WIP | 67abc0a |
| P1.4b | VM fast-path arms for each specialized opcode. Framework landed at vm/eval_specialized.go:trySpecialized, wired into vm/dispatch.go before maybeDeopt so hot sites take the fast path first and fall through to deopt on guard miss. Prerequisite: Code.CacheObjects []Object parallel slab is gopy's stand-in for CPython's in-cache pointer slots (Go cannot stash GC-tracked pointers in a []byte); specialize.{Set,}CacheObject stamp / read by codeunit index, validity gated by the same version cells. Per-family arm state in the P1.4b sub-table below. | WIP | 691c2d7, 71a9181, 6a8aace |
| P1.5 | Bytecode cache persistence: Code.Quickened + CacheObjects slab survive marshal.dumps/marshal.loads so .pyc files retain specialization (CPython persists the warmed cache via the co_quickened byte-blob next to co_code). Requires marshal-writer extension for the parallel-pointer slab; the Code.Quickened flag itself rides in the existing flags word. | TODO | - |
| P1.6 | Cross-cutting coherency: install dict watcher (P5.5) + type-version invalidation (P7.5) hooks at specialize.Enable time so inline caches invalidate atomically on dict/type mutation. Without this, every LOAD_ATTR / LOAD_GLOBAL inline cache risks reading stale state after a class attribute assignment. | TODO | - |

P1.4a sub-table — specializer emission per family. Numbers report shipped variants vs the CPython 3.14 variant count, then list the variants still missing. CPython 3.14 reference: Python/specialize.c.

| Family | Coverage | Variants shipped | Missing | Status | Commit |
| --- | --- | --- | --- | --- | --- |
| LOAD_ATTR | 9/13 | MODULE, CLASS, CLASS_WITH_METACLASS_CHECK, SLOT, INSTANCE_VALUE, WITH_HINT, PROPERTY, METHOD_NO_DICT, NONDESCRIPTOR_NO_DICT | METHOD_WITH_VALUES, NONDESCRIPTOR_WITH_VALUES, METHOD_LAZY_DICT, GETATTRIBUTE_OVERRIDDEN; these need Py_TPFLAGS_INLINE_VALUES / managed-dict-offset / __getattribute__-override modelling in objects/type.go | WIP | 67abc0a |
| STORE_ATTR | 3/3 | INSTANCE_VALUE, SLOT, WITH_HINT | - | DONE | 67abc0a |
| LOAD_GLOBAL | 2/2 | MODULE, BUILTIN | - | DONE | 67abc0a |
| COMPARE_OP | 3/3 | INT, FLOAT, STR | - | DONE | 67abc0a |
| CONTAINS_OP | 2/2 | DICT, SET | - | DONE | 67abc0a |
| FOR_ITER | 4/4 | LIST, TUPLE, RANGE, GEN | - | DONE | 67abc0a |
| LOAD_SUPER_ATTR | 2/2 | ATTR, METHOD | - | DONE | 67abc0a |
| SEND | 1/1 | GEN | - | DONE | 67abc0a |
| STORE_SUBSCR | 2/2 | LIST_INT, DICT | - | DONE | 67abc0a |
| TO_BOOL | 6/6 | BOOL, INT, LIST, NONE, STR, ALWAYS_TRUE | - | DONE | 67abc0a |
| UNPACK_SEQUENCE | 3/3 | TWO_TUPLE, TUPLE, LIST | - | DONE | 67abc0a |
| BINARY_OP | 13/14 | ADD_INT, SUBTRACT_INT, MULTIPLY_INT, ADD_FLOAT, SUBTRACT_FLOAT, MULTIPLY_FLOAT, ADD_UNICODE, INPLACE_ADD_UNICODE, SUBSCR_LIST_INT, SUBSCR_TUPLE_INT, SUBSCR_STR_INT, SUBSCR_DICT, SUBSCR_LIST_SLICE | BINARY_OP_EXTEND is JIT-only and intentionally skipped | DONE | 67abc0a |
| CALL | 5/16 | PY_EXACT_ARGS, PY_GENERAL, BOUND_METHOD_EXACT_ARGS, BOUND_METHOD_GENERAL, NON_PY_GENERAL | 8 builtin variants (CALL_BUILTIN_FAST, CALL_BUILTIN_O, CALL_METHOD_DESCRIPTOR_*, CALL_ISINSTANCE, CALL_LEN, CALL_LIST_APPEND, CALL_ALLOC_AND_ENTER_INIT) collapse into CALL_NON_PY_GENERAL; this needs METH_* calling-convention flags on BuiltinFunction. CALL_TYPE_1, CALL_STR_1, CALL_TUPLE_1 also pending. | WIP | 67abc0a |

P1.4b sub-table — VM fast-path arms per family. Each row tracks the arm count shipped in vm/eval_specialized*.go and the parity gate that backs it.

| Family | Arms shipped | Source | Variants | Gate | Status | Commit |
| --- | --- | --- | --- | --- | --- | --- |
| LOAD_ATTR | 8/9 emitted | vm/eval_specialized.go | MODULE, SLOT, CLASS, CLASS_WITH_METACLASS_CHECK, METHOD_NO_DICT, NONDESCRIPTOR_NO_DICT, PROPERTY, INSTANCE_VALUE | specialize/gatedata/spec_property.py (TestGateSpecPropertyAndMethod) | WIP: WITH_HINT deferred until dict keys-version cache stamping lands | 691c2d7, 71a9181 |
| TO_BOOL | 6/6 | vm/eval_specialized.go | BOOL, INT, LIST, NONE, STR, ALWAYS_TRUE | vm/eval_specialized_test.go | DONE | 691c2d7 |
| COMPARE_OP | 3/3 | vm/eval_specialized_compare.go | INT, FLOAT, STR | vm/eval_specialized_test.go | DONE | 691c2d7 |
| CONTAINS_OP | 2/2 | vm/eval_specialized.go | DICT, SET | vm/eval_specialized_test.go | DONE | 691c2d7 |
| UNPACK_SEQUENCE | 3/3 | vm/eval_specialized.go | TWO_TUPLE, TUPLE, LIST | vm/eval_specialized_test.go | DONE | 691c2d7 |
| STORE_SUBSCR | 2/2 | vm/eval_specialized.go | LIST_INT, DICT | vm/eval_specialized_test.go | DONE | 691c2d7 |
| BINARY_OP | 13/13 non-JIT | vm/eval_specialized_binary_op.go | ADD_INT, SUBTRACT_INT, MULTIPLY_INT (math/bits overflow guard); ADD_FLOAT, SUBTRACT_FLOAT, MULTIPLY_FLOAT; ADD_UNICODE shared with INPLACE_ADD_UNICODE; SUBSCR_LIST_INT, SUBSCR_TUPLE_INT, SUBSCR_STR_INT (ASCII fast path), SUBSCR_DICT, SUBSCR_LIST_SLICE | specialize/gatedata/spec_binary_op.py (TestGateSpecBinaryOp) | DONE | 6a8aace |
| FOR_ITER | 0/4 | - | - | - | TODO: needs typed Next helpers on objects.{listIterator,tupleIterator,rangeIterator} so the arm can skip the IterNext slot lookup | - |
| LOAD_GLOBAL | 2/2 | vm/eval_specialized_load_global.go | MODULE, BUILTIN | specialize/gatedata/spec_load_global.py (TestGateSpecLoadGlobal) | DONE | 2f1f603 |
| STORE_ATTR | 1/3 | vm/eval_specialized_store_attr.go | SLOT (faithful 1-1 port of CPython's macro: validate type_version, write to cached Instance.slots[idx]) | specialize/gatedata/spec_store_attr.py (TestGateSpecStoreAttr) | WIP: INSTANCE_VALUE and WITH_HINT deliberately deferred; they need a Dict.SetValueAt(slot, value) primitive that writes the entry's value cell without re-hashing the key, plus the managed-dict-offset modelling listed in P1.4a. Shipping them before that lands forces a shim that re-runs SetItem(name, value), which is exactly the ad-hoc patch the ground rule forbids. | 96130ac |
| SEND | 0/1 | - | - | - | TODO: depends on generator-frame plumbing | - |
| LOAD_SUPER_ATTR | 0/2 | - | - | - | TODO | - |
| CALL | 0/5 emitted | - | - | - | TODO: gated on closing P1.4a CALL gap first | - |
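
The "fast path first, fall through on guard miss" shape that P1.4b describes can be sketched as follows. This is a simplified model, not the code in vm/eval_specialized.go: the Type, Instance and attrCache types and all function names are hypothetical. The guard is a type-version comparison, in the spirit of the STORE_ATTR SLOT row above.

```go
package main

import "fmt"

// Type and Instance are toy stand-ins for gopy's object model.
type Type struct {
	version   uint32         // 0 means "no valid version"
	slotIndex map[string]int // attribute name -> slot index
}

type Instance struct {
	typ   *Type
	slots []any
}

// attrCache models the inline cache cells written by the specializer.
type attrCache struct {
	typeVersion uint32
	index       int
}

// loadAttrSlotFast is the specialized arm: one guard, one indexed load.
// It reports false on a guard miss so the caller can fall back.
func loadAttrSlotFast(obj *Instance, c attrCache) (any, bool) {
	if obj.typ.version == 0 || obj.typ.version != c.typeVersion {
		return nil, false
	}
	return obj.slots[c.index], true
}

// loadAttrGeneric is the slow path that the adaptive parent would run.
func loadAttrGeneric(obj *Instance, name string) (any, bool) {
	i, ok := obj.typ.slotIndex[name]
	if !ok {
		return nil, false
	}
	return obj.slots[i], true
}

// loadAttr tries the specialized arm first and falls through on a miss.
// The second result reports whether the fast path was taken.
func loadAttr(obj *Instance, name string, c attrCache) (any, bool) {
	if v, ok := loadAttrSlotFast(obj, c); ok {
		return v, true
	}
	v, _ := loadAttrGeneric(obj, name)
	return v, false
}

func main() {
	point := &Type{version: 7, slotIndex: map[string]int{"x": 0}}
	p := &Instance{typ: point, slots: []any{41}}
	v, fast := loadAttr(p, "x", attrCache{typeVersion: 7, index: 0})
	fmt.Println(v, fast)
}
```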

Gate.

  • specialize/integration_test.go — run richards.py 3 times under a harness that asserts the specialized opcodes outnumber generic by 10:1 after warmup.
  • Small-subset bench: call_method, richards, regex_compile drop to <200x cpython (from 1899x-2407x).

Estimated win. 6-10x geomean improvement. Single biggest lever.

P2. Tier-2 micro-op interpreter — Python/executor_cases.c.h, Python/optimizer_bytecodes.c

Audit. Actual LOC under optimizer/ is 13,501 (not the ~23k earlier estimate); the discrepancy was the difference between wc -l of generated stub bodies and what was actually shipping. Per-file breakdown:

| File | LOC | Role |
| --- | --- | --- |
| uops_stubs_gen.go | 8263 | per-uop stub bodies (generated; all 271 are deopt pass-throughs) |
| symbols.go | 734 | symbolic-state lattice (Python/optimizer_symbols.c) |
| uop_ids_gen.go | 661 | uop opcode enum (generated) |
| uops_dispatch_gen.go | 592 | dispatch switch |
| trace.go | 486 | trace projection (Python/optimizer.c:553-987) |
| types.go | 404 | metadata |
| analysis.go | 354 | analysis pass (Python/optimizer_analysis.c:625-654) |
| uop_meta_gen.go | 335 | generated metadata |
| executor.go | 324 | lifecycle (Python/optimizer.c:216-272,1100-1115,1417-1518) |
| watcher.go | 320 | type / dict mutation callbacks |
| optimize.go | 258 | optimization driver (Python/optimizer.c:113-163) |
| uops_impl.go | 174 | hand-written uop bodies |
| side_table.go | 143 | side-table for backedges |
| uops.go | 132 | executor entry + trampoline |
| pyobject.go | 128 | PyObject helpers |
| bloom.go | 86 | bloom filter (Python/optimizer.c:1357-1414) |
| uops_print.go | 60 | dis output |
| dis_hook.go | 47 | dis integration |

Stubs are generated for all 319 uop IDs. The hand-ported set in uops_impl.go covers 14 uops, but only 3 of them (_LOAD_FAST, _STORE_FAST, _CHECK_VALIDITY) are P2.2 hot-path targets. The other 11 are scaffolding: _NOP, _EXIT_TRACE, _JUMP_TO_TOP, _START_EXECUTOR, _SET_IP, _POP_TOP, _COPY, _SWAP, _PUSH_NULL, _LOAD_FAST_BORROW, _MAKE_WARM.

Gap (the smoking gun for P2). The tier-2 entry gate is wired, but interp.JIT is hardcoded false at vm/tier2.go:36:

    func (e *EvalState) tryWarmupTier2(...) {
        if !interp.JIT {
            return
        }
        ...
    }

rg -n "interp\.JIT\s*=" --type go returns zero hits. The projection (trace.go), analysis (analysis.go), executor (executor.go), and dispatch loop (vm/tier2.go:enterExecutor) are all wired but never reachable.

The other two structural gaps are full-file ports that have not started:

  • Python/optimizer_bytecodes.c (1107 LOC, 0 ported). The abstract-interpreter case table optimize_uops is supposed to dispatch through. gopy's analysis.go:optimizeUops (lines 230-256) iterates the trace with an empty per-opcode dispatcher and bails to "unknown semantics" on every row. No constant folding, no guard elimination, no type narrowing.
  • Python/executor_cases.c.h (7163 LOC, 0 ported as real bodies). The 271 stubs all return s.unimplementedUop(NAME) which deopts to tier-1. Hot paths like _BINARY_OP_ADD_INT, _GUARD_BOTH_INT, _LOAD_ATTR_INSTANCE_VALUE, _CALL_PY_EXACT_ARGS, _PUSH_FRAME, _FOR_ITER_TIER_TWO, _GUARD_TYPE_VERSION, _RESUME_CHECK are all stubs.

Two deprecated-shim flags annotate the situation: uops_impl.go:14 and analysis.go:23 both carry DEPRECATED (spec 1714) notes indicating the uop bodies should move to vm/eval_uops_gen.go once the cases-generator port (spec 1714) ships.

Phases (full-file ports, no piecemeal uop cherry-picking).

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P2.1 | Open the JIT gate. Flip interp.JIT to true by default (or behind a -O2 CLI flag) so trace projection actually fires. Add optimizer/trace_test.go that runs call_method.py and asserts ≥1 trace was projected. | TODO | - |
| P2.2 | Port Python/optimizer_bytecodes.c in full (1107 LOC). This is the abstract-interpreter case table that optimize_uops dispatches through. Lands as optimizer/optimizer_bytecodes_gen.go driven by the spec-1714 cases generator. Gate: every uop ID has a corresponding case body (no unknown semantics bail). | TODO | - |
| P2.3 | Port Python/executor_cases.c.h in full (7163 LOC) into vm/eval_uops_gen.go. Driven by the spec-1714 cases generator. Replaces the 271 deopt-pass-through stubs in optimizer/uops_stubs_gen.go. Gate: every uop ID has a real executable body. | TODO | - |
| P2.4 | Wire tier-2 → tier-1 deopt path: on guard fail mid-trace, fall back to the adaptive opcode at the recorded resume offset. Validate against _CHECK_VALIDITY and _GUARD_TYPE_VERSION failure scenarios. | TODO | - |
| P2.5 | Turn on the tier-2 executor by default for any function that has been Quickened (depends on P1.5 marshal persistence so warm caches survive). | TODO | - |
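
The P2.4 deopt contract (a guard fails mid-trace, control returns to tier-1 at the recorded resume offset) can be sketched with a toy trace executor. Everything here is hypothetical: the four uop kinds, the uop struct and run are illustration only and do not mirror optimizer/executor.go or the real uop set.

```go
package main

import "fmt"

type uopKind int

// A toy uop set for this sketch only.
const (
	uLoadConst uopKind = iota
	uGuardBothInt
	uAddInt
	uExitTrace
)

// uop carries an operand and the tier-1 bytecode offset to resume at if
// the uop deopts (guards) or exits (uExitTrace).
type uop struct {
	kind    uopKind
	operand int
	target  int
}

// run executes a linear trace. On guard failure it stops immediately and
// returns the untouched stack plus the resume offset with deopted=true.
func run(trace []uop, stack []any) (out []any, resume int, deopted bool) {
	for _, u := range trace {
		switch u.kind {
		case uLoadConst:
			stack = append(stack, u.operand)
		case uGuardBothInt:
			n := len(stack)
			if n < 2 {
				return stack, u.target, true
			}
			_, okA := stack[n-2].(int)
			_, okB := stack[n-1].(int)
			if !okA || !okB {
				return stack, u.target, true
			}
		case uAddInt:
			n := len(stack)
			a, b := stack[n-2].(int), stack[n-1].(int)
			stack = append(stack[:n-2], a+b)
		case uExitTrace:
			return stack, u.target, false
		}
	}
	return stack, -1, false
}

func main() {
	trace := []uop{
		{uLoadConst, 2, 0},
		{uGuardBothInt, 0, 10},
		{uAddInt, 0, 0},
		{uExitTrace, 0, 14},
	}
	st, resume, deopted := run(trace, []any{40})
	fmt.Println(st, resume, deopted)
	st, resume, deopted = run(trace, []any{"x"})
	fmt.Println(st, resume, deopted)
}
```

The design point the sketch tries to show is that a guard must fail before any side effect of the guarded operation, so tier-1 can re-execute the adaptive opcode from the same stack state.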

Gate.

  • optimizer/uops_test.go covers every uop ID with one positive case and one guard-fail case (table-driven, generated).
  • optimizer/bench_test.go::BenchmarkTier2Nbody shows the tier-2 path is ≥2x faster than tier-1 on the warm loop.
  • The 11 scaffolding uops in uops_impl.go can stay hand-written; P2.2/P2.3 covers the 271 stubs that currently deopt.

Estimated win. 1.5-2x on top of P1.

P3. PyLong fast path — Objects/longobject.c

Audit. CPython 3.14 Objects/longobject.c is 6871 LOC and exports ~90 public PyLong_* functions. gopy has selective coverage across 6 files totalling ~1050 LOC:

| File | LOC | Role |
| --- | --- | --- |
| objects/int.go | 216 | NewInt, NewIntFromBig, Int64, BigInt, Sign. Constructor + getters. |
| objects/long_cache.go | 77 | small-int singleton cache [-5, 256] (SmallInt) |
| objects/long_arith.go | 157 | intAdd, intSub, intMul, intFloorDiv, intMod, intDivmod, intPower |
| objects/long_bitwise.go | 165 | intAnd, intOr, intXor, intLshift, intRshift, intInvert |
| objects/long_misc.go | 152 | intAbs, intNeg, intPos, intHash, intBool |
| objects/long_parse.go | 285 | intFromString |

Audit verified NewInt(x int64) consults smallIntFromInt64(x) at int.go:67-75 and returns the singleton when x is in [-5, 256], so the small-int cache is wired (the earlier draft was wrong on that point). Every arithmetic op still allocates a fresh *Int and routes through math/big.Int, even when both sides fit in int64.

Gap.

  • No compact representation: Int always carries a heap-allocated big.Int (int.go:14-16). CPython packs |n| < 2^30 inline in the PyLong header via _PyLong_IsCompact.
  • No int64 fast-path: intAdd at long_arith.go:17-39 unwraps both operands and calls big.Int.Add unconditionally. No short-circuit for (a.v.IsInt64() && b.v.IsInt64()) && (no overflow).
  • __index__ slot is defined on NumberMethods (slots.go) but not wired on IntType at int.go:56-59.
  • Unported PyLong functions include PyLong_AsLongAndOverflow, PyLong_AsInt, PyLong_AsNativeBytes (added in CPython 3.13), PyLong_FromNativeBytes, PyLong_AsDouble, _PyLong_Frexp, and the v3.14 streaming PyLongWriter_* API.

Phases.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P3.1 | objects/long_fast.go: detect inline-representable values, store unboxed int64 alongside big.Int. Add compactValue int64; isCompact bool (or single int64 with sentinel bit). | TODO | - |
| P3.2 | Route New(int64) and FromString through long_cache.go for [-5, 256]. Allocation-free. | TODO | - |
| P3.3 | Add/Sub/Mul/Neg/Abs fast-path: int64 arithmetic with overflow check when both compact; fall back to big.Int on overflow. | TODO | - |
| P3.4 | __index__ / PyLong_AsLong fast path. | TODO | - |
| P3.5 | _PyLong_FromUint64 / _PyLong_FromInt64 mirrored constructors that bypass big.Int when input fits compact. | TODO | - |
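
P3.1 and P3.3 together amount to a two-representation integer with an overflow-checked int64 path. The sketch below is one possible shape under stated assumptions: the Int type, its wide field and the constructor names are hypothetical and do not match objects/int.go, and only Add is shown. The overflow test is the standard sign check on the wrapped sum rather than a math/bits call.

```go
package main

import (
	"fmt"
	"math/big"
)

// Int holds either a compact int64 or, when wide is non-nil, a big.Int.
// The invariant is that wide is nil whenever the value fits in int64.
type Int struct {
	compact int64
	wide    *big.Int
}

func FromInt64(v int64) Int { return Int{compact: v} }

// FromBig normalises back to the compact form when the value fits.
func FromBig(b *big.Int) Int {
	if b.IsInt64() {
		return Int{compact: b.Int64()}
	}
	return Int{wide: new(big.Int).Set(b)}
}

func (x Int) toBig() *big.Int {
	if x.wide != nil {
		return x.wide
	}
	return big.NewInt(x.compact)
}

// Add takes the allocation-free path when both operands are compact and
// the sum does not overflow; otherwise it falls back to big.Int.
func Add(a, b Int) Int {
	if a.wide == nil && b.wide == nil {
		s := a.compact + b.compact // wraps on overflow in Go
		// Overflow happened iff both operands differ in sign from s.
		if (a.compact^s)&(b.compact^s) >= 0 {
			return Int{compact: s}
		}
	}
	return FromBig(new(big.Int).Add(a.toBig(), b.toBig()))
}

func (x Int) String() string { return x.toBig().String() }

func main() {
	fmt.Println(Add(FromInt64(2), FromInt64(3)))
	fmt.Println(Add(FromInt64(9223372036854775807), FromInt64(1)))
}
```

The gate's cross-check (fast-path result equals the big.Int result) falls out naturally from this shape, because toBig gives the slow-path operand for any value.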

Gate.

  • objects/long_arith_test.go adds a cross-check: every fast-path result equals the big.Int slow-path result on a 10k-entry random table.
  • BenchmarkLongAddSmall/BenchmarkLongMulSmall show 0 allocs and ≥5x speedup vs the current path.
  • pidigits bench drops from 7.83x to under 2x cpython.

Estimated win. 3x on integer-heavy benchmarks (pidigits, pyflate, go, hexiom). Geomean impact ~1.4x.

P4. PyUnicode kind tags — Objects/unicodeobject.c

Audit. objects/unicode*.go uses Go's UTF-8 string as backing storage, plus unicode_ctype.go for category lookups. Indexing, slicing, find/count/replace all walk bytes.

Gap.

  • No kind tag (Latin-1/BMP/full Unicode).
  • Indexing is O(n) for any non-ASCII string. find, count, replace likewise walk by rune.
  • str.encode/bytes.decode round-trips through the rune iterator.

Phases.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P4.1 | objects/unicode_kind.go: detect kind at construction. Latin-1: byte-equal to ASCII; BMP: re-encode to []uint16; Full: []rune. | TODO | - |
| P4.2 | Kind-dispatched __getitem__, __len__, slicing. Latin-1 hits a byte-index path (allocation-free for single chars via small-string cache). | TODO | - |
| P4.3 | Kind-dispatched find, rfind, count, index, replace, split. Latin-1 → bytes.IndexByte / bytes.Count (memchr speed). | TODO | - |
| P4.4 | _PyUnicodeWriter port (lands with P15). | TODO | - |
| P4.5 | Small-string cache: __getitem__ returning a one-char str is allocation-free for ASCII. | TODO | - |
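
Kind detection (P4.1) and a kind-dispatched index (P4.2) could look roughly like the sketch below. It is an assumption-laden illustration: the Kind type, detectKind and charAt are hypothetical names, and the sketch adds a separate ASCII kind because only pure-ASCII text has the property that a byte offset in Go's UTF-8 string equals the code point index. It keeps the UTF-8 backing string rather than re-encoding, so only the ASCII case gets the O(1) path.

```go
package main

import "fmt"

type Kind uint8

const (
	KindASCII  Kind = iota // every code point < U+0080
	KindLatin1             // max code point <= U+00FF
	KindBMP                // max code point <= U+FFFF
	KindFull               // at least one code point above U+FFFF
)

// detectKind scans the string once and returns the narrowest kind that
// can hold every code point. Invalid UTF-8 decodes as U+FFFD (BMP).
func detectKind(s string) Kind {
	k := KindASCII
	for _, r := range s {
		switch {
		case r > 0xFFFF:
			return KindFull
		case r > 0xFF:
			k = KindBMP
		case r > 0x7F && k < KindLatin1:
			k = KindLatin1
		}
	}
	return k
}

// charAt returns the i-th code point. ASCII strings index bytes directly;
// every other kind falls back to a rune walk in this sketch.
func charAt(s string, k Kind, i int) (rune, bool) {
	if i < 0 {
		return 0, false
	}
	if k == KindASCII {
		if i >= len(s) {
			return 0, false
		}
		return rune(s[i]), true
	}
	for _, r := range s {
		if i == 0 {
			return r, true
		}
		i--
	}
	return 0, false
}

func main() {
	for _, s := range []string{"abc", "h\u00e9llo", "\u65e5\u672c", "\U0001F600"} {
		fmt.Printf("%q kind=%d\n", s, detectKind(s))
	}
}
```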

Gate.

  • objects/unicode_kind_test.go covers indexing/slicing/find/count for all three kinds against the cpython-reference behavior.
  • BenchmarkStrFindAscii shows kind-1 strings hit the byte-find fast path (alloc count = 0).
  • regex_compile ratio compresses (P1 is primary; P4 is secondary).

Estimated win. 2x on string-heavy benchmarks (regex_compile, html5lib, mako, django_template).

P5. Dict open-addressing + split keys — Objects/dictobject.c

Audit. CPython 3.14 Objects/dictobject.c is 7824 LOC. gopy's dict already uses an open-addressed layout (the earlier draft was wrong about map[any]any). Supporting files:

| File | Role |
| --- | --- |
| dict.go | combined dict, already open-addressed: entries []dictEntry + order []int |
| dict_split.go | shared-keys surface (NewSplitDict, ConvertToCombined); zero memory savings |
| dict_lookup.go | lookup dispatch via d.lookup(hash, key) |
| dict_iter.go | iteration ordered by order slot indices |
| dict_mutate.go | insert/delete/resize, drives invalidateKeysVersion |
| dict_specialize.go | DictMutationHook (fired on every mutation), IsKeysUnicode, LookupString, GetKeysVersion |

dict_split.go is honest about the surface-only gap: NewSplitDict returns a regular combined Dict pre-populated with the shared key names mapped to None. Instances do not share keys with the type; the storage savings CPython gets from split-keys are zero in gopy.
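
For contrast, a split-keys layout that does save memory keeps the key table once per type and gives each instance only a values slab. The following is a minimal sketch of that idea, assuming string keys and a fixed key set; SharedKeys and SplitDict here are hypothetical types that reuse the spec's vocabulary but are not the ones in dict_split.go.

```go
package main

import "fmt"

// SharedKeys is owned by the type: one name -> index table for all
// instances of that type.
type SharedKeys struct {
	index map[string]int
}

func NewSharedKeys(names ...string) *SharedKeys {
	k := &SharedKeys{index: make(map[string]int, len(names))}
	for _, n := range names {
		if _, dup := k.index[n]; !dup {
			k.index[n] = len(k.index)
		}
	}
	return k
}

// SplitDict is the per-instance part: a values slab plus presence flags.
// It stores no keys and no hashes of its own.
type SplitDict struct {
	keys    *SharedKeys
	values  []any
	present []bool
}

func NewSplitDict(k *SharedKeys) *SplitDict {
	n := len(k.index)
	return &SplitDict{keys: k, values: make([]any, n), present: make([]bool, n)}
}

func (d *SplitDict) Get(name string) (any, bool) {
	i, ok := d.keys.index[name]
	if !ok || !d.present[i] {
		return nil, false
	}
	return d.values[i], true
}

// Set reports false when name is not a shared key. A full implementation
// would materialise a combined dict at that point; this sketch does not.
func (d *SplitDict) Set(name string, v any) bool {
	i, ok := d.keys.index[name]
	if !ok {
		return false
	}
	d.values[i], d.present[i] = v, true
	return true
}

func main() {
	keys := NewSharedKeys("x", "y")
	a, b := NewSplitDict(keys), NewSplitDict(keys)
	a.Set("x", 1)
	b.Set("x", 2)
	va, _ := a.Get("x")
	vb, _ := b.Get("x")
	fmt.Println(va, vb, a.keys == b.keys)
}
```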

Verified layout at dict.go:30-59:

    type Dict struct {
        Header
        entries       []dictEntry // open-addressed slot array
        order         []int       // insertion-order indices
        used, fill    int
        kind          dictKind
        sharedKeys    *SharedKeys
        keysVersion   uint32 // dk_version (specializer)
        mutationCount uint32 // watcher tally
    }

    type dictEntry struct {
        hash        int64
        key, value  Object
        used, dummy bool
    }

The hooks the specializer needs are mostly plumbed: invalidateKeysVersion fires DictMutationHook(d) from dict_mutate.go:82 (insert), :105 (delete), :118 (resize).

Gap.

  • Split-keys saves zero memory; every instance still carries a full Dict. CPython's PyDictKeys_NumValues / per-instance values[] slab is not modelled.
  • No PyDict_Watch subscription API. DictMutationHook is a bare function-pointer at module scope (dict_specialize.go:98-108) intended for the tier-2 optimizer to install at WatcherInit time. No public watcher-handle API exists for user code or other subsystems.
  • No _PyDict_SetItem_KnownHash fast path. dictInsert at dict_mutate.go:60-84 always rehashes via d.lookup(hash, key), ignoring a pre-computed hash even when the caller (e.g. a LOAD_ATTR specialized arm) knows it.
  • Cross-cutting: P1 inline caching cannot safely cache dict keys across calls until P5.5 watcher + P7 type-version invalidation land together. Today the cache works only because the specializer refuses to elide the keys_version check on the hot path.

Phases.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P5.1 | Audit / regression-check the existing open-addressed layout against Objects/dictobject.c:lookdict probe sequence. Add objects/dict_lookup_parity_test.go table-driven from CPython's hash collisions. | TODO | - |
| P5.2 | Real split-keys storage: per-type SharedKeys object owns the entries-array shape; instance __dict__ carries values []Object only. Materialise to combined on delete or non-shared insert. Cite Objects/dictobject.c:insertion_resize_inplace. | TODO | - |
| P5.3 | _PyDict_SetItem_KnownHash fast path: skip rehash when caller passes the hash. Wire from LOAD_ATTR / LOAD_GLOBAL specialized arms. Cite Objects/dictobject.c:_PyDict_SetItem_KnownHash. | TODO | - |
| P5.4 | Public watcher subscription API: PyDict_Watch(watcher_id, dict) / PyDict_AddWatcher(callback) -> int8_t. Cite Objects/dictobject.c:5797 PyDict_AddWatcher. Replaces the bare DictMutationHook pointer. | TODO | - |
| P5.5 | Install the watcher at specialize.Enable time + invalidate inline caches on dict mutation. Interacts with P1.6. | TODO | - |
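
A handle-based replacement for the single DictMutationHook pointer (P5.4) might take the shape sketched below. This is an assumption, not the planned gopy API: the Watchers registry, the Event values, the toy Dict and the eight-slot cap are all hypothetical, chosen only to show the AddWatcher / Watch split where registration returns an id and each dict carries a bitmask of the watchers observing it.

```go
package main

import (
	"errors"
	"fmt"
)

const maxWatchers = 8 // assumed small fixed cap so ids fit in a uint8 bitmask

type Event int

const (
	EventAdded Event = iota
	EventModified
	EventDeleted
)

// Dict is a toy dict; watchers has bit i set when watcher i observes it.
type Dict struct {
	items    map[string]any
	watchers uint8
}

func NewDict() *Dict { return &Dict{items: map[string]any{}} }

type Callback func(ev Event, d *Dict, key string)

// Watchers is the registry that hands out watcher ids.
type Watchers struct {
	callbacks [maxWatchers]Callback
}

// AddWatcher registers cb and returns its id, or an error if all slots
// are taken.
func (w *Watchers) AddWatcher(cb Callback) (int, error) {
	for id, existing := range w.callbacks {
		if existing == nil {
			w.callbacks[id] = cb
			return id, nil
		}
	}
	return -1, errors.New("no free watcher slots")
}

// Watch subscribes watcher id to mutations of d.
func (w *Watchers) Watch(id int, d *Dict) error {
	if id < 0 || id >= maxWatchers || w.callbacks[id] == nil {
		return errors.New("invalid watcher id")
	}
	d.watchers |= 1 << uint(id)
	return nil
}

func (w *Watchers) notify(ev Event, d *Dict, key string) {
	for id := 0; id < maxWatchers; id++ {
		if d.watchers&(1<<uint(id)) != 0 {
			w.callbacks[id](ev, d, key)
		}
	}
}

// SetItem and DelItem notify before mutating, so a callback can still
// observe the old state (for example to invalidate an inline cache).
func (w *Watchers) SetItem(d *Dict, key string, value any) {
	ev := EventAdded
	if _, ok := d.items[key]; ok {
		ev = EventModified
	}
	w.notify(ev, d, key)
	d.items[key] = value
}

func (w *Watchers) DelItem(d *Dict, key string) {
	if _, ok := d.items[key]; !ok {
		return
	}
	w.notify(EventDeleted, d, key)
	delete(d.items, key)
}

func main() {
	var w Watchers
	id, _ := w.AddWatcher(func(ev Event, d *Dict, key string) {
		fmt.Printf("event %d on key %q\n", ev, key)
	})
	globals := NewDict()
	_ = w.Watch(id, globals)
	w.SetItem(globals, "x", 1)
	w.SetItem(globals, "x", 2)
	w.DelItem(globals, "x")
}
```

Unwatched dicts pay only one zero-bitmask check per mutation in this shape, which is the property the specializer needs if watchers are installed at specialize.Enable time.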

Gate.

  • objects/dict_oa_test.go cross-checks every op against a reference implementation on a randomized workload.
  • BenchmarkDictLookup shows 0 allocations on the hot path.
  • meteor_contest / go benches drop primarily on P5.

Estimated win. 2x on attribute- and call-method-heavy code.

P6. Frame free-list + LOAD_FAST_CHECK — Objects/frameobject.c, Python/ceval.c

Audit. objects/frame.go, objects/frame_locals.go, objects/frame_snapshot.go cover the frame + locals representation. vm/eval.go allocates a fresh frame per call. P6.2 LOAD_FAST_CHECK shipped via spec 1716:

  • compile/flowgraph_cfg_locals.go:320-358 scanBlockForLocals detects uninitialized locals and rewrites LOAD_FAST → LOAD_FAST_CHECK.
  • vm/eval_dispatch_handwritten.go:63-72 opLOAD_FAST_CHECK mirrors CPython's bytecodes.c check.
  • Opcode 88 in compile/opcodes_gen.go matches CPython 3.14's metadata.

Gap.

  • No frame free-list. Every function call allocates *Frame + a fresh []Object for locals + a fresh stack slice.
  • No LOAD_FAST_BORROW / STORE_FAST_STORE_FAST opcodes (the CPython 3.14 opcodes that elide the incref pair).
  • vm/eval_call.go rebuilds the args tuple per call even for CALL_PY_EXACT_ARGS.

Phases.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P6.1 | vm/frame_pool.go: per-goroutine free list, capped at 20. Recycle frame + locals + stack slices; reset, not free. | TODO | - |
| P6.2 | LOAD_FAST_CHECK codegen in compile/flowgraph_cfg_locals.go:scanBlockForLocals + eval arm in vm/eval_dispatch_handwritten.go:opLOAD_FAST_CHECK. | DONE (spec 1716) | - |
| P6.3 | LOAD_FAST_BORROW / STORE_FAST_STORE_FAST (CPython 3.14 new opcodes that elide the incref pair). | TODO | - |
| P6.4 | Args-tuple bypass: CALL_PY_EXACT_ARGS stores args directly into the callee's frame locals. | TODO | - |
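A minimal sketch of the P6.1 free list, assuming a single owner (one pool per goroutine, so no locking). `Frame`, `FramePool`, `Get`, `Put`, and `resize` are hypothetical names; only the cap of 20 and the "reset, not free" rule come from the phase description.

```go
package main

import "fmt"

type Object any

// Frame is a stand-in for the VM frame: just locals and an operand stack.
type Frame struct {
	Locals []Object
	Stack  []Object
}

const maxFreeFrames = 20 // cap taken from P6.1

// FramePool is meant to be owned by one goroutine, so it takes no lock.
type FramePool struct {
	free []*Frame
}

// Get returns a recycled frame when one is available, else a fresh one.
func (p *FramePool) Get(nlocals, stackSize int) *Frame {
	var f *Frame
	if n := len(p.free); n > 0 {
		f = p.free[n-1]
		p.free[n-1] = nil
		p.free = p.free[:n-1]
	} else {
		f = &Frame{}
	}
	f.Locals = resize(f.Locals, nlocals)
	f.Stack = resize(f.Stack, stackSize)[:0] // capacity reserved, length zero
	return f
}

// Put resets the frame and keeps it for reuse unless the pool is full.
func (p *FramePool) Put(f *Frame) {
	if len(p.free) >= maxFreeFrames {
		return // over the cap: drop it and let the Go GC reclaim it
	}
	clear(f.Locals)               // drop references so recycled frames do not pin objects
	clear(f.Stack[:cap(f.Stack)]) // same for anything left in the stack's backing array
	p.free = append(p.free, f)
}

// resize reuses the backing array when it is large enough.
func resize(s []Object, n int) []Object {
	if cap(s) >= n {
		s = s[:n]
		clear(s)
		return s
	}
	return make([]Object, n)
}

func main() {
	var pool FramePool
	f := pool.Get(3, 8)
	f.Locals[0] = "arg"
	pool.Put(f)
	g := pool.Get(2, 4)
	fmt.Println("recycled:", f == g, "locals reset:", g.Locals[0] == nil)
}
```

Clearing on `Put` rather than on `Get` means a parked frame never keeps Python objects alive while it sits in the pool.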

Gate.

  • vm/frame_pool_test.go proves recycle works under load.
  • BenchmarkCallNop shows 0 allocations on the hot path.

Estimated win. 1.5x on call-heavy code (richards, deltablue).

P7. Type slot caching — Objects/typeobject.c

Audit. CPython 3.14 Objects/typeobject.c is 12,302 LOC. gopy spreads its type implementation across objects/type.go, type_call.go, type_attr.go, type_getsets.go, type_repr.go, type_specialize.go, usertype.go. The MRO walk lives in descr.go:LookupDescriptor. type_specialize.go is the hook the specializer calls.

Slot tables (NumberMethods, SequenceMethods, MappingMethods, AsyncMethods) exist in slots.go covering most of CPython's nb_*, sq_*, mp_*, am_* slots, but objects/type_slots.go does not exist; the spec's reference to it is aspirational.

The type carries a versionTag uint32 at type.go:197 plus VersionTag() / InvalidateVersionTag() accessors in type_specialize.go:10-39.

Gap.

  • LookupDescriptor(t, "__add__") at descr.go:101-114 walks t.MRO on every invocation. No slot-table cache. Operator dispatch (intAdd, intMul, etc.) re-resolves descriptors per call.
  • No _PyType_AssignSpecialMethods equivalent. NewType at type.go:255-266 builds MRO but does not pre-populate operator slots from MRO.
  • versionTag is never automatically invalidated. Searching for InvalidateVersionTag returns zero call sites in type_attr.go or the rest of objects/; manual invalidation is the only path. Class __setattr__, MRO recomputation, and __bases__ reassignment do not bump the tag.
  • The Index slot on NumberMethods is defined but not wired on IntType at int.go:56-59.

Phases.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P7.1 | objects/type_slots.go: full slot-table struct mirroring CPython PyTypeObject (nb_add, sq_length, mp_subscript, tp_call, tp_iter, ...). | TODO | - |
| P7.2 | _PyType_AssignSpecialMethods: walk the MRO once at type creation, populate the slot table. | TODO | - |
| P7.3 | Type version tag (monotonic uint32 bumped on MRO mutation, class __setattr__, __class__ reassignment). | TODO | - |
| P7.4 | Operator dispatch (abstract_binop.go, abstract_sequence.go) consults the slot table first; falls back to Lookup only if slot nil. | TODO | - |
| P7.5 | Invalidation hook: type-version change auto-stales every inline cache keyed on that version (interacts with P1). | TODO | - |
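How P7.2, P7.3, and P7.5 fit together can be sketched as follows. Everything here is hypothetical scaffolding: the `Slots` struct models only an `Add` slot, the MRO is a naive concatenation rather than C3, and subclasses are not re-slotted when a base changes. The point is only the shape: resolve once at type creation, bump a version on class mutation, and let an inline cache compare versions.

```go
package main

import "fmt"

type BinaryFunc func(a, b int) int

// Slots is a tiny stand-in for the P7.1 slot table; only nb_add is modelled.
type Slots struct{ Add BinaryFunc }

type Type struct {
	Name       string
	MRO        []*Type // self first; naive linearisation, not C3
	Dict       map[string]BinaryFunc
	Slots      Slots
	versionTag uint32
}

var nextVersionTag uint32 // monotonic source of version tags

func NewType(name string, dict map[string]BinaryFunc, bases ...*Type) *Type {
	if dict == nil {
		dict = map[string]BinaryFunc{}
	}
	t := &Type{Name: name, Dict: dict}
	t.MRO = []*Type{t}
	for _, b := range bases {
		t.MRO = append(t.MRO, b.MRO...)
	}
	t.assignSpecialMethods()
	return t
}

// lookup is the slow path: a full MRO walk.
func (t *Type) lookup(name string) BinaryFunc {
	for _, k := range t.MRO {
		if f, ok := k.Dict[name]; ok {
			return f
		}
	}
	return nil
}

// assignSpecialMethods walks the MRO once and assigns a fresh version tag.
func (t *Type) assignSpecialMethods() {
	t.Slots.Add = t.lookup("__add__")
	nextVersionTag++
	t.versionTag = nextVersionTag
}

// SetAttr models class __setattr__: mutate, re-slot, bump the version.
func (t *Type) SetAttr(name string, f BinaryFunc) {
	t.Dict[name] = f
	t.assignSpecialMethods()
}

// AddCache is an inline cache keyed on the type version.
type AddCache struct {
	version uint32
	fn      BinaryFunc
}

// Call refills the cache only when the type version has moved.
func (c *AddCache) Call(t *Type, a, b int) int {
	if c.version != t.versionTag {
		c.version, c.fn = t.versionTag, t.Slots.Add
	}
	return c.fn(a, b)
}

func main() {
	base := NewType("Base", map[string]BinaryFunc{"__add__": func(a, b int) int { return a + b }})
	sub := NewType("Sub", nil, base)
	var cache AddCache
	fmt.Println(cache.Call(sub, 2, 3))
	sub.SetAttr("__add__", func(a, b int) int { return a * b })
	fmt.Println(cache.Call(sub, 2, 3))
}
```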

Gate.

  • All existing operator tests stay green.
  • objects/slots_test.go: slot table populated correctly for a hand-rolled type; invalidates on mutation.
  • richards ratio compresses by another ~2x on top of P1.

Estimated win. 1.5x on operator-heavy code (richards, deltablue, typing_runtime_protocols).

P8. Augmented STORE_SUBSCR codegen — Python/compile.c

Symptom. target[idx] -= rhs raises TypeError: 'int' object does not support item assignment whenever target is bound through a nested unpack in a for-loop. Confirmed reproducer:

pairs = [(([1,2,3], [4,5,6], 7), ([10,20,30], [40,50,60], 70))]
for ((p1, v1, m1), (p2, v2, m2)) in pairs:
    v1[0] -= 100  # raises, even though v1 is correctly a list

v1[0] = 99 works on the same binding; v1[0] -= 100 does not.

Gap. gopy's compiler lowers v[0] -= rhs into an opcode sequence that misroutes STORE_SUBSCR's container target after BINARY_OP. The SET_ITEM dispatches against the loaded value (an int) instead of the list. cpython's correct sequence is:

LOAD_FAST v
LOAD_CONST 0 ; index
COPY 2 ; dup container
COPY 2 ; dup index
BINARY_SUBSCR ; loads v[0]
LOAD_CONST 100
BINARY_OP -=
SWAP 3 ; stack: ..., new_val, index, container
SWAP 2 ; restore stack: ..., new_val, container, index
STORE_SUBSCR

gopy is likely missing the COPY 2 / SWAP 3 pair, so the slot where STORE_SUBSCR expects the container holds the loaded int instead of the saved list.

Phases.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P8.1 | Capture gopy dis output for the reproducer; diff against cpython 3.14. Land the diff in compile/augassign_test.go::TestStoreSubscrSequence. | TODO | - |
| P8.2 | Fix the lowering in compile/codegen.go (Subscript LHS in augmented context). | TODO | - |
| P8.3 | Extend the test matrix: augmented STORE_SUBSCR with all bound-context flavors (nested unpack, dict.get returns, comprehension target). | TODO | - |
| P8.4 | Same audit for augmented STORE_ATTR (obj.attr -= rhs). | TODO | - |

Gate. nbody, fannkuch run to completion under bin/gopy; both show up with real numbers in the small-subset table.

Estimated win. Unblocks 2 N/A benches.

P9. int.__format__ format-spec parser — Python/formatter_unicode.c

Symptom. '{0:04x}'.format(255) raises TypeError: unsupported format string passed to int.__format__. stdlib/json/encoder.py:31 ('\\u{0:04x}'.format(i) in ESCAPE_DCT initialisation) hits this on import json, blocking json_dumps.

Gap. gopy's int formatter parses bare type codes (x, o, b, d) only. It rejects any prefix carrying fill/align/sign/alt/width/grouping/precision.

Phases.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P9.1 | objects/long_format.go: port Python/formatter_unicode.c:parse_internal_render_format_spec into an InternalFormatSpec struct (fill, align, sign, alt, width, grouping, precision, type). | TODO | - |
| P9.2 | Wire int.__format__ to the parsed spec; route through the existing decimal/hex/octal/binary renderers, applying padding + alignment + sign + grouping. | TODO | - |
| P9.3 | Float-spec coercion: '{:.2g}'.format(255) promotes the int to float and dispatches to float.__format__. Mirror cpython. | TODO | - |
| P9.4 | Table-driven test pulled from CPython Lib/test/test_format.py. | TODO | - |

Gate. objects/long_format_test.go matches cpython output on every spec from test_format.py. json_dumps runs to completion under bin/gopy.

Estimated win. Unblocks 1 N/A bench plus removes a class of silent-format failures hiding in other stdlib paths.

P10. Float fast path — Objects/floatobject.c

Audit. objects/float.go, objects/float_parse.go. Stored as boxed *Float wrapping a Go float64. Every Float{v: x} is a heap allocation.

Gap.

  • No free list / small-float cache.
  • _BINARY_OP_ADD_FLOAT is in the specializer's vocabulary but the eval arm allocates a fresh *Float per op. CPython has the same per-op cost but its tier-2 executor can elide it; gopy's tier-2 executor doesn't see floats yet.
  • float.__format__ may share P9's spec-parser gap; audit before P9 ships.

Phases.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P10.1 | objects/float_pool.go: per-goroutine free list for *Float. Lookback list of N=128 recently-freed Float pointers. Reset, don't re-allocate. | TODO | - |
| P10.2 | BINARY_OP_ADD_FLOAT / SUBTRACT_FLOAT / MULTIPLY_FLOAT / TRUE_DIVIDE_FLOAT fast path: if the LHS is a temporary (refcount=1, recycled from the pool), mutate in place. | TODO | - |
| P10.3 | _BINARY_OP_*_FLOAT tier-2 uops hand-ported (depends on P2.2). | TODO | - |
| P10.4 | float.__format__ audit + spec-parser share with P9. | TODO | - |

Gate. BenchmarkFloatAddHot shows allocation-free path. nbody ratio compresses (P8 must land first).

Estimated win. 2.5x on float-heavy benchmarks (nbody, raytrace, spectral_norm, scimark_*). Geomean ~1.3x.

P11. Compiler CFG optimizer + peephole — Python/flowgraph.c, Python/compile.c

Audit. Closed via spec 1716. compile/flowgraph_cfg_passes.go hosts the four big passes plus peephole, ported 1:1 from Python/flowgraph.c:

| CPython function | gopy site |
| --- | --- |
| _PyCfg_FromInstructionSequence | spec 1715 phase 2 (#657) |
| _PyCfg_OptimizedCfgToInstructionSequence | spec 1716 C.1 (#669) |
| cfg_jump_thread | flowgraph_cfg_passes.go:2069-2080 cfgJumpThread |
| remove_unreachable_basic_blocks | flowgraph_cfg_passes.go:476-513 cfgRemoveUnreachable |
| remove_redundant_jumps | flowgraph_cfg_passes.go:449-474 cfgRemoveRedundantJumps |
| fold_const_binop | flowgraph_cfg_passes.go:1717-1764 basicblockFoldConstBinop |
| fold_const_unaryop | flowgraph_cfg_passes.go:1390-1420 basicblockFoldConstUnaryop |
| optimize_basic_block | flowgraph_cfg_passes.go:1444-1655 optimizeBasicBlockCFG |
| _PyCfg_OptimizeCodeUnit | flowgraph_cfg_passes.go:2375-2412 cfgOptimizeCodeUnit |

Phases.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P11.1 | compile/flowgraph_cfg.go: basic-block graph construction. Cite Python/flowgraph.c:_PyCfg_FromInstructionSequence. | DONE | spec 1715 phase 1 (#659) |
| P11.2 | Port the four big passes: jump threading, eliminate-after-terminator, fold-constant-jumps, prune-unreachable. | DONE | spec 1715 phase 3 (#656) + spec 1716 phase C.1 (#669) |
| P11.3 | Port the peephole table from Python/flowgraph.c:optimize_basic_block. | DONE | spec 1715 phase 3 (#656) |
| P11.4 | dis.dis integration: the optimizer pass runs before final emission via cfgOptimizeCodeUnit. | DONE | spec 1716 phase D (#672) |

Gate. compile/flowgraph_cfg_passes_test.go is table-driven against cpython Lib/test/test_peepholer.py cases. The L1 codegen + L3/L4 assemble parity gates landed in spec 1716 phase E (#673).

Estimated win. 1.1-1.15x geomean (small but uniform). Already realised.

P12. Generator + coroutine fast path — Objects/genobject.c

Audit. objects/generator.go, objects/async_gen.go, vm/eval_gen.go, vm/eval_resume.go. gopy uses a goroutine + channel model (one goroutine per generator body, channels for send / yield), so the "per-send frame copy" cost the original draft cited does not apply. The frame is owned by the generator's goroutine; send is a channel write and a select, not a snapshot restore.

CPython 3.14 reference: Objects/genobject.c:gen_send_ex2 (line 192), gen_send_ex (298), gen_iternext (630), gen_throw (599), gen_close (387). gopy parity:

| CPython entry | gopy site |
| --- | --- |
| gen_send | objects/generator.go:101-110 genSendMethod |
| gen_iternext | objects/generator.go:255 genIterNext |
| gen_throw | objects/generator.go:125-141 genThrowMethod |
| gen_close | objects/generator.go:143-156 genCloseMethod |
| async_gen_anext | objects/async_gen.go:58-72 |
| async_gen_asend | objects/async_gen.go:58-72 |
| async_gen_athrow | objects/async_gen.go:58-72 |

GET_AITER / GET_ANEXT fast paths are already in place in vm/eval_gen.go.

Gap.

  • SEND opcode is not yet a tier-2 uop (gated on P2.3).
  • Async-bench coverage is blocked first on the asyncio module port (spec 1711). Generator dispatch is not the dominant cost.

Phases.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P12.1 | Generator/coroutine core (channel + goroutine model). Frame owned by goroutine, no per-send copy. | DONE | - |
| P12.2 | SEND opcode tier-2 uop. Gated on P2.3 (Python/executor_cases.c.h full port). | TODO | - |
| P12.3 | GET_AITER / GET_ANEXT / END_ASYNC_FOR fast path. | DONE | - |
| P12.4 | Coroutine suspend/resume via goroutine + channel swap. | DONE | - |

Gate. objects/generator_test.go::BenchmarkGenSendHot shows ≤2 allocations per send (Go runtime overhead for the channel handoff). generators bench drops to under 5x cpython once tier-2 SEND lands.

Estimated win. Already realised for sync generators. Blocked on asyncio (spec 1711) for async benches.

P13. GC tracking + generational collector — Python/gc.c

Audit. module/gc/ is substantially in tree (38 files). The tracking machinery, the Python-facing API, and most introspection helpers are ported:

| CPython entry | gopy site |
| --- | --- |
| PyObject_GC_RegisterFinalizer | module/gc/gc.go:27-34 RegisterFinalizer |
| PyObject_CallFinalizerFromDealloc | module/gc/gc.go:41-62 Finalize |
| _PyObject_GC_TRACK | module/gc/gc.go:68-81 Track |
| _PyObject_GC_UNTRACK | module/gc/gc.go:89-101 Untrack |
| _PyObject_GC_IS_TRACKED | module/gc/gc.go:106-111 IsTracked |
| gc_collect_impl | module/gc/module.go:92-112 gcCollect (delegates to runtime.GC()) |
| gc_enable_impl / gc_disable_impl / gc_isenabled_impl | module/gc/module.go:117-138 |
| gc_get_threshold_impl / gc_set_threshold_impl | module/gc/module.go:143-182 (wired but not driving collections) |
| gc_get_count_impl | module/gc/module.go:187-197 gcGetCount |
| gc_is_tracked_impl | module/gc/module.go:202-210 gcIsTracked |
| gc_get_objects_impl | module/gc/module.go:215-236 gcGetObjects |
| gc_get_referrers_impl | module/gc/module.go ~240+ gcGetReferrers |
| gc_get_referents_impl | module/gc/module.go ~270+ gcGetReferents |

State machine in module/gc/state.go (~250 LOC) carries a 3-generation counter but does not drive collections.

Gap.

  • gc.set_threshold(g0, g1, g2) stores values but does not gate runtime.GC() invocations on threshold crossings.
  • gc.collect(generation) delegates to runtime.GC() rather than walking the gopy gen-N lists.
  • __del__ ordering is Go GC traversal order, not CPython gen-N finalisation order.

Phases.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P13.1 | Drive gc.collect(generation) and gc.set_threshold(g0, g1, g2) from module/gc/state.go generation counters. Trigger runtime.GC() only when gen-0 threshold crossed. Track gen-1/gen-2 promotions. | TODO | - |
| P13.2 | Python-level finalizer queue: order __del__ calls by gc-generation. | TODO | - |
| P13.3 | Cycle detection for __del__ resurrected objects. | TODO | - |

Gate. module/gc/gc_test.go mirrors cpython Lib/test/test_gc.py. The gc_collect bench returns plausible numbers (within 10x cpython; we can't beat Go's GC).

Estimated win. Low geomean impact (gc_collect alone). Mostly unblocks the cpython test suite gc tests.

P14. Native C-extension paths — _pickle, _elementtree, _sqlite3

Audit. Native-module reality (verified 2026-05-19):

| Module | gopy directory | Status |
| --- | --- | --- |
| _pickle | module/_pickle/ does not exist | Absent. No pure-Python fallback either. |
| _elementtree | module/_elementtree/, module/xml/ do not exist | Absent. |
| _sqlite3 | module/_sqlite3/ does not exist | Absent. |
| _csv | module/_csv/ exists; stdlib/csv.py exists (19186 bytes) | Partial (pure-Python fallback in tree). |

Gap.

  • pickle / unpickle cannot run at all (no fallback to import).
  • xml_etree_* cannot run (xml.etree.ElementTree requires _elementtree).
  • sqlite_synth cannot run.
  • _csv benchmarks run via the pure-Python fallback (~10x slower than the C _csv CPython uses by default).

CPython sources to port from:

| File | LOC | Role |
| --- | --- | --- |
| Modules/_pickle.c | 8500 | Pickle protocol 5 encoder + decoder |
| Modules/_elementtree.c | 4000 | XML element tree |
| Modules/_sqlite/ | 6000 | sqlite3 connection/cursor |
| Modules/_csv.c | 1600 | C-native csv reader/writer |

Critical pickle protocol-5 opcodes from Modules/_pickle.c:107-137: PROTO (0x80), FRAME (0x95), SHORT_BINUNICODE (0x8c), SHORT_BINBYTES (0x43), STACK_GLOBAL (0x93), MEMOIZE (0x94), BYTEARRAY8 (0x96).

Phases.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P14.1 | module/_pickle/: Go-native pickle protocol 5 encoder + decoder. Full port of Modules/_pickle.c (8500 LOC). | TODO | - |
| P14.2 | module/_elementtree/: thin wrapper over encoding/xml matching the cpython _elementtree API. Full port of Modules/_elementtree.c (4000 LOC). | TODO | - |
| P14.3 | module/_sqlite3/: cgo binding to libsqlite3 or pure Go via modernc.org/sqlite. Full port of Modules/_sqlite/ (6000 LOC). | TODO | - |
| P14.4 | module/_csv/: Go-native csv reader/writer matching Modules/_csv.c (1600 LOC). | WIP | - |

Gate. pickle / unpickle benches drop to under 3x cpython. xml_etree_* benches drop to under 5x.

Estimated win. Targeted; only the named benches. Critical because three pyperformance benches are currently un-runnable.

P15. Unicode writer + string concat — Objects/unicodeobject.c

Audit. Zero of CPython's 13 _PyUnicodeWriter_* functions are ported (Objects/unicodeobject.c:13737-14243). gopy concatenates strings via the Go string + string operator, allocating per op. Format/join paths build intermediate strings. There is no objects/unicode_writer.go.

Functions to port (with CPython line refs):

| CPython function | Line | Role |
| --- | --- | --- |
| _PyUnicodeWriter_Init | 13737 | init writer struct |
| _PyUnicodeWriter_InitWithBuffer | 13794 | init from buffer |
| _PyUnicodeWriter_Update | 13713 | internal update |
| _PyUnicodeWriter_PrepareInternal | 13804 | pre-allocate buffer |
| _PyUnicodeWriter_PrepareKindInternal | 13882 | kind-aware prepare |
| _PyUnicodeWriter_WriteCharInline | 13903 | inline single-char write |
| _PyUnicodeWriter_WriteChar | 13914 | single-char write |
| _PyUnicodeWriter_WriteStr | 13932 | write substring |
| _PyUnicodeWriter_WriteSubstring | 14007 | write slice |
| _PyUnicodeWriter_WriteASCIIString | 14063 | ASCII fast path |
| _PyUnicodeWriter_WriteLatin1String | 14186 | Latin-1 fast path |
| _PyUnicodeWriter_Finish | 14200 | finalise + return string |
| _PyUnicodeWriter_Dealloc | 14243 | cleanup |

Gap.

  • No _PyUnicodeWriter equivalent. json_dumps, logging, mako, django_template all hit this.
  • str.join allocates the join separator slice per call.
  • % formatting and str.format go through immutable concat.
  • f-string codegen produces FORMAT_VALUE + BUILD_STRING which does N concats for an N-piece f-string.

Phases. P15.1 depends on P4.1 (kind detection) so the writer's Finish() can pack into the right backing storage.

| Phase | Description | Status | Commit |
| --- | --- | --- | --- |
| P15.1 | objects/unicode_writer.go: pre-sized writer with kind-aware finalisation (matches P4). Port the 13 _PyUnicodeWriter_* functions in full. API: WriteStr, WriteASCII, WriteRune, Finish() *Unicode. | TODO | - |
| P15.2 | Re-route str.join, str.format, % formatting through the writer. Audit objects/str_methods.go + objects/str_format.go. | TODO | - |
| P15.3 | BUILD_STRING opcode lowering: emit a single writer.Finish() call instead of N concats. Touch vm/eval_dispatch_gen.go. | TODO | - |
| P15.4 | f-string codegen: in compile/codegen.go, lower an f-string's pieces directly into writer calls (skip FORMAT_VALUE + BUILD_STRING). Shares P9 spec-parser. | TODO | - |
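A minimal sketch of the P15.1 writer shape, assuming a UTF-8 byte buffer as backing storage. The method names follow the API line in P15.1, but the `Writer` type, the `Kind` constants, and the `(string, Kind)` return from `Finish` are hypothetical simplifications of `Finish() *Unicode`; the real writer would pack into kind-specific storage from P4.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Kind is the per-code-point storage width the finished string needs.
type Kind uint8

const (
	Kind1Byte Kind = 1 // all code points <= U+00FF
	Kind2Byte Kind = 2 // all code points <= U+FFFF
	Kind4Byte Kind = 4
)

// Writer accumulates pieces in one pre-sized buffer and tracks the
// largest code point seen so Finish can pick the kind without a rescan.
type Writer struct {
	buf     []byte
	maxRune rune
}

// NewWriter pre-sizes the buffer so typical formats append without growing.
func NewWriter(sizeHint int) *Writer {
	return &Writer{buf: make([]byte, 0, sizeHint)}
}

// WriteASCII is the fast path: the caller promises s is pure ASCII,
// so no per-rune scan is needed.
func (w *Writer) WriteASCII(s string) { w.buf = append(w.buf, s...) }

// WriteRune appends one code point.
func (w *Writer) WriteRune(r rune) {
	if r > w.maxRune {
		w.maxRune = r
	}
	w.buf = utf8.AppendRune(w.buf, r)
}

// WriteStr appends an arbitrary string, scanning it for the max code point.
func (w *Writer) WriteStr(s string) {
	for _, r := range s {
		if r > w.maxRune {
			w.maxRune = r
		}
	}
	w.buf = append(w.buf, s...)
}

// Finish materialises the string once and reports the kind it requires.
func (w *Writer) Finish() (string, Kind) {
	kind := Kind4Byte
	switch {
	case w.maxRune <= 0xFF:
		kind = Kind1Byte
	case w.maxRune <= 0xFFFF:
		kind = Kind2Byte
	}
	return string(w.buf), kind
}

func main() {
	w := NewWriter(32)
	w.WriteASCII("id=")
	w.WriteRune('7')
	w.WriteStr(" caf\u00e9")
	s, kind := w.Finish()
	fmt.Println(s, kind)
}
```

An N-piece f-string lowered onto this shape costs N appends into one buffer and a single string materialisation, instead of N intermediate strings.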

Gate. BenchmarkStrFormatHot allocation-free for static format strings. json_dumps, logging, pprint benches drop materially.

Estimated win. 2x on text-heavy benchmarks. Geomean ~1.2x.

Checklist

| Subsystem | CPython source | gopy destination | Estimated win | Status | Commit |
| --- | --- | --- | --- | --- | --- |
| P0. pyperformance harness | n/a (tooling) | bench/ | n/a | WIP | ca0bef1 |
| P1. Specializer wire-up | Python/specialize.c | specialize/ | 6-10x | WIP (P1.0-P1.3 done, P1.4-P1.6 open) | 67abc0a, 691c2d7, 71a9181, 6a8aace, 96130ac, 2f1f603 |
| P2. Tier-2 (full-file ports) | Python/optimizer_bytecodes.c, Python/executor_cases.c.h | optimizer/, vm/eval_uops_gen.go | 1.5-2x | WIP (scaffolding + JIT gate hardcoded off) | - |
| P3. PyLong fast path | Objects/longobject.c | objects/long_fast.go | 3x | TODO | - |
| P4. PyUnicode kind tags | Objects/unicodeobject.c | objects/unicode_kind.go | 2x | TODO | - |
| P5. Dict open-addressing | Objects/dictobject.c | objects/dict.go (extend) | 2x | WIP (open-addressed layout already in tree, split-keys + watcher API + KnownHash gaps remain) | - |
| P6. Frame free-list + LOAD_FAST_CHECK | Objects/frameobject.c, Python/ceval.c | vm/frame_pool.go, compile/flowgraph_cfg_locals.go, vm/eval_dispatch_handwritten.go | 1.5x | WIP (P6.2 done via spec 1716; P6.1/P6.3/P6.4 open) | spec 1716 |
| P7. Type slot cache | Objects/typeobject.c | objects/type_slots.go | 1.5x | TODO | - |
| P8. Aug-STORE_SUBSCR fix | Python/compile.c | compile/codegen_stmt_misc.go:85-105 | unblock 2 N/A | TODO | - |
| P9. int.__format__ spec | Python/formatter_unicode.c | objects/long_format.go | unblock 1 N/A | TODO | - |
| P10. Float fast path | Objects/floatobject.c | objects/float_pool.go | 2.5x | TODO | - |
| P11. CFG optimizer + peephole | Python/flowgraph.c | compile/flowgraph_cfg_passes.go | 1.1x | DONE (spec 1716) | 9d7d9f0, 37563f5 |
| P12. Generator fast path | Objects/genobject.c | objects/generator.go, vm/eval_gen.go | 3x async | DONE (channel + goroutine model); P12.2 SEND tier-2 uop depends on P2.3 | - |
| P13. GC tracking | Python/gc.c | module/gc/ | low geomean | WIP (~90% done; thresholds + finalizer ordering pending) | - |
| P14. Native pickle/xml/sqlite | Modules/_pickle.c, etc | module/_pickle/, etc | bench-specific | TODO | - |
| P15. Unicode writer | Objects/unicodeobject.c | objects/unicode_writer.go | 2x text | TODO | - |

Updated 2026-05-19 after the reality-check audit. Dependencies matter: P1 inline caching is unsafe to extend until P5.4 watcher API + P7.3 type-version auto-invalidation land, because today nothing tells the specializer when a class attribute changes.

  1. P8 + P9 unblock N/A benches (independent, small). v[0] -= rhs codegen fix and int.__format__ spec parser. These remove nbody, fannkuch, json_dumps from the N/A column.
  2. P5.4 watcher API + P7.2 slot pre-population + P7.3 version invalidation ship as one PR. This unblocks P1.4 deferred arms (STORE_ATTR_INSTANCE_VALUE, STORE_ATTR_WITH_HINT) and lets the specializer trust inline caches across calls.
  3. P1.4 closure: emit the remaining LOAD_ATTR arms (METHOD_WITH_VALUES, NONDESCRIPTOR_WITH_VALUES, METHOD_LAZY_DICT, GETATTRIBUTE_OVERRIDDEN) once Py_TPFLAGS_INLINE_VALUES modelling lands; then ship the FOR_ITER / SEND / LOAD_SUPER_ATTR / CALL dispatch arms (P1.4b).
  4. P1.5 marshal persistence so .pyc files retain the warm specializer state across runs.
  5. P2.1 open the JIT gate (interp.JIT = true); validate trace projection fires. Then P2.2 + P2.3 full-file ports of Python/optimizer_bytecodes.c and Python/executor_cases.c.h, driven by the spec-1714 cases generator.
  6. P3 PyLong fast path + P10 float pool ship in parallel (independent objects/ work).
  7. P4 kind tags + P15 unicode writer ship together (writer's Finish() depends on kind detection).
  8. P6.1 frame pool, P6.3 LOAD_FAST_BORROW / STORE_FAST_STORE_FAST, P6.4 args-tuple bypass in parallel.
  9. P13 GC, P14 native modules are bench-specific; pickle / xml / sqlite cannot run today so P14 is the priority among the three.

P0 and P11 are already closed (P0 small-subset, P11 entire CFG optimizer). P12 core is closed; only P12.2 SEND tier-2 uop is open, gated on P2.3.

Current benchmark results

Captured: 2026-05-16. First end-to-end P0 small-subset run with warmed-up PyPy. Each P1-P15 PR refreshes the gopy column.

Host:

  • CPU: Apple M4
  • macOS: 15.7.7
  • Go: 1.26.3 (darwin/arm64)
  • cpython: 3.14.5 (brew)
  • PyPy: 3.11.15 v7.3.22 ($HOME/pypy3.11/)
  • gopy: v0.12.0-425-gea07e20 (branch feat/v0.12.4-lexer-tokenizer)

Method:

  • Each interpreter runs the same standalone .py files under bench/bench_sources/ via bench/run_one.sh.
  • Iteration counts tuned so cpython is in the ~30-300 ms range, so PyPy gets a JIT warmup window. The earlier draft of this table (trimmed iteration counts) showed PyPy ~ cpython, which was the JIT-compile-time artifact, not steady state.
  • cpython + PyPy: 2 warmup runs + 3 timed runs per bench.
  • gopy: 1 warmup + 2 timed runs (it is ~283x slower today; the full 2+3 schedule pushes wall time past 15 min on the slow benches).

Small subset (the day-to-day gate)

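| Benchmark | cpython 3.14 (ms) | PyPy 3.11 (ms) | gopy (ms) | gopy / cpython | gopy / PyPy | PyPy / cpython |
| --- | --- | --- | --- | --- | --- | --- |
| call_method | 32.42 | 20.50 | 78043.22 | 2407.02x | 3806.80x | 0.63x |
| fannkuch | 292.52 | 82.56 | N/A | N/A | N/A | 0.28x |
| json_dumps | 97.35 | 128.47 | N/A | N/A | N/A | 1.32x |
| nbody | 57.87 | 23.90 | N/A | N/A | N/A | 0.41x |
| pidigits | 37.05 | 33.34 | 289.97 | 7.83x | 8.70x | 0.90x |
| regex_compile | 41.14 | 140.11 | 80286.50 | 1951.54x | 573.03x | 3.41x |
| richards | 42.79 | 29.30 | 81250.57 | 1898.87x | 2772.59x | 0.68x |
| unpack_sequence | 24.43 | 20.65 | 6204.49 | 253.94x | 300.53x | 0.84x |
| geomean | 55.11 | 44.24 | 15573.05 | 282.56x | 351.98x | 0.80x |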

PyPy is ~1.25x faster than cpython on geomean (6/8 benches faster, 2/8 slower) which matches the published PyPy 7.3 numbers and confirms the JIT is doing its job.

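gopy is at 283x cpython on geomean as the table computes it: the gopy geomean covers only the five benches that complete, while the cpython geomean (55.11 ms) covers all eight. Restricting both sides to the five completing benches gives roughly 446x. Either ratio compresses dramatically with P1 (specializer wire-up) alone, since without P1 every adaptive opcode short-circuits in vm/adaptive.go:41/54/73.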

Small subset, re-run 2026-05-19 (post spec 1715 + 1716 compile pipeline port)

Captured: 2026-05-19 against c012ba0 on branch feat/spec-1713-p7-pyc-writer. Same host, same harness, same warmups/runs as the 2026-05-16 snapshot. The intent of this re-run was to baseline gopy after the cfg-builder bridge (1715) and the full compile-pipeline port (1716) landed on top of the 2026-05-16 binary, so the next P1-P15 PR has an honest starting line.

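| Benchmark | cpython 3.14 (ms) | PyPy 3.11 (ms) | gopy (ms) | gopy / cpython | gopy / PyPy | PyPy / cpython |
| --- | --- | --- | --- | --- | --- | --- |
| call_method | 29.03 | 17.79 | 106905.78 | 3682.79x | 6008.47x | 0.61x |
| fannkuch | 246.21 | 71.92 | N/A | N/A | N/A | 0.29x |
| json_dumps | 86.47 | 113.70 | N/A | N/A | N/A | 1.31x |
| nbody | 31.98 | 23.64 | N/A | N/A | N/A | 0.74x |
| pidigits | 33.46 | 28.99 | 117.33 | 3.51x | 4.05x | 0.87x |
| regex_compile | 35.68 | 120.05 | 137260.51 | 3847.38x | 1143.39x | 3.37x |
| richards | 34.55 | 26.21 | 94072.02 | 2723.00x | 3588.81x | 0.76x |
| unpack_sequence | 21.84 | 17.52 | 19278.36 | 882.57x | 1100.40x | 0.80x |
| geomean | 45.32 | 39.13 | 19902.16 | 439.11x | 508.62x | 0.86x |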

Trend vs 2026-05-16 baseline (bench/baseline_v0124.json is frozen at the 2026-05-16 numbers, so bench/compare-baseline reports these as regressions until we refresh it):

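| Bench | 2026-05-16 (ms) | 2026-05-19 (ms) | Delta |
| --- | --- | --- | --- |
| pidigits | 289.97 | 117.33 | -59.5% |
| richards | 81250.57 | 94072.02 | +15.8% |
| call_method | 78043.22 | 106905.78 | +37.0% |
| regex_compile | 80286.50 | 137260.51 | +71.0% |
| unpack_sequence | 6204.49 | 19278.36 | +210.7% |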

Takeaways:

  • pidigits more than halved (-59.5%). That bench is GMP-shape arbitrary-precision int arithmetic, and the 1715 cfg-builder port collapsed several bytecode redundancies on the hot loop, exactly the shape where the flowgraph-level optimizer earns its keep.
  • The other four regressed. The two big-ticket changes between 2026-05-16 and 2026-05-19 are the cfg-builder bridge (1715) and the full Python/flowgraph.c + Python/assemble.c port (1716). Both paid for byte-equality parity with CPython (.pyc round-trip, L1-L4 gates green), not for execution speed. The CFG layer is doing strictly more work per compile (extra normalization passes, pseudo-jump rewriting, stackdepth recomputation), and the new layout is not yet feeding the VM any new fast paths because P1 has not landed. So the regression is the bill for parity work that unblocks P1 / P2 inline-caching and tier-2 wire-up.
  • unpack_sequence is the loudest regression (+211%). It is the bench most sensitive to per-call frame setup. Plausible attribution: the cfg-builder path now emits the CPython 3.14 prologue (RESUME + extra MAKE_CELL housekeeping) where the old flat-sequence path skipped some of it, but the VM still walks every prologue op generically. Concrete number to chase once P6.1 (frame pool) and P6.2 (LOAD_FAST_CHECK fast path) close.

This snapshot is the new "floor". The next P1-P7 PR must drag at least three of these benches back below the 2026-05-16 baseline column, or document why parity-driven cost is structural for that PR's scope.

Full corpus (release-tag and nightly only)

Populated when bench/run_full.sh lands its first end-to-end run. Until then, only the small subset above is the ship gate.

Caveats:

  • P8 and P9 are prerequisites for a complete table. The "N/A" cells become real numbers once those land.
  • The 5 ok benches above gate the P1-P7 ports: each PR must shrink the gopy / cpython column or document why a regression is acceptable.
  • The call_method ratio widened from earlier preliminary runs (487x → 2407x) when iteration counts increased. That is cpython's specializer kicking in on the warm loop while gopy stays at the generic dispatch path. After P1 ships, this ratio should compress by an order of magnitude.

Sources of truth

| CPython file | Lines | What it gives us |
| --- | --- | --- |
| Python/specialize.c | 3500 | Specializer (mostly already ported) |
| Python/executor_cases.c.h | 4200 | The 285 tier-2 uop bodies |
| Python/optimizer.c | 2000 | Trace projection + tier-2 entry |
| Python/flowgraph.c | 3000 | CFG optimizer + peephole |
| Python/compile.c | 7000 | Codegen incl. aug-assign lowering |
| Objects/genobject.c | 1500 | Generator + coroutine machinery |
| Python/gc.c | 3000 | Generational GC |
| Python/formatter_unicode.c | 1600 | Format-spec grammar |
| Objects/longobject.c | 6400 | Compact small-int + fast-path arith |
| Objects/floatobject.c | 2000 | Float + free list |
| Objects/unicodeobject.c | 16000 | Kind-tagged strings + writer |
| Objects/dictobject.c | 4800 | Open-addressing + split keys |
| Objects/frameobject.c | 1100 | Frame free-list |
| Objects/typeobject.c | 11000 | Slot caching |
| Include/internal/pycore_code.h | 600 | Inline cache layouts |
| Modules/_pickle.c | 8500 | Native pickle |
| Modules/_elementtree.c | 4000 | Native XML |
| Modules/_sqlite/ | 6000 | sqlite3 bindings |

Risk + scope notes

  • P1 wire-up is the single highest-leverage change. The specializer is already written and tested; flipping the Quickened flag in pythonrun//imp/ should be a one-day change with 6-10x geomean impact.
  • P3 / P5 / P7 / P10 can ship in any order; pick by who has bandwidth.
  • The 5x-faster-than-CPython aspirational target only holds on tight loops where Go's escape analysis stack-allocates frame locals and the specializer has already promoted to the type-specialized op. Geomean parity (1.5x) is the realistic ship gate.
  • P13 + P14 are bench-specific. They don't move the geomean much but unblock named benchmarks that are part of the full corpus.
  • The PyPy column is a sanity check, not a target. gopy's parity goal is against cpython; beating PyPy on specific shapes (e.g. regex_compile, where PyPy's JIT loses to cpython's C re) is a bonus, not a requirement.