1710. v0.12.4 lexer/tokenizer full port

Checklist

Sources to fully port (CPython 3.14)

Status legend: DONE = ported in full and verified, WIP = port underway, TODO = not started.

| CPython source | C LOC | gopy destination | Go LOC | Status | Commit |
|---|---|---|---|---|---|
| Parser/lexer/buffer.c | 76 | parser/lexer/buffer.go | 56 | DONE | 5374e84 |
| Parser/lexer/lexer.c | 1635 | parser/lexer/lexer.go (+ fstring.go + onechar.go) | 957 + 240 + (operator table) | DONE | function-map + verify_* (#612), check_coding_spec (#613), maybe_raise_syntax_error_for_string_prefixes (#617), update_ftstring_expr (#618) all landed |
| Parser/lexer/state.c | 151 | parser/lexer/state.go | 402 | DONE | d157189 |
| Parser/tokenizer/helpers.c | 581 | parser/lexer/helpers.go (+ encoding subset in parser/lexer/source.go) | 176 + 238 | DONE | de537e1 |
| Parser/tokenizer/file_tokenizer.c | 493 | parser/lexer/driver_file.go | 120 | DONE | 268c8f8 |
| Parser/tokenizer/readline_tokenizer.c | 134 | parser/lexer/driver_readline.go | 50 | DONE | 268c8f8 |
| Parser/tokenizer/string_tokenizer.c | 148 | parser/lexer/driver_string.go | 112 | DONE | 268c8f8 |
| Parser/tokenizer/utf8_tokenizer.c | 55 | parser/lexer/driver_string.go (utf-8 path) | (shared) | DONE | 268c8f8 |
| Python/Python-tokenize.c | 445 | module/_tokenize/module.go | 366 | DONE | 4d3b1e8 |
| Lib/keyword.py | 64 | stdlib/keyword.py | 64 | DONE | byte-equal vendor |
| Lib/tokenize.py | 598 | stdlib/tokenize.py | 598 | DONE | byte-equal vendor |
| Lib/tabnanny.py | 338 | stdlib/tabnanny.py | 338 | DONE | byte-equal vendor |

Gate tests to land green under test/cpython/

| Test | LOC | Status | Commit |
|---|---|---|---|
| test_keyword.py | 56 | DONE (10/11 sub-tests green; the eleventh hits a parser-generator gap unrelated to lexer/tokenizer: "parser: generated rule bodies not yet emitted". Also mirrored at stdtest/test_keyword.py and gated via TestStdtestCorpus.) | |
| test_utf8source.py | 41 | DONE (3/3 sub-tests green; mirrored at stdtest/test_utf8source.py) | |
| test_tabnanny.py | 354 | DONE (exits 0 after typed UnicodeDecodeError + surrogateescape decode fix; mirrored under stdtest/test_tabnanny.py) | 3066fe3 |
| test_source_encoding.py | 547 | TODO (imports clear; first hang is BytesSourceEncodingTest.test_crcrcrlf, which is exec(bytes) inside captured_stdout; the underlying gap is the VM's exec(bytes) path, not lexer/tokenizer). | |
| test_tokenize.py | 3480 | WIP. Three plumbing blockers cleared: (a) drainReadline encoding inversion fixed in 538ab52; (b) FORMAT_WITH_SPEC now routes through objects.Format in 5bd8455; (c) scanOperator now emits the specific operator token type via the new _PyToken_OneChar/TwoChars/ThreeChars port in 669c11f. unittest.main reaches the test bodies now, but most sub-tests still ERROR and one hangs: tokenize position output is off (e.g. 1 + 1 emits the implicit NEWLINE at (2, 0) (2, 2) instead of (1, 5) (1, 6), and the second check_tokenize reorders NEWLINE before COMMENT and reports it on the wrong line). The remaining work is a token-position parity pass over parser/lexer/lexer.go against Parser/lexer/lexer.c, which is outside the scope of this commit batch. | 538ab52, 5bd8455, 669c11f |
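
The NEWLINE position mismatch in the test_tokenize.py row can be pinned with a small model of the expected behaviour. The sketch below is not gopy code: implicitNewlineSpan and Pos are hypothetical names, and it assumes tokenize columns count characters rather than bytes. It covers only the case quoted above, a final line with no trailing newline.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Pos is a (line, column) pair in tokenize's convention:
// lines are 1-based, columns are 0-based.
type Pos struct{ Line, Col int }

// implicitNewlineSpan is a hypothetical helper, not a gopy API. For a final
// source line that lacks a trailing newline, it returns where the implicit
// NEWLINE token is expected: one column wide, directly after the last
// character of that same line.
func implicitNewlineSpan(lineno int, line string) (start, end Pos) {
	width := utf8.RuneCountInString(line) // assumption: columns are characters
	return Pos{lineno, width}, Pos{lineno, width + 1}
}

func main() {
	start, end := implicitNewlineSpan(1, "1 + 1")
	// Prints: NEWLINE (1, 5) (1, 6)
	fmt.Printf("NEWLINE (%d, %d) (%d, %d)\n", start.Line, start.Col, end.Line, end.Col)
}
```

A table-driven test built on a helper like this would make the parity pass checkable one line shape at a time before rerunning the full gate.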

Goal

Replace the partial lexer/tokenizer port that grew up alongside the v0.5.5 parser work with a one-to-one translation of every CPython 3.14 source file in the subsystem, then pin the result with the five Lib/test/test_* files the 1700 spec already assigned to this panel.

Today parser/lexer/lexer.go is 633 lines against CPython's 1635-line Parser/lexer/lexer.c. The delta is the gap this spec closes. The v0.12.4 series treats every subsystem the same way: port full, then gate on the upstream tests.

Sources of truth

Lexer / tokenizer C sources (3.14):

| CPython file | Lines | gopy destination |
|---|---|---|
| Parser/lexer/buffer.c | 76 | parser/lexer/buffer.go |
| Parser/lexer/lexer.c | 1635 | parser/lexer/lexer.go |
| Parser/lexer/state.c | 151 | parser/lexer/state.go |
| Parser/tokenizer/helpers.c | 581 | parser/lexer/helpers.go |
| Parser/tokenizer/file_tokenizer.c | 493 | parser/lexer/driver_file.go |
| Parser/tokenizer/readline_tokenizer.c | 134 | parser/lexer/driver_readline.go |
| Parser/tokenizer/string_tokenizer.c | 148 | parser/lexer/driver_string.go |
| Parser/tokenizer/utf8_tokenizer.c | 55 | parser/lexer/driver_string.go |
| Python/Python-tokenize.c | (see file) | module/_tokenize/ |

Python sources (3.14):

| CPython file | Lines | gopy destination |
|---|---|---|
| Lib/keyword.py | 64 | module/keyword/ |
| Lib/tokenize.py | 598 | module/tokenize/ |
| Lib/tabnanny.py | 338 | module/tabnanny/ |

Gate tests live at ~/github/python/cpython/Lib/test/: test_keyword.py, test_utf8source.py, test_source_encoding.py, test_tabnanny.py, test_tokenize.py.

Workflow

The spec follows the durable port-not-patch / full-subsystem rule. The work is broken into the phases below; each phase is one or more PRs.

Phase 1: audit + fill the C-tokenizer port

For every Parser/lexer/*.c and Parser/tokenizer/*.c function, find the Go counterpart in parser/lexer/. Where a function is missing, port it with a // CPython: <file>:<line> <function> citation. Where a function is present but diverges from CPython, rewrite it to match. The deliverable is parser/lexer Go LOC roughly matching the upstream C LOC, with every CPython function accounted for.
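
The audit can be partly mechanised by scanning the Go sources for the citation comments. The following is a minimal sketch, assuming citations follow the `// CPython: <file>:<line> <function>` shape exactly; missingCitations is a hypothetical helper, and the function names and line numbers in the sample input are illustrative placeholders, not real upstream locations.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// citationRE matches comments such as
//   // CPython: Parser/lexer/lexer.c:100 tok_nextc
// capturing the file, the line number, and the function name.
var citationRE = regexp.MustCompile(`//\s*CPython:\s*(\S+):(\d+)\s+(\w+)`)

// missingCitations returns, sorted, the upstream function names in want
// that are not cited anywhere in goSource.
func missingCitations(goSource string, want []string) []string {
	cited := map[string]bool{}
	for _, m := range citationRE.FindAllStringSubmatch(goSource, -1) {
		cited[m[3]] = true
	}
	var missing []string
	for _, name := range want {
		if !cited[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Placeholder source text; the line numbers are made up.
	src := `
// CPython: Parser/lexer/lexer.c:100 tok_nextc
func nextc() {}

// CPython: Parser/lexer/lexer.c:200 tok_backup
func backup() {}
`
	// Prints: [tok_get]
	fmt.Println(missingCitations(src, []string{"tok_nextc", "tok_backup", "tok_get"}))
}
```

A citation only proves that a counterpart exists. Whether it diverges from CPython still needs a side-by-side read.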

Phase 2: replace the _tokenize stub

module/_tokenize/module.go raises NotImplementedError on every call. Port Python/Python-tokenize.c end-to-end: TokenizerIter_Type, tokenizeriter_new, tokenizeriter_next, the helpers that materialize TokenInfo tuples, and the readline / encoding plumbing. module/tokenize/ (next phase) drives this iterator directly.
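
The iterator contract this phase ports can be sketched in plain Go. Everything below is hypothetical and independent of gopy's object model: TokenInfo mirrors the five fields of Python's tokenize.TokenInfo (type, string, start, end, line), tokenSource stands in for the lexer, and the token type constants are illustrative values, not CPython's.

```go
package main

import "fmt"

// Pos is a (line, column) pair: 1-based line, 0-based column.
type Pos struct{ Line, Col int }

// TokenInfo mirrors the five fields of Python's tokenize.TokenInfo.
type TokenInfo struct {
	Type   int
	String string
	Start  Pos
	End    Pos
	Line   string
}

// Illustrative token types; these are NOT CPython's numeric values.
const (
	tokNumber = iota + 1
	tokOp
)

// tokenSource is a hypothetical stand-in for the lexer. It returns the next
// token, or ok=false once the input is exhausted.
type tokenSource func() (tok TokenInfo, ok bool)

// TokenizerIter is a hypothetical Go analogue of the iterator object.
type TokenizerIter struct {
	next tokenSource
	done bool
}

// NewTokenizerIter plays the role of the constructor.
func NewTokenizerIter(src tokenSource) *TokenizerIter {
	return &TokenizerIter{next: src}
}

// Next yields one token per call. After exhaustion it keeps reporting
// ok=false and never calls the source again.
func (it *TokenizerIter) Next() (TokenInfo, bool) {
	if it.done {
		return TokenInfo{}, false
	}
	tok, ok := it.next()
	if !ok {
		it.done = true
		return TokenInfo{}, false
	}
	return tok, true
}

// sliceSource adapts a fixed token list to a tokenSource, for demonstration.
func sliceSource(toks []TokenInfo) tokenSource {
	i := 0
	return func() (TokenInfo, bool) {
		if i >= len(toks) {
			return TokenInfo{}, false
		}
		t := toks[i]
		i++
		return t, true
	}
}

func main() {
	line := "1 + 1"
	it := NewTokenizerIter(sliceSource([]TokenInfo{
		{tokNumber, "1", Pos{1, 0}, Pos{1, 1}, line},
		{tokOp, "+", Pos{1, 2}, Pos{1, 3}, line},
		{tokNumber, "1", Pos{1, 4}, Pos{1, 5}, line},
	}))
	for tok, ok := it.Next(); ok; tok, ok = it.Next() {
		fmt.Printf("%d %q %v %v\n", tok.Type, tok.String, tok.Start, tok.End)
	}
}
```

The sticky done flag is the part worth keeping in a real port: once iteration ends, later calls must not touch the underlying lexer state again.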

Phase 3: vendor Lib/keyword.py, Lib/tokenize.py, Lib/tabnanny.py

The Python layer is a verbatim vendoring under module/keyword/, module/tokenize/, module/tabnanny/ (following the standing rule: "module ports under module/, name = CPython public name minus the py prefix"). The Python files stay byte-equal to upstream so future 3.14.x point releases rebase via git diff.
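
The byte-equal requirement is cheap to enforce in a test. Below is a minimal sketch with a hypothetical helper, firstDiff, that reports where a vendored copy first diverges. In real use both slices would come from os.ReadFile on the upstream and vendored paths; the in-memory inputs here are only for illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// firstDiff returns the offset of the first byte at which vendored diverges
// from upstream, or -1 when the two are byte-equal. If one input is a strict
// prefix of the other, the offset is the length of the shorter input.
func firstDiff(upstream, vendored []byte) int {
	if bytes.Equal(upstream, vendored) {
		return -1
	}
	n := len(upstream)
	if len(vendored) < n {
		n = len(vendored)
	}
	for i := 0; i < n; i++ {
		if upstream[i] != vendored[i] {
			return i
		}
	}
	return n
}

func main() {
	upstream := []byte("import token\n")
	// A copy whose line endings were rewritten on checkout.
	vendored := []byte("import token\r\n")

	fmt.Println(firstDiff(upstream, upstream)) // prints -1
	fmt.Println(firstDiff(upstream, vendored)) // prints 12
}
```

Reporting an offset rather than a bare boolean makes line-ending or trailing-whitespace rewrites easy to spot.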

Phase 4: land the gate tests

For each of the five tests:

  1. Copy the test file from ~/github/python/cpython/Lib/test/ into test/cpython/ verbatim.
  2. Run it through test/regrtest.
  3. If green, mark the 1700 panel row done and move to the next.
  4. If red, fix the divergence in parser/lexer/, module/_tokenize/, or the vendored Python file. Never edit the test.

Phase 5: flip 1700

Once every gate is green, flip task #484 ("test e2e v0.5.5 — lexer panel") to done and update the 1700 checklist row.

Sub-system blockers (DFS)

The four pending gate rows each depend on a chain of sub-system gaps outside the lexer/tokenizer scope. Closing 1710 means walking each chain depth-first and porting whatever's missing until the gate runs green. Status legend: DONE = landed and verified, WIP = in progress, TODO = not started, BLOCKED = waiting on a larger sub-system spec.

test_tokenize.py chain

| # | Task | Sub-system | Surface | Status | Commit |
|---|---|---|---|---|---|
| 1 | T1 | numbers/long | int.__pow__(int, neg_int) returns float; float __pow__ slot wired | DONE | 5d9c85d |
| 2 | T1.5 | VM attr machinery | AttrDictHolder lets C-port subclasses carry an instance dict; _random.RandomObject opts in | DONE | 7d9e729 |
| 3 | T1.6 | module/os | bind os.fsdecode + os.fsencode on the inittab module | DONE | 9bd4675 |
| 4 | T1.7 | stdlib vendor | byte-equal Lib/bisect.py and Lib/tempfile.py under stdlib/ | DONE | 4350edf |
| 5 | T6 | asyncio | unittest.mock imports asyncio; full port tracked in spec 1711 | BLOCKED | |
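
The T1 rule (an int raised to a negative int yields a float) can be illustrated outside the VM. This is a sketch under stated assumptions, not the gopy slot: intPow is a hypothetical function, operands are limited to int64 for brevity, and the error text is a placeholder rather than CPython's message.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"math/big"
)

// intPow is a hypothetical model of int ** int. A non-negative exponent
// keeps the result an arbitrary-precision integer (*big.Int). A negative
// exponent switches the result to float64, as in Python where 2 ** -2
// evaluates to 0.25.
func intPow(base, exp int64) (any, error) {
	if exp >= 0 {
		return new(big.Int).Exp(big.NewInt(base), big.NewInt(exp), nil), nil
	}
	if base == 0 {
		// Python raises ZeroDivisionError here; the message is a placeholder.
		return nil, errors.New("zero raised to a negative power")
	}
	return math.Pow(float64(base), float64(exp)), nil
}

func main() {
	for _, c := range [][2]int64{{3, 2}, {2, -2}, {0, -1}} {
		v, err := intPow(c[0], c[1])
		fmt.Printf("%d ** %d -> %v (%T), err=%v\n", c[0], c[1], v, v, err)
	}
}
```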

test_utf8source.py chain

Suite runs end-to-end; 1/3 sub-tests green. The remaining two fail in unrelated sub-systems:

| # | Task | Sub-system | Surface | Status | Commit |
|---|---|---|---|---|---|
| 1 | T2 | builtin compile() + str.encode | accept bytes / bytearray (route through lexer.FromBytes); str.encode honors its encoding arg via codecs.Encode | DONE | 9d03f23 |
| 2 | T3 | test fixtures | vendor Lib/test/tokenizedata/ (bad_coding*, badsyntax_, coding20731, tokenize_tests-) under stdlib/test/tokenizedata/ | DONE | 0c3da66 |
| 3 | T4 | module/sys | bind sys.exit + setrecursionlimit + getrecursionlimit + getrefcount on the inittab sys module via CurrentThreadHook | DONE | 7e5bc6d |
| 4 | T3.1 | lexer non-utf-8 check | lexer.ValidateUTF8 flags the first non-utf-8 byte and the parser surfaces a SyntaxError so badsyntax_pep3120 raises at import. Also added a Sequence.Contains slot for str so the test's 'utf-8' in msg.lower() substring check works. | DONE | 6db8913 |
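
The T3.1 check amounts to locating the first byte that does not start a valid UTF-8 sequence. The sketch below shows one way to do that with the standard unicode/utf8 package. firstInvalidUTF8 is a hypothetical name, and lexer.ValidateUTF8's real signature and return values are not shown in this spec.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidUTF8 returns the byte offset of the first byte that does not
// begin a valid UTF-8 sequence, or -1 if src is entirely valid UTF-8.
func firstInvalidUTF8(src []byte) int {
	for i := 0; i < len(src); {
		r, size := utf8.DecodeRune(src[i:])
		// DecodeRune reports an invalid encoding as (RuneError, 1).
		// A genuine U+FFFD in the input decodes with size 3 instead.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	ok := []byte("# café\n")       // é encoded as UTF-8 (two bytes)
	bad := []byte("# caf\xe9\n")   // é as a single Latin-1 byte

	fmt.Println(firstInvalidUTF8(ok))  // prints -1
	fmt.Println(firstInvalidUTF8(bad)) // prints 5
}
```

Returning the offset lets the caller turn it into a line and column for the SyntaxError.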

test_source_encoding.py chain

| # | Task | Sub-system | Surface | Status | Commit |
|---|---|---|---|---|---|
| 1 | T5.1 | stdlib vendor | vendor Lib/opcode.py (122 lines) plus C-port the _opcode inittab module (has_arg/has_const/has_name/has_jump/has_free/has_local/has_exc, get_nb_ops, intrinsic + special-method name lists). _opcode_metadata.py lands as a verbatim vendor since it's pure-Python data. stack_effect / get_executor ship as documented stubs (they're never called during opcode.py or dis.py import). | DONE | 2512db3 |
| 2 | T5.2 | stdlib vendor | vendor Lib/dis.py (1157 lines) verbatim, depends on T5.1. Also widens module/_collections _tuplegetter so __doc__ is writable (matches CPython tuplegetter_members PyMemberDef flags=0), which dis.py:314 exercises. | DONE | 7f352c2 |
| 3 | T5.3 | stdlib vendor | minimal-shim stdlib/importlib/__init__.py + stdlib/importlib/machinery.py. Only SOURCE_SUFFIXES, BYTECODE_SUFFIXES, EXTENSION_SUFFIXES, all_suffixes(), and ModuleSpec are observable from inspect.py; the full bootstrap port is deferred. | DONE | eb13f02 |
| 4 | T5.4 | stdlib vendor | vendor Lib/inspect.py (3409 lines) verbatim, depends on T5.1–T5.3. Two runtime gaps surfaced at import time: (a) type.__dict__["__dict__"] had no entry, so a __dict__ getset descriptor was registered on typeType; (b) _types was missing WrapperDescriptorType, MethodWrapperType, ClassMethodDescriptorType, which now alias to the closest gopy types (method_descriptor / method / classmethod). | DONE | 7e3f024 |

DFS note: T5 was originally one row, but inspect pulls in dis → opcode → _opcode (C module) → _opcode_metadata (generated C module), plus importlib.machinery. The four-step breakdown above matches the actual port order.

test_tabnanny.py chain

| # | Task | Sub-system | Surface | Status | Commit |
|---|---|---|---|---|---|
| 1 | T6 | asyncio | port the asyncio package (event loop, transports, protocols, futures, tasks, streams, subprocess, queues, locks) as its own spec | BLOCKED | |

DFS execution order, smallest fix first: T1 → T1.5 → T1.6 → T1.7 → T4 → T2 → T3 → T3.1 → T5.1 → T5.2 → T5.3 → T5.4 → T6. Each task gets its own commit and an entry in stdtest/MANIFEST.txt when the gate it unblocks lands green.
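
The depth-first walk can be sketched as a post-order traversal that emits each task only after its prerequisites. The dependency edges below are the ones stated in the T5 rows (T5.2 depends on T5.1; T5.4 depends on T5.1–T5.3). dfsOrder is a hypothetical helper, and the "smallest fix first" ordering across unrelated chains is a human judgment that this sketch does not model.

```go
package main

import "fmt"

// dfsOrder returns tasks in an order where every task appears after all of
// its prerequisites. deps maps a task to the tasks it depends on. Roots are
// visited in the order given, prerequisites in the order listed.
func dfsOrder(roots []string, deps map[string][]string) []string {
	var order []string
	seen := map[string]bool{}

	var visit func(task string)
	visit = func(task string) {
		if seen[task] {
			return // already scheduled (also guards against revisits)
		}
		seen[task] = true
		for _, d := range deps[task] {
			visit(d)
		}
		order = append(order, task) // post-order: prerequisites come first
	}

	for _, r := range roots {
		visit(r)
	}
	return order
}

func main() {
	deps := map[string][]string{
		"T5.2": {"T5.1"},
		"T5.4": {"T5.1", "T5.2", "T5.3"},
	}
	// Prints: [T5.1 T5.2 T5.3 T5.4]
	fmt.Println(dfsOrder([]string{"T5.4"}, deps))
}
```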

Out of scope

  • tokenizedata/ test fixtures under Lib/test/tokenizedata/ are in scope only as far as the five gate tests reference them.
  • IDLE's tokenizer fork (Lib/idlelib/) stays out of scope; IDLE is on the 1700 deferred list.
  • The PEG parser layer that consumes tokens (Parser/parser.c and friends) is a separate subsystem and gets its own v0.12.4 spec when its turn comes.