1717. Modules/unicodedata.c full port
Rule
Same as 1704 / 1705 / 1708 / 1709. The deliverable is a Go file (or
files) under module/unicodedata/ whose function list 1:1 covers
Modules/unicodedata.c. The Unicode Character Database tables live
in a generated Go file emitted from CPython 3.14's
Modules/unicodedata_db.h + Modules/unicodename_db.h so the
runtime carries no Python build dependency. Once this spec lands the
stdlib/test/support/os_helper.py:30 import unicodedata resolves,
unicodedata.normalize('NFD', ...) returns CPython-equivalent
strings, and the test_tokenize.py panel row in spec 1710 advances
past the missing-module wall.
Why this spec exists
CPython exposes unicodedata as a built-in C extension. The module
publishes:
- `normalize(form, str)` returning the NFC / NFD / NFKC / NFKD form. Required by `stdlib/test/support/os_helper.py:31`, `stdlib/urllib/parse.py:436`.
- `is_normalized(form, str)` returning True when str is already in the requested form.
- `category(char)` returning the two-letter general category ("Lu", "Ll", "Mn", ...).
- `bidirectional(char)` returning the bidi class.
- `combining(char)` returning the combining-class integer.
- `mirrored(char)` returning 1 if the character has Bidi_Mirrored.
- `east_asian_width(char)` returning the EAW class ("F", "H", "W", "Na", "A", "N"). Required by `stdlib/traceback.py:975`.
- `decimal(char[, default])`, `digit(char[, default])`, `numeric(char[, default])` returning the numeric value.
- `decomposition(char)` returning the canonical/compatibility decomposition as a hex string.
- `name(char[, default])` returning the Unicode 1.0/3.0 character name. Required transitively by `stdlib/re/_parser.py:349`'s `\N{NAME}` handling.
- `lookup(name)` returning the character for a name.
- `unidata_version` constant.
- The `UCD` type with a `ucd_3_2_0` pre-built instance for legacy callers (CPython keeps a Unicode 3.2 snapshot for IDNA).
module/unicodedata/ does not exist yet. Every importer above falls
through to a ModuleNotFoundError, which is exactly what blocks the
spec 1710 test_tokenize.py panel row (the import chain runs
unittest.mock -> pkgutil -> ... -> os_helper.py:30).
Sources of truth
- `Modules/unicodedata.c` (CPython 3.14.5): the function bodies.
- `Modules/unicodedata_db.h`: the auto-generated UCD records, decomposition tables, combining/quickcheck/EAW/numeric arrays.
- `Modules/unicodename_db.h`: the name-lookup phrasebook and trie.
- `Tools/unicode/makeunicodedata.py`: the generator that emits the two `_db.h` files from `Lib/test/cjkencodings/UnicodeData.txt`, `DerivedNormalizationProps.txt`, etc.
The gopy port uses Go's unicode.RangeTables for the generic
category check fall-through (so basic category('a') == 'Ll'
matches without our own table), but every other property has to
come from the CPython UCD tables because Go's unicode package
doesn't expose combining classes, decomposition mappings, or
character names.
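A minimal sketch of that category fall-through, assuming nothing beyond Go's standard `unicode` package. The helper name `goCategory` and the fixed ordering list are illustrative and not taken from the gopy source; the list is walked in a fixed order because `unicode.Categories` is a map and also holds one-letter (and, in newer Go releases, grouped) keys.

```go
package main

import (
	"fmt"
	"unicode"
)

// generalCategories lists the two-letter General_Category abbreviations
// in a fixed order so the lookup is deterministic.
var generalCategories = []string{
	"Lu", "Ll", "Lt", "Lm", "Lo",
	"Mn", "Mc", "Me",
	"Nd", "Nl", "No",
	"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
	"Sm", "Sc", "Sk", "So",
	"Zs", "Zl", "Zp",
	"Cc", "Cf", "Cs", "Co",
}

// goCategory (hypothetical helper) returns the two-letter general category
// of r according to Go's own range tables, or "Cn" when no table matches.
func goCategory(r rune) string {
	for _, name := range generalCategories {
		if table, ok := unicode.Categories[name]; ok && unicode.Is(table, r) {
			return name
		}
	}
	return "Cn" // unassigned or noncharacter
}

func main() {
	for _, r := range []rune{'a', 'A', '0', ' ', 0x0301} {
		fmt.Printf("U+%04X %s\n", r, goCategory(r))
	}
}
```

Go's tables track the Unicode version of the Go toolchain, which may differ from the version CPython ships, so a lookup like this is only a cross-check and not a replacement for the generated records.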
Checklist
| Phase | Title | Status | Commit |
|---|---|---|---|
| P1 | module/unicodedata/ skeleton + imp.AppendInittab + stdlibinit wiring + unidata_version constant | done | 106b099 |
| P2 | Tools/unicodedata_go/ generator: parse CPython's UCD source headers and emit module/unicodedata/data_gen.go (records, decomposition, combining, EAW, numeric, quickcheck) | done | 106b099 |
| P3 | normalize(form, str) for NFC / NFD / NFKC / NFKD. Walks the decomposition tables + canonical ordering algorithm. Unblocks os_helper.py:31. | done | 106b099 |
| P4 | is_normalized(form, str) (quickcheck path + canonical verify) | done | 106b099 |
| P5 | category, bidirectional, combining, mirrored. category falls back to Go's unicode tables when the CPython record returns the default value | done | 106b099 |
| P6 | east_asian_width(char) returning "F/H/W/Na/A/N". Unblocks stdlib/traceback.py:975 for non-ASCII traceback rendering | done | 106b099 |
| P7 | decimal, digit, numeric (with default arg semantics: raise ValueError or return default) | done | a34be17 |
| P8 | Character-name generator: emit name trie from unicodename_db.h. Wire name(char[, default]) and lookup(name). Unblocks \N{NAME} in re/_parser.py | done | d783d9d |
| P9 | decomposition(char) returning the hex form ("<compat> 0020" etc.) | done | 106b099 |
| P10 | UCD type + ucd_3_2_0 legacy instance (Unicode 3.2 snapshot used by IDNA) | done | d48fae8 |
| P11 | Re-run the test_tokenize.py panel row from spec 1710, flip the row to either green or to the next out-of-scope blocker | done | (this commit) |
Phase notes
P1: module skeleton
- New package `module/unicodedata/` with `module.go` carrying `init()` -> `imp.AppendInittab("unicodedata", buildModule)` and `buildModule()` returning an `*objects.Module` with the function table.
- Module-level constants: `unidata_version` (matches the `UNIDATA_VERSION` define from the CPython header we generate from).
- Blank-import line in `stdlibinit/registry.go`.
- Every function starts as a `not implemented` stub that returns a `NotImplementedError`. Subsequent phases swap the stubs for real implementations.
P2: UCD data generator
- `Tools/unicodedata_go/main.go` parses CPython's pre-generated `Modules/unicodedata_db.h` (the C arrays + the index tables) and `Modules/unicodename_db.h`, then emits a Go file with the same data. Parsing the already-generated C tables (rather than re-running CPython's `makeunicodedata.py` against the UCD text files) keeps the generator small and matches the CPython runtime byte-for-byte.
- Output: `module/unicodedata/data_gen.go` plus optionally a `data_name_gen.go` split-out for the name table to keep file sizes manageable.
- Generator is run once and the output is checked in. The runtime build has no Python dependency.
- The `Tools/regen-unicodedata` Makefile target re-runs the generator against the current `$CPYTHON` checkout.
P3: normalize
- Implements the Unicode Normalization Algorithm: full canonical / compatibility decomposition, canonical reordering (sort by Canonical_Combining_Class within each run of nonstarters), then canonical composition for NFC / NFKC.
- Backed by the decomposition tables emitted in P2 (CPython packs these into `decomp_data`, `decomp_index`, `decomp_index2`, `decomp_prefix`).

CPython: Modules/unicodedata.c:884 nfc_nfkc / nfd_nfkd_impl
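The canonical reordering step can be sketched as a stable insertion sort over each run of nonstarters. This is an illustration only: `toyCCC` is a hypothetical three-entry stand-in for the generated combining-class table, and `canonicalReorder` is not the name used in the gopy source.

```go
package main

import "fmt"

// toyCCC is a hypothetical stand-in for the generated combining-class table.
// Code points missing from the map are starters (class 0).
var toyCCC = map[rune]int{
	0x0301: 230, // COMBINING ACUTE ACCENT
	0x0323: 220, // COMBINING DOT BELOW
	0x0327: 202, // COMBINING CEDILLA
}

func ccc(r rune) int { return toyCCC[r] }

// canonicalReorder sorts every run of nonstarters by combining class.
// The sort is stable: marks with equal class keep their relative order,
// and no mark ever moves across a starter.
func canonicalReorder(rs []rune) {
	for i := 1; i < len(rs); i++ {
		c := ccc(rs[i])
		if c == 0 {
			continue // starters never move
		}
		// Bubble rs[i] left past nonstarters with a strictly higher class.
		for j := i; j > 0; j-- {
			prev := ccc(rs[j-1])
			if prev == 0 || prev <= c {
				break
			}
			rs[j-1], rs[j] = rs[j], rs[j-1]
		}
	}
}

func main() {
	rs := []rune{'a', 0x0301, 0x0323}
	canonicalReorder(rs)
	fmt.Printf("%U\n", rs) // dot below (220) now precedes acute (230)
}
```

Runs of combining marks are short in practice, so the quadratic worst case of insertion sort is not a concern here.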
P4: is_normalized
- Quickcheck NFC_QC / NFD_QC / NFKC_QC / NFKD_QC via the bitfield emitted in P2, then a verification pass that normalizes and string-compares.
CPython: Modules/unicodedata.c:819 QuickcheckResult
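A sketch of the quickcheck pass for NFC, following the general quick-check outline from the Unicode normalization annex and not the exact CPython control flow. The `toy` table and every name in the block are hypothetical; the real data comes from the bitfield emitted in P2.

```go
package main

import "fmt"

type qc int

const (
	qcYes qc = iota
	qcMaybe
	qcNo
)

// props pairs a combining class with an NFC quick-check value.
type props struct {
	ccc   int
	nfcQC qc
}

// toy is a hypothetical stand-in for the generated tables.
// Code points missing from the map are starters with NFC_QC=Yes.
var toy = map[rune]props{
	0x0301: {230, qcMaybe}, // COMBINING ACUTE ACCENT
	0x0323: {220, qcMaybe}, // COMBINING DOT BELOW
	0x0340: {230, qcNo},    // COMBINING GRAVE TONE MARK
}

// quickcheckNFC returns qcNo as soon as the string cannot be in NFC,
// qcMaybe when a full normalize-and-compare pass is still needed,
// and qcYes when the string is definitely normalized.
func quickcheckNFC(s string) qc {
	result := qcYes
	lastCCC := 0
	for _, r := range s {
		p := toy[r]
		// Combining marks out of canonical order rule out every form.
		if p.ccc != 0 && lastCCC > p.ccc {
			return qcNo
		}
		switch p.nfcQC {
		case qcNo:
			return qcNo
		case qcMaybe:
			result = qcMaybe
		}
		lastCCC = p.ccc
	}
	return result
}

func main() {
	fmt.Println(quickcheckNFC("abc") == qcYes)
	fmt.Println(quickcheckNFC("a\u0301") == qcMaybe)
}
```

Only the `qcMaybe` outcome needs the verification pass that normalizes and string-compares; the other two outcomes answer `is_normalized` directly.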
P5: category / bidirectional / combining / mirrored
- Read the per-codepoint `_PyUnicode_DatabaseRecord` index from the generated tables. Map the integer fields back to the string category / bidirectional names via the constant arrays CPython ships in `unicodedata_db.h` (`_PyUnicode_CategoryNames`, `_PyUnicode_BidirectionalNames`).
- Falls back to Go's `unicode.Categories` only as a sanity check; the per-codepoint record is authoritative.

CPython: Modules/unicodedata.c:131 unicodedata_UCD_category_impl
P6: east_asian_width
- Same record lookup, but returns the one- or two-letter EAW string ("F", "H", "W", "Na", "A", "N") from `_PyUnicode_EastAsianWidthNames`.

CPython: Modules/unicodedata.c:290 unicodedata_UCD_east_asian_width_impl
P7: numeric
- `decimal(c)` returns the decimal digit value or raises `ValueError`.
- `digit(c)` returns the digit value (broader: tally marks etc.).
- `numeric(c)` returns the numeric value as a Go `float64` wrapped in a `*objects.Float`.
- Backed by the `_PyUnicode_DecimalDigitMapping`, `_PyUnicode_DigitMapping`, `_PyUnicode_NumericValues` arrays.

CPython: Modules/unicodedata.c:97 unicodedata_UCD_decimal_impl
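The default-argument semantics shared by `decimal`, `digit`, and `numeric` can be sketched as follows. This is a simplified model: the default is narrowed to `*int` (in Python it can be any object), `toyDecimal` is a hypothetical two-entry table, and the error stands in for the `ValueError` the real builtin raises.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotDecimal stands in for the Python ValueError("not a decimal").
var errNotDecimal = errors.New("not a decimal")

// toyDecimal is a hypothetical stand-in for the generated decimal mapping.
var toyDecimal = map[rune]int{
	'7':    7,
	0x0663: 3, // ARABIC-INDIC DIGIT THREE
}

// decimalOrDefault returns the decimal value of r. When r has none,
// it returns *def if a default was supplied and an error otherwise.
func decimalOrDefault(r rune, def *int) (int, error) {
	if v, ok := toyDecimal[r]; ok {
		return v, nil
	}
	if def != nil {
		return *def, nil
	}
	return 0, errNotDecimal
}

func main() {
	fallback := -1
	fmt.Println(decimalOrDefault('7', nil))       // value found
	fmt.Println(decimalOrDefault('x', &fallback)) // default used
	fmt.Println(decimalOrDefault('x', nil))       // error
}
```

A nil pointer marks "no default supplied", which keeps a caller-provided default of zero distinguishable from an absent one.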
P8: name / lookup
- Hardest table to port: CPython compresses character names with a word phrasebook + a hash trie keyed by the phrasebook indices. The generator emits both as Go data.
- `name(c)` walks the trie forward to reconstruct the name string; `lookup(name)` hashes the input and walks the inverse table.
- Falls back to the algorithmic ranges (Hangul syllables, CJK Unified Ideographs) hard-coded in CPython: `Modules/unicodedata.c:1004` `hangul_syllables`, `Modules/unicodedata.c:1124` `is_unified_ideograph`.

CPython: Modules/unicodedata.c:1513 unicodedata_UCD_name_impl

CPython: Modules/unicodedata.c:1551 unicodedata_UCD_lookup_impl
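The Hangul fall-back needs no table lookup: a syllable's name is derived arithmetically from its code point using the jamo short names. A minimal sketch of that derivation follows; the function and variable names are illustrative and not the ones in `name.go`.

```go
package main

import "fmt"

const (
	sBase  = 0xAC00 // first precomposed Hangul syllable
	lCount = 19     // leading consonants
	vCount = 21     // vowels
	tCount = 28     // trailing consonants, including "none"
	nCount = vCount * tCount
	sCount = lCount * nCount
)

// Jamo short names in code point order.
var leads = [lCount]string{
	"G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
	"SS", "", "J", "JJ", "C", "K", "T", "P", "H",
}

var vowels = [vCount]string{
	"A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
	"OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I",
}

var trails = [tCount]string{
	"", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB", "LS", "LT",
	"LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C", "K", "T", "P", "H",
}

// hangulName derives the character name of a precomposed Hangul syllable.
// The boolean is false when r lies outside the syllable block.
func hangulName(r rune) (string, bool) {
	idx := int(r) - sBase
	if idx < 0 || idx >= sCount {
		return "", false
	}
	l := idx / nCount
	v := (idx % nCount) / tCount
	t := idx % tCount
	return "HANGUL SYLLABLE " + leads[l] + vowels[v] + trails[t], true
}

func main() {
	name, _ := hangulName(0xD55C)
	fmt.Println(name) // HANGUL SYLLABLE HAN
}
```

`lookup(name)` reverses the same arithmetic: after stripping the `HANGUL SYLLABLE ` prefix it matches the longest lead, vowel, and trail short names in turn and recombines the three indices.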
P9: decomposition
- Returns the printable decomposition: `"<compat> 0020"` for compatibility forms, `"0061 0301"` for canonical. Empty string when the character has no decomposition.

CPython: Modules/unicodedata.c:319 unicodedata_UCD_decomposition_impl
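The printable form can be sketched as an optional tag followed by space-separated, zero-padded uppercase hex code points. The helper `formatDecomposition` and its parameters are hypothetical; the real implementation reads the tag and mapping from the generated decomposition tables.

```go
package main

import (
	"fmt"
	"strings"
)

// formatDecomposition renders a decomposition mapping the way
// unicodedata.decomposition() prints it. tag is the compatibility tag
// without angle brackets ("compat", "noBreak", ...) and is empty for
// canonical decompositions.
func formatDecomposition(tag string, mapping []rune) string {
	parts := make([]string, 0, len(mapping)+1)
	if tag != "" {
		parts = append(parts, "<"+tag+">")
	}
	for _, r := range mapping {
		// At least four hex digits, uppercase, as in UnicodeData.txt.
		parts = append(parts, fmt.Sprintf("%04X", r))
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(formatDecomposition("compat", []rune{0x0020}))
	fmt.Println(formatDecomposition("", []rune{0x0061, 0x0301}))
}
```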
P10: UCD type + ucd_3_2_0
- `unicodedata.UCD` is a type whose instances carry a frozen snapshot of the Unicode database. Most callers use the module-level helpers, which dispatch through the default UCD instance (Unicode 16.0). The legacy `ucd_3_2_0` instance backs IDNA's Unicode 3.2-stable behavior.
- For gopy, the 3.2 snapshot is generated alongside the main data table in P2 (Tools/unicodedata_go reads `unicodedata_db.h`'s change-record list and rolls codepoints back to their pre-3.2 values).

CPython: Modules/unicodedata.c:86 DB_members
P11: panel re-run
- Rebuild gopy, run the corpus entry for `test_tokenize.py`. If unicodedata closes the only remaining gap the row flips green; if a further missing module surfaces (e.g. `_zoneinfo`), spec 1710's panel row stays pending against the new blocker and a follow-up is filed.
- 2026-05-21 re-run: `gopy test/cpython/test_tokenize.py` now loads the suite past `from test.support import os_helper` (which is what `import unicodedata` was blocking). `unittest.main` reaches the `CTokenizeTest` bodies and reports passes / errors / failures until the binary crashes inside `CTokenizeTest.test_string`. That crash is the same token-position parity gap the MANIFEST note already calls out (parser/lexer/lexer.go vs Parser/lexer/lexer.c), not a new unicodedata-side blocker. The corpus row stays pending against the lexer-parity work, and the `v0.12.5` version stamp records that the unicodedata wall is down.
Out of scope
- Anything that needs `UnicodeData.txt` past Unicode 16.0. The generator stays pinned to whatever `$CPYTHON/Modules/unicodedata_db.h` ships at the gopy CPython tag.
- IDNA itself. `unicodedata.ucd_3_2_0` is the only thing IDNA needs from this module; the IDNA codec sits in `stdlib/encodings/idna.py` and is governed by a separate spec.
Function-level audit
To be filled in alongside the implementation phases: each
unicodedata.c function gets a row with its CPython line, the
Go destination function, and a status tick.
| CPython function | CPython line | Go destination | Status |
|---|---|---|---|
| `_getrecord_ex` | 59 | module/unicodedata/module.go getRecord | done |
| `unicodedata_UCD_decimal_impl` | 132 | module/unicodedata/numeric.go decimalBuiltin | done |
| `unicodedata_UCD_digit_impl` | 184 | module/unicodedata/numeric.go digitBuiltin | done |
| `unicodedata_UCD_numeric_impl` | 218 | module/unicodedata/numeric.go numericBuiltin | done |
| `gettyperecord` (unicodectype.c) | 43 | module/unicodedata/numeric.go getTypeRecord | done |
| `_PyUnicode_ToDecimalDigit` (unicodectype.c) | 104 | module/unicodedata/numeric.go decimalBuiltin | done |
| `_PyUnicode_ToDigit` (unicodectype.c) | 121 | module/unicodedata/numeric.go digitBuiltin | done |
| `_PyUnicode_ToNumeric` (unicodetype_db.h) | 4513 | module/unicodedata/type_gen.go numericValues | done |
| `unicodedata_UCD_category_impl` | 264 | module/unicodedata/properties.go categoryBuiltin | done |
| `unicodedata_UCD_bidirectional_impl` | 291 | module/unicodedata/properties.go bidirectionalBuiltin | done |
| `unicodedata_UCD_combining_impl` | 320 | module/unicodedata/properties.go combiningBuiltin | done |
| `unicodedata_UCD_mirrored_impl` | 348 | module/unicodedata/properties.go mirroredBuiltin | done |
| `unicodedata_UCD_east_asian_width_impl` | 375 | module/unicodedata/properties.go eastAsianWidthBuiltin | done |
| `unicodedata_UCD_decomposition_impl` | 415 | module/unicodedata/decomposition.go decompositionBuiltin | done |
| `get_decomp_record` | 488 | module/unicodedata/decomposition.go getDecompRecord | done |
| `nfd_nfkd` | 514 | module/unicodedata/normalize.go nfdNFKD | done |
| `find_nfc_index` | 649 | module/unicodedata/normalize.go findNFCIndex | done |
| `nfc_nfkc` | 665 | module/unicodedata/normalize.go nfcNFKC | done |
| `is_normalized_quickcheck` | 820 | module/unicodedata/normalize.go isNormalizedQuickcheck | done |
| `unicodedata_UCD_is_normalized_impl` | 885 | module/unicodedata/normalize.go isNormalizedBuiltin | done |
| `unicodedata_UCD_normalize_impl` | 953 | module/unicodedata/normalize.go normalizeBuiltin | done |
| `_dawg_decode_varint_unsigned` | 1058 | module/unicodedata/name.go dawgDecodeVarint | done |
| `_dawg_match_edge` | 1075 | module/unicodedata/name.go dawgMatchEdge | done |
| `_dawg_decode_node` | 1107 | module/unicodedata/name.go dawgDecodeNode | done |
| `_dawg_node_is_final` | 1116 | module/unicodedata/name.go dawgNodeIsFinal | done |
| `_dawg_node_descendant_count` | 1124 | module/unicodedata/name.go dawgNodeDescendantCount | done |
| `_dawg_decode_edge` | 1172 | module/unicodedata/name.go dawgDecodeEdge | done |
| `_lookup_dawg_packed` | 1196 | module/unicodedata/name.go lookupDawgPacked | done |
| `_inverse_dawg_lookup` | 1242 | module/unicodedata/name.go inverseDawgLookup | done |
| `_getucname` | 1296 | module/unicodedata/name.go getUCName | done |
| `find_prefix_id` | 1035 | module/unicodedata/name.go findPrefixID | done |
| `find_syllable` | 1378 | module/unicodedata/name.go findSyllable | done |
| `parse_hex_code` | 1411 | module/unicodedata/name.go parseHexCode | done |
| `_getcode` | 1441 | module/unicodedata/name.go getCode | done |
| `unicodedata_UCD_name_impl` | 1552 | module/unicodedata/name.go nameBuiltin | done |
| `unicodedata_UCD_lookup_impl` | 1585 | module/unicodedata/name.go lookupBuiltin | done |
| Hangul syllables table | 1004 | module/unicodedata/name_gen.go hangulSyllables | done |
| `change_record` struct | 45 | module/unicodedata/ucd.go changeRecord | done |
| `PreviousDBVersion` struct | 73 | module/unicodedata/ucd.go UCD | done |
| `new_previous_version` | 97 | module/unicodedata/ucd.go newUCD | done |
| `get_change_3_2_0` (unicodedata_db.h) | 8336 | module/unicodedata/ucd.go getChange320 | done |
| `normalization_3_2_0` (unicodedata_db.h) | 8347 | module/unicodedata/ucd.go normalize320Func | done |
| `PyInit_unicodedata` | 1734 | module/unicodedata/module.go buildModule | done |