Codec registry

Codecs convert between sequences of bytes and sequences of code points. Python's codec system is centralised in Lib/codecs.py and Lib/encodings/; a Python runtime must reproduce both the API and the byte-equal behaviour of every codec.

Source-of-record: Lib/codecs.py, Lib/encodings/, Python/codecs.c, codecs module docs.

Codec functions

The public functions in codecs:

| Function | Description |
| --- | --- |
| codecs.encode(obj, encoding='utf-8', errors='strict') | Returns the encoded bytes. |
| codecs.decode(obj, encoding='utf-8', errors='strict') | Returns the decoded str. |
| codecs.lookup(encoding) | Returns a CodecInfo object (a tuple subclass with named attributes). |
| codecs.lookup_error(name) | Returns the error handler registered under name. |
| codecs.register(search_function) | Adds a codec search function. |
| codecs.register_error(name, fn) | Registers an error handler. |
| codecs.open(filename, mode='r', encoding=None, errors='strict', buffering=-1) | Opens a file through a codec. |
| codecs.iterencode(iter, encoding, errors='strict') | Streaming encode over an iterable. |
| codecs.iterdecode(iter, encoding, errors='strict') | Streaming decode over an iterable. |
| codecs.EncodedFile(file, data_encoding, file_encoding=None, errors='strict') | Returns a transparent transcoding wrapper. |
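
A minimal sketch of the module-level functions above, run against CPython's standard library. The sample strings are arbitrary; nothing here is specific to gopy.

```python
import codecs

text = "naïve café"

# codecs.encode / codecs.decode mirror str.encode / bytes.decode for text encodings.
encoded = codecs.encode(text, "utf-8")
decoded = codecs.decode(encoded, "utf-8")

# lookup() accepts aliases and reports the canonical codec name.
info = codecs.lookup("UTF8")

# iterencode / iterdecode process an iterable of chunks lazily.
chunks = list(codecs.iterencode(["naï", "ve"], "utf-8"))
joined = "".join(codecs.iterdecode(chunks, "utf-8"))

print(encoded, decoded, info.name, joined)
```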

CodecInfo

codecs.lookup(name) returns:

CodecInfo(encode, decode, streamreader, streamwriter,
          incrementalencoder, incrementaldecoder,
          name)

| Field | Signature |
| --- | --- |
| encode | (str, errors='strict') -> (bytes, length_consumed). |
| decode | (bytes, errors='strict') -> (str, length_consumed). |
| streamreader | Reader(stream, errors='strict') class. |
| streamwriter | Writer(stream, errors='strict') class. |
| incrementalencoder | IncrementalEncoder(errors='strict') class. |
| incrementaldecoder | IncrementalDecoder(errors='strict') class. |
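
As an illustration of the fields above, the following sketch looks up UTF-8 and calls the stateless encode and decode functions directly. Note that each returns a pair, not a bare result.

```python
import codecs

info = codecs.lookup("utf-8")

# Stateless functions return (output, length_consumed).
data, consumed_chars = info.encode("héllo")
text, consumed_bytes = info.decode(data)

# The factory attributes are instantiated once per stream or per conversion.
decoder = info.incrementaldecoder(errors="strict")
piece = decoder.decode(b"h")

print(data, consumed_chars, text, consumed_bytes, piece)
```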

Error handlers

| Name | Behaviour |
| --- | --- |
| 'strict' | Raise UnicodeError (default). |
| 'ignore' | Skip the offending sequence. |
| 'replace' | Replace with U+FFFD (decode) or ? (encode). |
| 'xmlcharrefreplace' | Encode: &#NNNN;. |
| 'backslashreplace' | Encode: \xNN, \uNNNN, \UNNNNNNNN. Decode: \xNN. |
| 'surrogateescape' | Encode/decode: round-trip bytes 0x80-0xFF through U+DC80-U+DCFF. |
| 'surrogatepass' | UTF codecs: allow lone surrogates. |
| 'namereplace' | Encode: \N{NAME}. |
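
The built-in handlers can be compared side by side. This sketch encodes one sample string to ASCII with each encode-side handler, then decodes an invalid UTF-8 byte with the decode-side ones; the sample text is arbitrary.

```python
text = "café €"

# Encoding to ASCII with each encode-side handler.
encoded = {
    name: text.encode("ascii", errors=name)
    for name in ("ignore", "replace", "xmlcharrefreplace",
                 "backslashreplace", "namereplace")
}

# Decoding invalid UTF-8 (a lone 0xE9 byte) with decode-side handlers.
bad = b"caf\xe9"
replaced = bad.decode("utf-8", errors="replace")
escaped = bad.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

# surrogatepass lets a UTF codec serialise a lone surrogate.
lone = "\ud800".encode("utf-8", errors="surrogatepass")

for name, value in encoded.items():
    print(f"{name:18} {value!r}")
```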

A custom handler is a callable fn(exc: UnicodeError) -> (replacement_str_or_bytes, new_index).
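
A custom handler following that contract might look like the sketch below. The handler name underscore_replace and its behaviour are hypothetical, chosen only for illustration.

```python
import codecs


def underscore_replace(exc):
    """Hypothetical handler: substitute '_' for each unencodable character."""
    if isinstance(exc, UnicodeEncodeError):
        # Return the replacement and the index at which encoding resumes.
        return ("_" * (exc.end - exc.start), exc.end)
    # This sketch handles encoding only; re-raise anything else.
    raise exc


codecs.register_error("underscore_replace", underscore_replace)

out = "café €".encode("ascii", errors="underscore_replace")
same_handler = codecs.lookup_error("underscore_replace") is underscore_replace
print(out)
```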

Standard codecs

Unicode codecs

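| Name | Aliases | Description |
| --- | --- | --- |
| utf-8 | U8, UTF, utf8 | UTF-8. |
| utf-8-sig | - | UTF-8 with BOM. |
| utf-16 | U16, utf16 | UTF-16 with BOM. |
| utf-16-be / utf-16-le | - | UTF-16 explicit endian, no BOM. |
| utf-32 | U32, utf32 | UTF-32 with BOM. |
| utf-32-be / utf-32-le | - | UTF-32 explicit endian, no BOM. |
| utf-7 | - | UTF-7. |
| ascii | 646, us-ascii | ASCII. |
| latin-1 | iso-8859-1, 8859, cp819 | Latin-1. |
| unicode-escape | - | Python source unicode escapes. |
| raw-unicode-escape | - | Latin-1 with only \uXXXX and \UXXXXXXXX escapes; other backslash sequences are left untouched. |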
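
The BOM and escape behaviour in this table is easy to check directly. The sketch below assumes CPython semantics; the byte order of the plain utf-16 BOM depends on the platform, so the code does not assume either one.

```python
# utf-16 writes a BOM in native byte order; the explicit-endian variants do not.
with_bom = "A".encode("utf-16")
little = "A".encode("utf-16-le")
big = "A".encode("utf-16-be")

# utf-8-sig writes the BOM on encode and strips it on decode; plain utf-8 keeps it.
sig = "A".encode("utf-8-sig")
stripped = sig.decode("utf-8-sig")
kept = sig.decode("utf-8")

# unicode-escape escapes every non-ASCII or non-printable character;
# raw-unicode-escape passes Latin-1 characters through and escapes only the rest.
sample = "é\n€"
full = sample.encode("unicode-escape")
raw = sample.encode("raw-unicode-escape")

print(with_bom, little, big, sig, full, raw)
```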

Byte-string codecs

| Name | Description |
| --- | --- |
| idna | RFC 3490 internationalised domain names. |
| mbcs | Windows ANSI codepage. |
| oem | Windows OEM codepage. |
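
A short sketch of the idna codec on CPython, using an arbitrary example domain. mbcs and oem exist only on Windows, so they are left out here.

```python
# Each non-ASCII label is converted to its ASCII-compatible "xn--" form.
domain = "bücher.example"
wire = domain.encode("idna")
back = wire.decode("idna")

print(wire, back)
```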

Codepage codecs

The cp* family is large; selected entries:

| Name | Description |
| --- | --- |
| cp037 | EBCDIC US. |
| cp437 | DOS US. |
| cp720 | Arabic. |
| cp737 | Greek. |
| cp775 | Baltic. |
| cp850 / cp852 / cp855 / cp857 / cp858 / cp860 / cp861 / cp862 / cp863 / cp864 / cp865 / cp866 / cp869 | Various DOS code pages. |
| cp874 | Thai. |
| cp932 | Japanese Shift-JIS. |
| cp949 | Korean. |
| cp950 | Traditional Chinese. |
| cp1140 | EBCDIC Western Europe (cp037 with the euro sign). |
| cp1250 ... cp1258 | Windows codepages. |
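
Codepage codecs are single-byte lookup tables, so the same byte means different things under different pages. A small sketch with arbitrarily chosen sample values:

```python
# EBCDIC places letters at different positions from ASCII.
ebcdic_a = "A".encode("cp037")

# 'é' sits at different byte values in the DOS and Windows code pages.
dos_e_acute = "é".encode("cp437")
win_e_acute = "é".encode("cp1252")

# The same byte therefore decodes to different characters.
as_dos = b"\xe9".decode("cp437")
as_win = b"\xe9".decode("cp1252")

# cp1252 assigns the euro sign to 0x80.
euro = "€".encode("cp1252")

print(ebcdic_a, dos_e_acute, win_e_acute, as_dos, as_win, euro)
```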

Asian codecs

| Name | Description |
| --- | --- |
| big5, big5hkscs | Traditional Chinese. |
| gb2312, gbk, gb18030 | Simplified Chinese. |
| euc-jp, euc-jis-2004, euc-jisx0213 | EUC-JP family. |
| euc-kr | EUC-KR. |
| iso-2022-jp, iso-2022-jp-1, iso-2022-jp-2, iso-2022-jp-2004, iso-2022-jp-3, iso-2022-jp-ext | ISO-2022-JP variants. |
| iso-2022-kr | ISO-2022-KR. |
| shift-jis, shift-jisx0213, shift-jis-2004 | Shift-JIS variants. |
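
For comparison, the sketch below round-trips one Japanese string through three of these codecs on CPython. It illustrates the reference behaviour a port would need to match and makes no claim about gopy's current coverage.

```python
text = "日本語"

# Encode the same text with three Japanese codecs and decode it back.
round_trips = {}
for name in ("shift-jis", "euc-jp", "iso-2022-jp"):
    data = text.encode(name)
    round_trips[name] = (data, data.decode(name))

# Shift-JIS and EUC-JP use two bytes per kanji; ISO-2022-JP is stateful
# and switches character sets with ESC sequences.
hiragana_sjis = "あ".encode("shift-jis")
hiragana_euc = "あ".encode("euc-jp")

for name, (data, _) in round_trips.items():
    print(f"{name:12} {data!r}")
```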

Special non-text codecs

| Name | Function |
| --- | --- |
| base64_codec | RFC 3548 base64. |
| bz2_codec | bzip2. |
| hex_codec | Hex. |
| quopri_codec | Quoted-printable. |
| rot_13 | ROT-13 (str -> str). |
| uu_codec | UUEncode. |
| zlib_codec | zlib. |

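These are reached through codecs.encode(b, 'hex_codec') and codecs.decode. They cannot be used with str.encode or bytes.decode, which raise LookupError for them, because they are not text encodings: they map bytes to bytes (or, for rot_13, str to str) rather than converting between str and bytes.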
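
A brief sketch of the non-text codecs on CPython, including the rejection by bytes.decode. The payloads are arbitrary.

```python
import codecs

# bytes -> bytes transforms go through codecs.encode / codecs.decode.
hexed = codecs.encode(b"hi", "hex_codec")
unhexed = codecs.decode(hexed, "hex_codec")
b64 = codecs.encode(b"hi", "base64_codec")

payload = b"aaaa" * 100
packed = codecs.encode(payload, "zlib_codec")
unpacked = codecs.decode(packed, "zlib_codec")

# rot_13 is str -> str.
rotated = codecs.encode("Hello", "rot_13")

# bytes.decode accepts text encodings only.
try:
    b"6869".decode("hex_codec")
    rejected = False
except LookupError:
    rejected = True

print(hexed, b64, rotated, len(packed), rejected)
```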

Incremental codecs

IncrementalEncoder and IncrementalDecoder classes provide state for streaming:

| Method | Role |
| --- | --- |
| encode(input, final=False) | Process a chunk; buffer state. |
| decode(input, final=False) | Same for decoding. |
| reset() | Drop buffered state. |
| getstate() / setstate(state) | Serialise buffer. |

final=True flushes any buffered state.
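
The buffering can be seen by feeding a multi-byte character one byte at a time. This is a sketch against CPython's UTF-8 incremental decoder; the exact shape of the getstate() value is codec-specific.

```python
import codecs

info = codecs.lookup("utf-8")
decoder = info.incrementaldecoder(errors="strict")

euro = "€".encode("utf-8")           # three bytes: e2 82 ac

first = decoder.decode(euro[:1])     # incomplete sequence is buffered
state = decoder.getstate()           # (buffered bytes, extra state)
second = decoder.decode(euro[1:2])
third = decoder.decode(euro[2:], final=True)

# With final=True an incomplete sequence is an error rather than buffered.
decoder.reset()
try:
    decoder.decode(euro[:2], final=True)
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

print(repr(first), state, repr(second), repr(third), truncated_ok)
```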

Stream classes

StreamReader and StreamWriter wrap a binary stream:

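| Method | Role |
| --- | --- |
| read(size=-1) | Read and decode; size is an approximate limit on how much is read, not an exact character count. |
| readline(size=None) | Read one decoded line. |
| readlines(sizehint=None) | Read and decode all remaining lines. |
| write(str) | Encode and write. |
| writelines(strings) | Encode and write each string. |
| seek / tell | Forwarded to underlying stream. |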
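
A minimal sketch of both stream classes over an in-memory binary stream. io.BytesIO stands in for a real file here.

```python
import codecs
import io

info = codecs.lookup("utf-8")
raw = io.BytesIO()

# StreamWriter encodes str and writes the bytes to the underlying stream.
writer = info.streamwriter(raw)
writer.write("first línea\n")
writer.writelines(["second\n", "third\n"])

# StreamReader reads bytes from the underlying stream and decodes them.
raw.seek(0)
reader = info.streamreader(raw)
first = reader.readline()
rest = reader.readlines()

print(raw.getvalue(), repr(first), rest)
```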

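Codec lookup is driven by search functions registered with codecs.register. A search function takes a normalised encoding name (lowercase, with hyphens and spaces converted to underscores) and returns a CodecInfo or None. Search functions are tried in registration order. The built-in search function loads the codec from encodings.<name> via the import system.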
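
A search function might be sketched as follows. The codec name demo_utf8 and the function demo_search are hypothetical; the sketch simply re-exports UTF-8 under a new name. The comparison tolerates a hyphen because older CPython versions did not convert hyphens before calling search functions.

```python
import codecs


def demo_search(name):
    """Hypothetical search function exposing UTF-8 as 'demo_utf8'."""
    if name.replace("-", "_") != "demo_utf8":
        return None  # not ours; let the next search function try
    utf8 = codecs.lookup("utf-8")
    return codecs.CodecInfo(
        encode=utf8.encode,
        decode=utf8.decode,
        streamreader=utf8.streamreader,
        streamwriter=utf8.streamwriter,
        incrementalencoder=utf8.incrementalencoder,
        incrementaldecoder=utf8.incrementaldecoder,
        name="demo_utf8",
    )


codecs.register(demo_search)

# The lookup name is normalised before it reaches demo_search.
data = "héllo".encode("Demo-UTF8")
found = codecs.lookup("demo_utf8").name
print(data, found)
```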

Gopy status

| Area | State |
| --- | --- |
| codecs module surface | Complete. |
| Built-in unicode codecs (UTF-8/16/32, ASCII, Latin-1) | Complete. |
| Codepage codecs | Complete for the common subset; rare codepages may fall back. |
| Asian codecs | Partial. |
| Non-text codecs (hex, base64, zlib) | Complete. |
| Error handler protocol | Complete. |
| Incremental codecs | Complete. |
| Stream classes | Complete. |
| surrogateescape and surrogatepass | Complete. |

Reference

  • CPython 3.14: codecs module.
  • Lib/codecs.py, Lib/encodings/. Python side.
  • Python/codecs.c. C runtime side.
  • codecs/, module/_codecs/. gopy's port.