
Lexical analysis

A Python program is a sequence of logical lines. The lexer converts the raw byte stream into a token stream that the parser consumes. This page documents every token kind, every literal form, and the rules that govern how characters become tokens.

Source-of-record: Parser/tokenizer/, Parser/lexer/lexer.c, and the CPython lexical analysis chapter.

Source encoding

By default a source file is UTF-8. A different encoding can be declared on the first or second line:

# -*- coding: latin-1 -*-

The encoding declaration matches the regex coding[=:]\s*([-\w.]+). A file with a BOM is treated as UTF-8; combining a BOM with a non-UTF-8 declaration is an error.
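The same detection is available from Python through the stdlib `tokenize` module; a minimal sketch (note that `tokenize.detect_encoding` normalises some encoding names):

```python
# Sketch: detecting a source file's declared encoding with the stdlib
# tokenize module, which applies the same rules the interpreter uses.
import io
import tokenize

default = b"print(1)\n"                           # no declaration: UTF-8
declared = b"# -*- coding: latin-1 -*-\nx = 1\n"  # explicit declaration

encoding, _lines = tokenize.detect_encoding(io.BytesIO(declared).readline)
print(tokenize.detect_encoding(io.BytesIO(default).readline)[0])  # utf-8
print(encoding)  # the declared name, normalised (e.g. iso-8859-1)
```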

Line structure

| Concept | Definition |
| --- | --- |
| Logical line | One or more physical lines joined by implicit or explicit continuation. |
| Physical line | A sequence of characters terminated by a line break. |
| Line break | `\n`, `\r\n`, or `\r`; normalised to `\n` internally. |
| Blank line | Contains only spaces, tabs, formfeeds, or a comment. |
| Comment | `#` to end of line. |
| Explicit join | A line ending in a backslash (outside a string literal) joins to the next. |
| Implicit join | Lines inside `()`, `[]`, or `{}` join automatically. |

Indentation

Indentation produces INDENT and DEDENT tokens. The lexer keeps a stack of indentation widths; the leading whitespace of each logical line is measured and compared against the top of the stack.

Rules:

- Tabs are replaced as if they advanced to the next multiple of 8 columns.
- Mixing tabs and spaces in a way that produces ambiguous depth is an error.
- Indentation only matters at the start of a logical line.
- Blank and comment-only lines do not affect the indent stack.
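The effect of these rules on the token stream can be observed with the stdlib `tokenize` module (a sketch against CPython's tokenizer, not Gopy's):

```python
# Sketch: watching INDENT/DEDENT appear in the token stream.
import io
import tokenize

src = "if x:\n    y = 1\nz = 2\n"
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(names)  # includes INDENT before 'y' and DEDENT before 'z'
```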

Whitespace between tokens

Spaces, tabs, and formfeed characters separate tokens but produce no token of their own. They are required where two adjacent tokens would otherwise merge (`if x` vs `ifx`).

Identifiers

Identifiers follow Unicode UAX-31. The full pattern is `xid_start xid_continue*`, with the additions Python documents:

| Group | Characters |
| --- | --- |
| xid_start | ASCII letters, `_`, and Unicode letters with the XID_Start property. |
| xid_continue | Everything in xid_start, plus ASCII digits and Unicode digits and marks. |

Identifiers are NFKC-normalised before comparison: `Python` and its full-width spelling `Ｐｙｔｈｏｎ` compare equal.
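A minimal sketch of the normalisation step, using the stdlib `unicodedata` module:

```python
# Sketch: the NFKC step that collapses exotic spellings of an
# identifier to one canonical name.
import unicodedata

fullwidth = "Ｐｙｔｈｏｎ"  # full-width compatibility characters
canonical = unicodedata.normalize("NFKC", fullwidth)
print(canonical)  # Python
```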

Reserved classes

| Class | Form | Notes |
| --- | --- | --- |
| Public | `name` | Unrestricted use. |
| Soft private | `_name` | Convention: not exported by `from x import *`. |
| Class private | `__name` | Name-mangled inside a class body. |
| Dunder | `__name__` | Reserved by the interpreter and stdlib. |

Keywords

Hard keywords. The lexer rejects these as identifiers in every context:

False None True and as assert
async await break class continue def
del elif else except finally for
from global if import in is
lambda nonlocal not or pass raise
return try while with yield

Soft keywords

Soft keywords are reserved only in specific syntactic positions:

| Soft keyword | Reserved in |
| --- | --- |
| `match` | The head of a `match` statement. |
| `case` | The head of a `case` clause inside `match`. |
| `type` | The `type` statement (PEP 695 type alias). |
| `_` | A wildcard pattern in `case _:`. |

Soft keywords remain usable as identifiers elsewhere.

Literals

String and bytes prefixes

| Prefix | Kind | Allows escapes | Notes |
| --- | --- | --- | --- |
| `''` / `""` | str | yes | |
| `r''` / `R""` | raw str | no | Backslashes are literal. |
| `b''` / `B""` | bytes | yes | ASCII-only payload. |
| `rb''` / `Rb''` / `bR''` / `BR''` | raw bytes | no | |
| `f''` / `F""` | f-string | yes | Interpolation with `{expr}`. |
| `rf''` / `fr''` / `Rf''` / `fR''` | raw f-string | no | |
| `t''` / `T""` | t-string (PEP 750) | yes | Template literal. |
| `rt''` / `tr''` / `Rt''` / `tR''` | raw t-string | no | |
| `u''` / `U""` | str | yes | Accepted only as a Python 2 holdover; no effect. |

Adjacent string literals concatenate at compile time: `"foo" "bar" == "foobar"`. The literals must agree on kind: str and f-string may mix, but str and bytes may not.
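A short illustration of the concatenation rule:

```python
# Sketch: adjacent literals are joined by the compiler, not at runtime.
s = "foo" "bar"            # no '+' involved
m = "answer: " f"{6 * 7}"  # str and f-string may mix
print(s, "|", m)
# "a" b"b" would be a SyntaxError: str and bytes cannot mix
```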

String quote forms

| Form | Notes |
| --- | --- |
| `'...'` | Single-line, single-quoted. |
| `"..."` | Single-line, double-quoted. |
| `'''...'''` | Multi-line, triple-quoted with single quotes. |
| `"""..."""` | Multi-line, triple-quoted with double quotes. |

Escape sequences

Within str and bytes (without r prefix):

| Escape | Meaning |
| --- | --- |
| `\<newline>` | Line continuation. |
| `\\` | Backslash. |
| `\'` | Single quote. |
| `\"` | Double quote. |
| `\a` | Bell. |
| `\b` | Backspace. |
| `\f` | Formfeed. |
| `\n` | Linefeed. |
| `\r` | Carriage return. |
| `\t` | Horizontal tab. |
| `\v` | Vertical tab. |
| `\0` ... `\777` | Octal value (one to three digits). |
| `\xhh` | Hex value (exactly two hex digits). |
| `\uxxxx` | 4-digit Unicode escape; str only. |
| `\Uxxxxxxxx` | 8-digit Unicode escape; str only. |
| `\N{NAME}` | Unicode character by name; str only. |
| `\<other>` | Unrecognized: the backslash is kept, and a SyntaxWarning (DeprecationWarning before 3.12) is emitted. |
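A few of these forms, evaluated (a sketch):

```python
# Sketch: several escape forms and the characters they denote.
print("\x41", "\101", "\u0041")  # three spellings of 'A'
print("\N{BULLET}")              # named escape for U+2022
print(b"\xff")                   # a single byte, value 255
```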

f-string interpolation

Within f'':

| Form | Meaning |
| --- | --- |
| `{expr}` | Format `expr` via `format(expr, '')`. |
| `{expr!r}` / `!s` / `!a` | Apply `repr` / `str` / `ascii` first. |
| `{expr:spec}` | Pass `spec` to `__format__`. |
| `{expr=}` | Emit `expr=<value>` for debugging. |
| `{{` / `}}` | Literal `{` / `}`. |

The contents of {...} may contain nearly any expression including nested f-strings (PEP 701). Backslashes inside {...} are allowed in 3.12+.
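A sketch of these forms (the nested example uses distinct quote characters so it also runs before 3.12; PEP 701 additionally permits reusing the outer quotes):

```python
# Sketch: conversion, format spec, debug '=' form, and nesting.
value = 255
print(f"{value!r}")     # conversion: repr first
print(f"{value:#06x}")  # spec passed to __format__ -> 0x00ff
print(f"{value=}")      # value=255
print(f"{f'{value}'}")  # a nested f-string
```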

t-string interpolation (PEP 750)

t'' produces a Template object whose .strings tuple holds the literal parts and whose .values tuple holds the evaluated interpolation values (the expression text and other metadata ride along on the interpolation objects). Format specifiers inside {...} are not applied eagerly; they are part of the template metadata for the consumer to interpret.

Numeric literals

| Form | Type | Examples |
| --- | --- | --- |
| Decimal integer | int | `0`, `42`, `1_000_000` |
| Binary integer | int | `0b1010`, `0B1_0_1_0` |
| Octal integer | int | `0o755`, `0O755` |
| Hex integer | int | `0xff`, `0XFF` |
| Float | float | `1.5`, `.5`, `1.`, `1e10`, `1.5e-2` |
| Imaginary | complex | `1j`, `1J`, `1.5j` |

Underscore separators are allowed between digits and immediately after a base prefix (0x_FF). Leading zeros on a nonzero decimal integer (010) are a syntax error.
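These forms can be checked directly (a sketch):

```python
# Sketch: several literal spellings of the same values.
assert 0b1010 == 0o12 == 0xA == 10
assert 1_000_000 == 10 ** 6
assert 0x_FF == 255   # an underscore may follow the base prefix
assert 1.5e-2 == 0.015
assert 1j * 1j == -1  # imaginary literals build complex values
print("all literal checks passed")
```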

Operators and delimiters

| Group | Tokens |
| --- | --- |
| Arithmetic | `+`, `-`, `*`, `/`, `//`, `%`, `**`, `@` |
| Comparison | `==`, `!=`, `<`, `>`, `<=`, `>=` |
| Bitwise | `&`, `\|`, `^`, `~`, `<<`, `>>` |
| Assignment | `=`, `:=`, `+=`, `-=`, `*=`, `/=`, `//=`, `%=`, `**=`, `@=`, `&=`, `\|=`, `^=`, `<<=`, `>>=` |
| Punctuation | `(`, `)`, `[`, `]`, `{`, `}`, `,`, `:`, `.`, `;`, `->`, `...`, `@`, `=` |
| Star / DStar | `*`, `**` |

The walrus `:=` is valid only in expression contexts (if and while headers, comprehensions, call arguments, and parenthesized expressions); a bare `x := 1` at statement level is a syntax error unless parenthesized.
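A sketch of where the walrus is and is not accepted:

```python
# Sketch: expression positions where := is accepted.
data = [1, 2, 3, 4]
if (n := len(data)) > 3:                       # if header: OK
    print(n)                                   # 4
evens = [y for x in data if (y := x * 2) > 4]  # comprehension: OK
print(evens)                                   # [6, 8]
(m := 5)                                       # parenthesized statement: OK
# m := 5 as a bare statement would be a SyntaxError
```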

End-of-line and end-of-file tokens

| Token | Emitted when |
| --- | --- |
| NEWLINE | The end of a logical line. |
| NL | The end of a physical line that does not end a logical line. |
| INDENT | Before the first token on a more-indented logical line. |
| DEDENT | Before the first token on a less-indented logical line. |
| ENDMARKER | One past the last token in the file. |

NL is needed by tokenize consumers; the parser ignores it.
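The NEWLINE/NL distinction is visible through the stdlib `tokenize` module (a sketch against CPython):

```python
# Sketch: NL (physical-line end) vs NEWLINE (logical-line end).
import io
import tokenize

src = "x = 1\n# comment\ny = 2\n"
kinds = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(kinds)  # the comment line ends in NL, the code lines in NEWLINE
```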

Gopy status

| Area | State |
| --- | --- |
| Encoding declarations | Complete. |
| UTF-8, latin-1, ASCII source | Complete. |
| Identifiers (UAX-31 + NFKC) | Complete. |
| All hard and soft keywords | Complete. |
| All string prefixes including `t""` | Complete. |
| All escape forms including `\N{...}` | Complete. |
| f-string nesting (PEP 701) | Complete. |
| `_tokenize` parity with CPython | Complete as of v0.12.4. |

Source lives under tokenize/, token/, and parser/lexer/.
