Skip to main content

Hashing

Hashing turns an object into an integer that two equal objects share. The integer feeds dicts, sets, frozensets, and any algorithm built on hash-based dispatch. Python's hash rules are specific: equal objects must hash to the same value; the hash must be stable for the lifetime of the object; user-defined __hash__ implementations must satisfy both invariants or get "unhashable" errors at use sites.

CPython implements hashing with SipHash-1-3 for variable-length inputs (bytes, str, memoryview), the integer-shaping formula for numeric types, and identity hashing for objects without a custom __hash__. The per-process hash secret randomises the inputs so that adversarial collisions cannot be precomputed across runs.

gopy implements the same. The package is hash/.

Where the code lives

FileRoleCPython counterpart
hash/hash.goPublic API: Buffer, BufferFNV. Constants. FuncDef/GetFuncDef.Python/pyhash.c
hash/siphash.goSipHash-1-3 implementation.Python/pyhash.c siphash13
hash/secret.goThe per-process 16-byte hash secret.Python/pyhash.c PyHash_State
hash/fnv.goFNV-1a fallback (rarely used in practice).Python/pyhash.c fnv

The constants

// hash/hash.go constants
const (
HashBits = 61 // Mersenne prime exponent
HashModulus = (1 << HashBits) - 1
HashInf = 314159
HashImag = 1000003
HashAlgorithm = "siphash13"
)

The hash modulus is a Mersenne prime. Reducing modulo a prime distributes hash values evenly across the integer range and preserves the property that hash(-1) != -1 (which would collide with the error sentinel).

HashInf is the canonical hash of float('inf'). HashImag is the multiplier used in Complex.__hash__.

HashAlgorithm is reported by sys.hash_info.algorithm.

The variable-length hash

Buffer(src []byte) int64 is the primary entry point. It hashes a byte slice and returns a 64-bit signed integer reduced to the positive range (with -1 remapped to -2).

// hash/hash.go Buffer
func Buffer(src []byte) int64

Implementation: SipHash-1-3 with the per-process secret as the key. The output is a 64-bit unsigned value; conversion to signed preserves the bit pattern; the -1-to--2 remap follows.

BufferFNV is the FNV-1a alternative. CPython supports both at build time; gopy supports both at run time. BufferFNV is the fallback if Buffer ever needs to be swapped at startup.

// hash/hash.go BufferFNV
func BufferFNV(src []byte) int64

SipHash-1-3

// hash/siphash.go siphash13
func siphash13(k0, k1 uint64, src []byte) uint64

SipHash is a keyed pseudo-random function designed by Aumasson and Bernstein. The "1-3" variant uses one compression round per block and three finalisation rounds, the same configuration CPython uses. The 1-3 variant is fast and resists hash-flooding attacks for typical workloads.

The implementation is straight Go translated from the canonical C reference. Block processing reads 64-bit words from src in little-endian order; the final partial block packs the remaining bytes into a tagged word that includes the message length.

The hash secret

The secret is 16 bytes. The first 8 bytes are k0; the next 8 are k1.

// hash/secret.go HashSecret
var HashSecret [16]byte

The secret is initialised at program start. The source is one of:

  • crypto/rand (the default; reads from the OS entropy source).
  • The PYTHONHASHSEED environment variable, parsed as an integer. The interpreter expands the integer to 16 bytes deterministically for reproducible runs (debugging, fuzzing).
  • The literal value zero, when PYTHONHASHSEED=0. This disables hash randomisation entirely. Programs that depend on stable hash output across runs (rare, and discouraged) can use this.

The secret is shared across goroutines because it does not change during the program's lifetime. Reads are unsynchronised.

// hash/secret.go siphashKeys
func siphashKeys() (uint64, uint64)

// hash/secret.go fnvKeys
func fnvKeys() (uint64, uint64)

The siphashKeys accessor returns k0/k1 from the secret. The fnvKeys accessor returns the per-process FNV offsets.

Numeric hashing

Numeric types follow a different scheme. The hash must satisfy:

  • hash(int(x)) == hash(float(x)) when both produce the same mathematical value.
  • hash(complex(x, 0)) == hash(x).
  • hash(rational) == hash(equivalent int/float).

The unifying formula is in Include/cpython/pyhash.h:

For rational p/q (gcd(p, q) == 1):
h = pow(q, HashModulus - 2, HashModulus) # modular inverse of q
h = (|p| * h) % HashModulus
if p < 0: h = -h
if h == -1: h = -2
return h

For integers, q = 1, so the formula reduces to |p| mod HashModulus. For floats, the value is decomposed into mantissa and exponent, the exponent is scaled into the modular ring, and the mantissa is combined.

The Int.Hash, Float.Hash, Complex.Hash, and Bool.Hash methods all route through this single formula. The implementations are spread across objects/int.go, objects/float.go, objects/complex.go, and objects/bool.go, but the algorithm is shared.

String and bytes hashing

str.Hash first encodes the codepoints to bytes (UTF-8) and then calls Buffer. The bytes are not cached at the codepoint level; the encoded form is computed once per hash call.

bytes.Hash calls Buffer directly on the byte slice.

The hash is cached on the object after the first computation; subsequent hashes return the cached value without recomputing. The cache lives in a header field on Bytes and Str for exactly this reason.

Tuple and frozenset hashing

tuple.Hash combines the hashes of its elements with the same formula CPython uses (a multiplicative-additive mix with a specific seed). The formula is in Objects/tupleobject.c tuplehash.

// objects/tuple.go (*Tuple).Hash
func (t *Tuple) Hash() (int64, error)

frozenset.Hash is order-independent: the elements are hashed individually, then combined with XOR plus a length-dependent salt. The formula in Objects/setobject.c frozenset_hash is the one gopy implements.

set.__hash__ is None (mutable sets are unhashable).

Identity hashing

For objects with no custom __hash__, the default is the identity hash: the object's address, mixed with the hash secret to make it unpredictable across runs.

// objects/identity.go IdentityHash
func IdentityHash(o Object) int64

The identity is the Go pointer interpreted as a uintptr. The mix is one SipHash invocation over the pointer bytes; the result is the hash. Two objects with the same address (impossible while both are alive) would hash the same; this is the precise invariant the default __eq__ (identity-comparison) requires.

Unhashable

A type with __hash__ = None is unhashable. Attempting to use it as a dict key or set member raises TypeError. The check is in the dict and set implementations:

// objects/dict.go SetItem
func (d *Dict) SetItem(key, value Object) error {
h, err := Hash(key)
if err != nil { return err }
// ...
}

Hash(key) returns an error if the type's Hash slot is nil. The error is the standard TypeError("unhashable type: ...").

User code can also explicitly mark a type unhashable by setting __hash__ = None in its class body; the type constructor honours this and leaves the slot nil.

Hash and equality

The hash and equality invariant: if a == b then hash(a) == hash(b). Violating this leads to dicts that lose keys and sets that behave inconsistently.

The reverse is not required: hash(a) == hash(b) does not imply a == b. Collisions are expected and handled by the dict and set implementations.

Programs that define __eq__ should also define __hash__ (or explicitly mark the type unhashable). The type constructor enforces this by defaulting __hash__ to None when __eq__ is defined without __hash__.

Status

SipHash-1-3, FNV-1a, the secret, the numeric formula, the tuple/frozenset combiners, and the identity hash all work. The -1-to--2 remap is in place. PYTHONHASHSEED is honoured. sys.hash_info reports the right values.

Reference

  • Port source: hash/.
  • CPython source: Python/pyhash.c, Include/cpython/pyhash.h.
  • SipHash: a fast short-input PRF, Aumasson and Bernstein, 2012.
  • CVE-2011-4885 (the hash-flooding vulnerability that motivated hash randomisation in Python and other languages).