Skip to content

ADR 0004 — Database collation

  • Status: Accepted
  • Date: 2026-04-27
  • Decision drivers: index stability across host upgrades, predictable performance, explicit (not implicit) locale-aware ordering

Context

PostgreSQL fixes the database-wide collation at initdb time. The choice affects how indexes order strings, how case-insensitive comparisons behave (ILIKE, lower(), citext), and — crucially — whether indexes survive host OS upgrades.

The three common defaults:

  • en_US.UTF-8 (or any locale-aware libc collation). Mac/Linux distro default. Glibc-version dependent: when the host OS upgrades glibc, sort order can change silently and previously-built B-tree indexes become subtly corrupt. This is a well-documented production hazard for any non-trivial Postgres deployment.
  • C.UTF-8 (libc, byte-order Unicode). Stable across host versions. Fast. Sort order is byte order — not "natural" Portuguese, not locale-aware.
  • und-x-icu / en-US-x-icu / pt-BR-x-icu (ICU). Version-stable across hosts (ICU ships with the Postgres image). Locale-aware. Slightly higher overhead than C.UTF-8.

Decision

The Molib database is initialized with LC_COLLATE=C.UTF-8, LC_CTYPE=C.UTF-8, encoding UTF-8.

Concretely, the db service in docker-compose.yml sets:

environment:
  POSTGRES_INITDB_ARGS: "--encoding=UTF8 --locale=C.UTF-8"

For columns that genuinely need natural-language ordering (e.g. sorting user-visible names in Brazilian Portuguese for a UI list), apply an ICU collation per column or per query — for example name text COLLATE "pt-BR-x-icu", or ORDER BY name COLLATE "pt-BR-x-icu". Locale-aware sort is opt-in and explicit; the default is fast, byte-stable, and survives glibc upgrades.

Application-level normalization (lowercase, trim, accent-fold) remains the right answer for case-insensitive uniqueness. With C.UTF-8, a unique constraint on a pre-normalized lowercase column behaves as expected without needing citext or LOWER() functional indexes.

Consequences

Positive:

  • No glibc-version corruption risk. Container base-image upgrades, host migrations, and replica setups are all safe.
  • Predictable, high-performance default for indexes.
  • Application boundaries do the normalization work — no implicit "the database happens to be case-insensitive" surprises.
  • Locale-aware sort is explicit, easy to audit, and scoped to where it actually matters.

Negative:

  • Default ORDER BY some_text is byte order, not natural Portuguese order. UI lists that present user-visible strings sorted naturally must specify COLLATE "pt-BR-x-icu" (or similar) explicitly. Document this on relevant columns in wiki/models/.
  • Pattern matching against accented characters is byte-level. When the project needs accent-insensitive search, install the unaccent extension and pre-normalize at the application layer. Out of scope for now.

Operational notes

  • This ADR's locale takes effect at initdb only. Existing pgdata volumes are not retroactively re-collated. When this ADR is first applied to a project that already has a pgdata volume, the volume must be wiped (docker compose down -v) and re-initialized. Acceptable now (no real data); after data exists, switching collation is a pg_dump / pg_restore migration.
  • The dev db service should pin its image (already done: postgres:18.3) so initdb behavior is reproducible.

Triggers for revisiting

  • Significant UX over sorted Portuguese (or other locale-natural) strings — may justify per-column ICU at scale, or even a re-init with pt-BR-x-icu as the default if it dominates.
  • Multilingual content with heavy collation needs — same trigger.
  • Accent-insensitive search becomes a feature → install unaccent; still inside this ADR's framework.