ADR 0004 — Database collation¶
- Status: Accepted
- Date: 2026-04-27
- Decision drivers: index stability across host upgrades, predictable performance, explicit (not implicit) locale-aware ordering
Context¶
PostgreSQL fixes the database-wide collation at initdb time. The choice affects how indexes order strings, how case-insensitive comparisons behave (ILIKE, lower(), citext), and — crucially — whether indexes survive host OS upgrades.
The three common defaults:
en_US.UTF-8(or any locale-aware libc collation). Mac/Linux distro default. Glibc-version dependent: when the host OS upgrades glibc, sort order can change silently and previously-built B-tree indexes become subtly corrupt. This is a well-documented production hazard for any non-trivial Postgres deployment.C.UTF-8(libc, byte-order Unicode). Stable across host versions. Fast. Sort order is byte order — not "natural" Portuguese, not locale-aware.und-x-icu/en-US-x-icu/pt-BR-x-icu(ICU). Version-stable across hosts (ICU ships with the Postgres image). Locale-aware. Slightly higher overhead thanC.UTF-8.
Decision¶
The Molib database is initialized with LC_COLLATE=C.UTF-8, LC_CTYPE=C.UTF-8, encoding UTF-8.
Concretely, the db service in docker-compose.yml sets:
For columns that genuinely need natural-language ordering (e.g. sorting user-visible names in Brazilian Portuguese for a UI list), apply an ICU collation per column or per query — for example name text COLLATE "pt-BR-x-icu", or ORDER BY name COLLATE "pt-BR-x-icu". Locale-aware sort is opt-in and explicit; the default is fast, byte-stable, and survives glibc upgrades.
Application-level normalization (lowercase, trim, accent-fold) remains the right answer for case-insensitive uniqueness. With C.UTF-8, a unique constraint on a pre-normalized lowercase column behaves as expected without needing citext or LOWER() functional indexes.
Consequences¶
Positive:
- No glibc-version corruption risk. Container base-image upgrades, host migrations, and replica setups are all safe.
- Predictable, high-performance default for indexes.
- Application boundaries do the normalization work — no implicit "the database happens to be case-insensitive" surprises.
- Locale-aware sort is explicit, easy to audit, and scoped to where it actually matters.
Negative:
- Default
ORDER BY some_textis byte order, not natural Portuguese order. UI lists that present user-visible strings sorted naturally must specifyCOLLATE "pt-BR-x-icu"(or similar) explicitly. Document this on relevant columns inwiki/models/. - Pattern matching against accented characters is byte-level. When the project needs accent-insensitive search, install the
unaccentextension and pre-normalize at the application layer. Out of scope for now.
Operational notes¶
- This ADR's locale takes effect at
initdbonly. Existingpgdatavolumes are not retroactively re-collated. When this ADR is first applied to a project that already has apgdatavolume, the volume must be wiped (docker compose down -v) and re-initialized. Acceptable now (no real data); after data exists, switching collation is apg_dump/pg_restoremigration. - The dev
dbservice should pin its image (already done:postgres:18.3) so initdb behavior is reproducible.
Triggers for revisiting¶
- Significant UX over sorted Portuguese (or other locale-natural) strings — may justify per-column ICU at scale, or even a re-init with
pt-BR-x-icuas the default if it dominates. - Multilingual content with heavy collation needs — same trigger.
- Accent-insensitive search becomes a feature → install
unaccent; still inside this ADR's framework.