Fitness Functions: Unit Tests for Your Architecture
Most teams have a testing strategy. Unit tests for behavior. Integration tests for contracts. E2E tests for user flows. Linters for code style.
But ask: what tests your architecture?
When a developer imports from another module’s internals — what catches it? When a dependency flows in the wrong direction — what fails? When a new service bypasses the repository layer and queries the database directly — what blocks the PR?
Usually, nothing. The architecture is the one thing everyone agrees matters and nobody automates.
The Missing Layer
The testing pyramid has a gap. Linters catch style. Unit tests catch logic. Integration tests catch contracts. Nothing catches structural decay — the slow erosion of module boundaries, layer violations, coupling that grows session by session.
Fitness functions fill that gap. A fitness function is any automated check that protects an architectural property. Not what the code does — how the code is organized.
The boundary between a unit test and a fitness function is the question they answer:
- Unit test: “Does this function return the correct value?”
- Fitness function: “Does this module still respect its boundaries?”
Both are automated. Both fail when things break. Both prevent regression. But they protect different things — behavior vs. structure.
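The distinction is concrete enough to write down. Below is a minimal sketch of a structural fitness function in TypeScript. It assumes the import graph has already been extracted from the codebase by some tool; the module names and layer assignments are illustrative, not a real project:

```typescript
type DepGraph = Record<string, string[]>; // module -> modules it imports
type Layer = "ui" | "domain" | "data";

// Allowed direction: ui -> domain -> data. Nothing imports upward.
const layerOf: Record<string, Layer> = {
  "components/UserCard": "ui",
  "services/UserService": "domain",
  "repositories/UserRepo": "data",
};
const rank: Record<Layer, number> = { ui: 0, domain: 1, data: 2 };

function boundaryViolations(graph: DepGraph): string[] {
  const violations: string[] = [];
  for (const [mod, deps] of Object.entries(graph)) {
    for (const dep of deps) {
      if (!(mod in layerOf) || !(dep in layerOf)) continue; // unmapped: skip
      // An import may only point downward in the layering.
      if (rank[layerOf[dep]] < rank[layerOf[mod]]) {
        violations.push(`${mod} -> ${dep} imports against layer direction`);
      }
    }
  }
  return violations;
}

// A clean graph passes; a repository reaching back into the UI fails.
const clean: DepGraph = {
  "components/UserCard": ["services/UserService"],
  "services/UserService": ["repositories/UserRepo"],
};
const dirty: DepGraph = {
  ...clean,
  "repositories/UserRepo": ["components/UserCard"],
};
```

Run it in CI like any other test: the assertion is about structure, not behavior.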
Why This Matters in the AI Era
Before AI, structural degradation was slow. One shortcut per sprint. One boundary violation per feature. Teams noticed it in quarterly reviews, if they noticed at all.
AI changed the speed. An agent can generate 20 files in a session. If those files follow the wrong pattern or cross the wrong boundary, the degradation that used to take months happens in an afternoon.
Faster code generation without structural enforcement is faster decay.
This is not the AI’s fault. The agent follows whatever patterns exist in the codebase. A well-structured codebase with clear boundaries produces reliable output. A messy codebase with implicit rules produces inconsistent output regardless of the model.
But the relationship goes deeper than speed. AI agents need structure to work well — and fitness functions are what keep that structure intact.
AI amplifies whatever patterns exist. When boundaries are clean, the agent respects them. When boundaries have exceptions, the agent copies the exceptions. One violation in a module becomes the pattern the agent follows for every new file in that module. Without fitness functions, a single shortcut propagates at machine speed.
AI can’t review its own architecture. An agent generates code that passes linters and tests. It looks correct. But whether the code respects module boundaries, whether it maintains layer direction, whether it keeps contracts consistent — these are structural properties the agent doesn’t check unless you give it a fitness function that checks them.
AI makes the cost of not enforcing structure exponential. With one engineer writing code manually, an unenforced rule produces maybe one violation per week. With multiple AI agents generating code in parallel, the same unenforced rule produces violations in every session. The longer you wait to add the fitness function, the more cleanup you need.
Structure was always important. AI made it critical infrastructure.
Architecture Is Not One Thing
Architecture spans multiple dimensions. Each dimension can degrade independently. Each needs its own fitness functions.
| Dimension | What degrades without it | What the fitness function checks |
|---|---|---|
| Structure | Module boundaries blur, coupling grows silently | Import direction enforcement, no circular dependencies, coupling thresholds |
| Contracts | API changes break clients, schema drifts from reality | Breaking change detection on PR, schema-to-code drift alerts |
| Data | Migrations destroy data, schemas diverge from entities | Destructive operation blocking, entity-schema consistency validation |
| Maintainability | Files grow, complexity hides, dead code accumulates | Cyclomatic complexity limits, file size caps, dead export detection |
| Performance | Bundles bloat, queries slow, response times creep | Bundle size budgets per route, query cost limits, response time thresholds |
| Security | Secrets leak, dependencies rot, endpoints go unprotected | Secret scanning in commits, CVE detection, auth enforcement on all endpoints |
| Reliability | Errors cascade, timeouts are missing, retries are absent | Error boundary requirements, timeout enforcement, retry policy validation |
| Observability | Incidents take hours to diagnose, logs are unstructured | Structured log format enforcement, required trace spans on critical paths |
Not every project needs every dimension. A payment system needs strong security, data, and reliability. A content platform may prioritize performance and contracts. A monorepo with many teams needs structure and contracts above all.
The question to ask for each dimension: “If this degrades silently, does the system fail in a way that matters?” If the answer is yes, that dimension needs a fitness function. Most codebases only have automated checks for maintainability (linters) and maybe contracts (type checking). Everything else is trust and manual review — which works until it doesn’t.
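One structure-dimension check from the table, sketched concretely: circular dependency detection as a depth-first search over the import graph. The graph here is hand-written for illustration; in practice a tool would extract it from the source tree:

```typescript
type Graph = Record<string, string[]>; // module -> modules it imports

// Returns the first cycle found (as a path of module names), or null.
function findCycle(graph: Graph): string[] | null {
  const visiting = new Set<string>(); // on the current DFS path
  const done = new Set<string>();     // fully explored, known cycle-free

  function dfs(node: string, path: string[]): string[] | null {
    if (visiting.has(node)) return [...path, node]; // back edge: cycle
    if (done.has(node)) return null;
    visiting.add(node);
    for (const dep of graph[node] ?? []) {
      const cycle = dfs(dep, [...path, node]);
      if (cycle) return cycle;
    }
    visiting.delete(node);
    done.add(node);
    return null;
  }

  for (const node of Object.keys(graph)) {
    const cycle = dfs(node, []);
    if (cycle) return cycle;
  }
  return null;
}

const acyclic: Graph = { a: ["b"], b: ["c"], c: [] };
const cyclic: Graph = { a: ["b"], b: ["c"], c: ["a"] };
```

The same shape works for most structure checks: build a graph, assert a property, fail the build when the property breaks.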
What This Looks Like in Practice
The dimensions are abstract until you see them applied to a real system. Here’s what fitness functions look like for different project types:
API:
| Dimension | Fitness function |
|---|---|
| Contracts | Schema breaking change detection on every PR |
| Structure | Resolver → Service → Repository — no skipping layers |
| Data | Migration naming conventions, entity-schema drift detection |
| Security | Auth decorator required on all resolvers |
| Maintainability | Resolver complexity limits |
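The security row above can be sketched as a source scan. This is a simplified regex-based version, assuming decorator-style resolvers; the decorator names (`@Query`, `@Mutation`, `@RequireAuth`) are placeholders, not a specific framework's API:

```typescript
// Flag resolver methods that lack an auth guard decorator.
function unguardedResolvers(source: string): string[] {
  const missing: string[] = [];
  // Match a run of decorators followed by a method name (simplified parsing).
  const methodRe = /((?:@\w+\([^)]*\)\s*)+)(\w+)\s*\(/g;
  let m: RegExpExecArray | null;
  while ((m = methodRe.exec(source)) !== null) {
    const decorators = m[1];
    const name = m[2];
    const isResolver = /@(Query|Mutation)\(/.test(decorators);
    const hasGuard = /@RequireAuth\(/.test(decorators);
    if (isResolver && !hasGuard) missing.push(name);
  }
  return missing;
}

const sample = `
  @Query() @RequireAuth()
  me() {}

  @Mutation()
  deleteUser() {}
`;
```

A real implementation would parse the AST rather than regex-match, but the fitness function's contract is the same: every resolver carries a guard, or the PR fails.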
Frontend (React, Next.js):
| Dimension | Fitness function |
|---|---|
| Performance | Bundle size budget per route |
| Structure | Component → Hook → Service — no direct API calls from components |
| Contracts | API types generated from schema, no manual type definitions |
| Maintainability | Component file size limits |
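The "no direct API calls from components" rule can be sketched as a per-file check. The directory convention and client names here are assumptions for illustration:

```typescript
// Components must go through the service layer; flag HTTP access done
// directly in a component file.
function directApiCalls(path: string, source: string): string[] {
  if (!path.includes("/components/")) return []; // rule scoped to components
  const findings: string[] = [];
  if (/from\s+['"]axios['"]/.test(source)) findings.push("imports axios");
  if (/\bfetch\s*\(/.test(source)) findings.push("calls fetch()");
  return findings;
}

const componentHit = directApiCalls(
  "src/components/UserCard.tsx",
  `const r = await fetch("/api/user");`
);
const serviceOk = directApiCalls(
  "src/services/user.ts",
  `const r = await fetch("/api/user");`
);
```

Run over every changed file in a PR, this turns an implicit team convention into a hard boundary.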
Monorepo with multiple teams:
| Dimension | Fitness function |
|---|---|
| Structure | Package boundary enforcement — import only from public API |
| Structure | No circular dependencies between packages |
| Contracts | Shared type packages as single source of truth |
| Maintainability | Shared lint and TypeScript config enforcement |
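Package boundary enforcement often reduces to one rule: import another package only through its public entry, never a deep path into its internals. A minimal sketch, with `@acme` as a placeholder scope:

```typescript
// "@acme/ui" is fine; "@acme/ui/src/Button" reaches past the public API.
function isDeepImport(specifier: string): boolean {
  return /^@acme\/[^/]+\/(src|dist|internal)\//.test(specifier);
}

function deepImports(specifiers: string[]): string[] {
  return specifiers.filter(isDeepImport);
}

const imports = [
  "@acme/ui",            // public entry: allowed
  "@acme/ui/src/Button", // internal path: violation
  "react",               // external dependency: not our rule's concern
];
```

Monorepo tooling typically offers this as a built-in lint rule; the sketch just shows how small the underlying check is.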
The pattern is the same across project types. Identify the critical dimensions. Write checks that protect them. The specific checks change, but the approach stays consistent.
The Ratchet Pattern
Architecture quality should only go in one direction: up.
The pattern is straightforward:
- Clean up a module — fix boundary violations, standardize patterns
- Add the module to an enforcement list
- Violations in that module become errors, not warnings
- New modules are enforced from creation
The enforcement list grows over time. Quality ratchets up. It never goes down.
A common mistake is writing fitness functions for the codebase you want, not the one you have. That breaks everything on day one. The right order: clean up a module first, then add the fitness function to prevent regression. The function protects the work you already did — it’s a lock, not a goal.
In practice, this works well with architecture migrations. A team migrating a legacy codebase doesn’t need to fix everything before enforcing anything. Migrated modules go on an allowlist — violations are errors. Unmigrated modules get warnings. As the migration progresses, the allowlist grows. The architecture improves incrementally, and every step is protected.
This avoids two failure modes. Enforcing everything at once breaks the existing codebase. Enforcing nothing lets migration work regress immediately. The ratchet is the middle path — protect what is already clean, improve the rest gradually.
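The ratchet itself is a few lines of code: an allowlist of cleaned modules where violations become errors, and warnings everywhere else. Module names are illustrative:

```typescript
type Severity = "error" | "warn";

// Grows as modules are cleaned up; never shrinks.
const enforced = new Set(["billing", "auth"]);

function severityFor(module: string): Severity {
  return enforced.has(module) ? "error" : "warn";
}

interface Violation {
  module: string;
  rule: string;
}

// CI fails only when an enforced module regresses.
function gate(violations: Violation[]): { failCI: boolean; report: string[] } {
  const report = violations.map(
    (v) => `${severityFor(v.module)}: ${v.module} violates ${v.rule}`
  );
  const failCI = violations.some((v) => severityFor(v.module) === "error");
  return { failCI, report };
}
```

Adding a module to `enforced` is the last step of cleaning it up, which keeps the allowlist honest: it only ever contains modules that actually pass.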
Scope and Timing
Fitness functions also vary by what they check and when they check it.
Scope matters because some problems are local and some are systemic:
- Atomic — a single check on a single property. “This file has a lint error.” “This function exceeds complexity limits.” Fast, focused, catches violations as they happen.
- Holistic — combines multiple signals across the system. “All module boundaries respected AND tests pass AND contracts valid.” Catches problems that only emerge when components interact.
Timing matters because different problems appear at different speeds:
| Timing | When it runs | What it catches | Example |
|---|---|---|---|
| Triggered | On file edit, commit, or PR | Individual violations as they happen | Boundary check after every edit. Schema breaking change detector on PR |
| Continuous | Daily or weekly schedule | Gradual decay nobody notices | Dead export scan. Dependency freshness check. Convention drift report |
| Temporal | Pre-release or quarterly | Accumulated risk at scale | Security audit before launch. Architecture review when the team doubles |
A system needs all three. Triggered checks catch mistakes as they happen. Continuous checks catch the slow drift between mistakes. Temporal checks catch the risk that accumulates across both.
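The atomic-versus-holistic distinction can be expressed as composition: a holistic fitness function is just the conjunction of atomic ones, with a report of which signals failed. The check names below are stand-ins for real atomic checks:

```typescript
type Check = { name: string; run: () => boolean };

// A holistic check passes only when every atomic check passes.
function holistic(checks: Check[]): { pass: boolean; failed: string[] } {
  const failed = checks.filter((c) => !c.run()).map((c) => c.name);
  return { pass: failed.length === 0, failed };
}

const releaseGate: Check[] = [
  { name: "boundaries", run: () => true }, // stand-ins for real checks
  { name: "contracts", run: () => false },
  { name: "tests", run: () => true },
];
```

The same composed gate can run at any timing: triggered on a PR, continuously on a schedule, or temporally before a release.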
Evolve With Your Constraints
The right fitness functions depend on the project phase.
Early stage — few rules, flexible patterns. The constraint is speed. Fitness functions would slow down exploration for problems you don’t have yet.
Growth stage — more rules, stricter boundaries. The constraint is consistency. Multiple engineers, multiple AI agents, all generating code. Without enforcement, patterns diverge.
AI-driven stage — fitness functions on every critical dimension. The constraint is trust. The volume of generated code is too high for manual review to catch structural problems.
The temptation is to build fitness functions for problems you don’t have yet. That’s premature enforcement — the architectural equivalent of premature optimization. The better approach is to identify the current constraint, build the checks that protect against it, and evolve when the constraint changes.
Where to Start
Most codebases already have the raw material. Linters check style. Type checkers catch contract violations. Test suites verify behavior. What’s usually missing is the structural layer — the checks that protect architectural decisions.
A practical starting point:
- Pick your critical dimensions. Look at the table above and ask: “If this dimension degrades, does the system fail in a way that matters?” For most teams, structure and contracts come first.
- Check what’s enforced vs. what’s documented. If a rule exists only in documentation or team knowledge, it’s a candidate for a fitness function. The gap between what’s written and what’s enforced is where decay happens.
- Start with one module. Pick the cleanest module in the codebase. Add boundary enforcement. Make violations errors. That module is now protected. One module is better than zero.
- Ratchet from there. As more modules improve, add them to enforcement. New modules get enforced from creation. The allowlist grows. Quality only goes up.
- Add timing layers. Triggered checks for fast feedback during development. A weekly continuous check for drift (dead code, dependency freshness). A quarterly review for accumulated risk.
The architecture that matters is not the one that never changes. It is the one that changes at the right time, for the right reason — and has fitness functions to make sure it doesn’t change back.