The Genome: Why Your Pipeline Needs a Bounded Mutation Surface

A self-improving pipeline doesn't start with an LLM. It starts with a file.

One file. One function. One surface the loop is allowed to mutate. The rest of the system stays fixed. The judge doesn't change. The contract doesn't shift. The only thing that moves is the genome.

Prerequisites

Read Lesson 01: The Mutation Engine — covers the bounded mutation contract and boundary-first design
Familiarity with data pipelines and ETL concepts
Basic understanding of LLMs and agent frameworks
Python experience (the examples use autogen-agentchat, pandas, and streamlit)

The Mutation Surface

The CleanLoop pipeline has one mutable file. Not the whole repo. One file keeps diffs reviewable, rollbacks clean, and blame traceable.

That file is clean_data.py. It wraps the runtime, which in turn routes unresolved rows through the mutation playbook:


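python
# clean_data.py: the mutable genome
from pathlib import Path

# Assumption: clean_data_runtime exposes a callable named clean.
# Importing the module itself and calling it would raise TypeError.
from .clean_data_runtime import clean as _runtime_clean


def clean(input_dir: Path, output_path: Path) -> None:
    # Delegate everything to the runtime; the genome holds no business logic.
    _runtime_clean(input_dir, output_path)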

Two lines. That's the genome. The loop proposes changes to this surface. The judge grades the result. If the score improves, the mutation survives. If it doesn't, the loop reverts.

Why One File Matters

When you allow the loop to mutate the whole codebase, you lose auditability. You can't blame a specific change. You can't roll back cleanly. You can't measure whether the loop is actually improving.

The one-file constraint forces discipline. The genome becomes a diff. The diff becomes a log. The log becomes evidence.

Constraint	Purpose
One mutable surface	Keeps diffs reviewable
One fixed judge	Enforces binary pass/fail
One artifact trail	Makes mutations inspectable
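
The one-surface constraint can be enforced mechanically before the judge ever runs. The following is a minimal sketch, not the course's implementation: the repo-relative path in GENOME_PATH and both function names are assumptions made for illustration. It rejects any proposed change set that touches a file other than the genome.

```python
from pathlib import PurePosixPath

# Hypothetical repo-relative path of the single mutable file.
GENOME_PATH = PurePosixPath("cleanloop/clean_data.py")


def out_of_bounds(changed_paths: list[str]) -> list[str]:
    """Return every changed path that is not the genome file."""
    return [p for p in changed_paths if PurePosixPath(p) != GENOME_PATH]


def is_bounded(changed_paths: list[str]) -> bool:
    """A mutation is bounded when it touches the genome and nothing else."""
    return bool(changed_paths) and not out_of_bounds(changed_paths)
```

A loop could feed this the file list from a diff and refuse to score any candidate for which is_bounded returns False.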

As covered in Lesson 01, the bounded mutation contract is the foundation. The genome is the surface that contract operates on.

The Runtime Split

The genome doesn't contain business logic. It delegates to the runtime. The runtime splits rows into two paths: deterministic first, mutation second.

The deterministic path handles known edge cases. Format normalization, type coercion, currency parsing. These rows succeed without the LLM.

The mutation playbook handles the rest. The LLM proposes a fix. The judge grades it. The loop remembers what worked.
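
To make the two paths concrete, here is a small sketch of a deterministic-first split. It is an illustration under stated assumptions: the amount column, parse_amount, and split_rows are hypothetical names, and the real runtime may structure this differently. Rows the deterministic parser can handle are resolved immediately; everything else is queued for the mutation path.

```python
from typing import Optional

Row = dict[str, str]


def parse_amount(raw: str) -> Optional[float]:
    """Deterministic currency parsing; returns None when the format is unknown."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def split_rows(rows: list[Row]) -> tuple[list[dict], list[Row]]:
    """Resolve what the deterministic path can; queue the rest for mutation."""
    resolved, pending = [], []
    for row in rows:
        amount = parse_amount(row.get("amount", ""))
        if amount is None:
            pending.append(row)  # no deterministic rule applies
        else:
            resolved.append({**row, "amount": amount})
    return resolved, pending
```

Only the pending list would ever reach the LLM, which keeps the stochastic path as small as the data allows.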

The Starter Genome

Not every loop starts from scratch. The starter genome is a deterministic-only baseline. Same shape as the mutable genome, but it calls a different runtime path.

The starter genome lets you reset the loop. If the mutation drifts too far, or if you want to compare deterministic vs. stochastic recovery, you can fall back to the baseline.
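
Falling back to the baseline can be as simple as copying one file over another. The sketch below assumes the starter genome lives in a separate file; the name clean_data_starter.py and the reset_genome helper are hypothetical, and the demo runs in a temporary directory so it touches nothing real.

```python
import shutil
import tempfile
from pathlib import Path


def reset_genome(starter_path: Path, genome_path: Path) -> None:
    """Overwrite the mutable genome with the deterministic-only baseline."""
    shutil.copyfile(starter_path, genome_path)


# Demo in a throwaway directory with placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    starter = Path(tmp) / "clean_data_starter.py"  # hypothetical file name
    genome = Path(tmp) / "clean_data.py"
    starter.write_text("# deterministic-only baseline\n")
    genome.write_text("# drifted after many mutations\n")
    reset_genome(starter, genome)
    restored = genome.read_text()
```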

The Mutation Playbook

The runtime doesn't call the LLM directly. It routes through the mutation playbook. The playbook normalizes tokens, resolves rules, and builds the candidate row.

The playbook keeps the LLM out of the runtime path. The runtime routes. The playbook resolves. The LLM proposes. The judge grades.
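
The three playbook steps (normalize tokens, resolve rules, build the candidate row) might look roughly like the sketch below. The RULES table, the currency field, and both function names are assumptions for illustration, not the course's actual playbook.

```python
from typing import Optional

# Hypothetical rule table: normalized token -> canonical value.
RULES = {"usd": "USD", "us dollars": "USD", "eur": "EUR", "euro": "EUR"}


def normalize_token(raw: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(raw.lower().split())


def build_candidate(row: dict, field: str, rules: dict = RULES) -> Optional[dict]:
    """Return a repaired copy of the row, or None when no rule applies."""
    token = normalize_token(str(row.get(field, "")))
    if token not in rules:
        # Unresolved: this is the point where an LLM proposal would be requested.
        return None
    return {**row, field: rules[token]}
```

Returning a new dict rather than editing the row in place leaves the original available for the judge and the evidence trail.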

Why the Judge Stays Fixed

The moment the judge changes, the loop becomes untrustworthy. You can't measure improvement if the grading rubric shifts. You can't blame a mutation if the judge learns to pass everything.

The fixed judge enforces binary assertions: row count and column drift checks. No fuzzy scoring. No partial credit. Each assertion passes or fails. The loop compares the number of passing assertions before and after a mutation to decide whether to commit or revert.
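
A judge of this kind fits in a few lines. The sketch below is a hypothetical shape, assuming the score is the count of passing assertions; the function signature and the exact definition of each check are assumptions, since the article names only the two check types.

```python
def judge(output_rows: list[dict], expected_rows: int,
          expected_columns: list[str]) -> tuple[int, int]:
    """Return (passed, total) over a fixed list of binary assertions."""
    expected = set(expected_columns)
    assertions = [
        # Row count: the output has exactly the expected number of rows.
        len(output_rows) == expected_rows,
        # Column drift: no row has gained, lost, or renamed a column.
        all(set(row) == expected for row in output_rows),
    ]
    return sum(assertions), len(assertions)
```

Because the function holds no state and takes no model input, the same output always receives the same score, which is what makes score deltas between rounds comparable.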

The Orchestration Seam

AutoGen sits at the orchestration seam, not inside the judge. The framework coordinates repair attempts and candidate generation. Deterministic code decides whether the mutation survives.


python
# loop.py: commit/revert gate
if new_score > score:
    _git_commit(
        f"loop: round {i} -- {new_score}/{new_total} "
        f"(+{new_score - score}) {hypothesis[:40]}"
    )
    action = "commit"
else:
    _git_revert()
    _restore_genome_snapshot(GENOME_PATH, genome_before)
    action = "revert"

The split keeps the loop inspectable. You can blame a specific mutation. You can roll back cleanly. You can measure whether the loop is actually improving over time.
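
The gate's snapshot-and-restore behavior can be exercised without git. The following sketch is an assumption-laden stand-in, not loop.py itself: run_round is a hypothetical name, the git calls are omitted, and evaluate stands in for the fixed judge. It keeps a candidate only when the score strictly improves, so a tie reverts.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_round(genome_path: Path, candidate_source: str, score: int,
              evaluate: Callable[[], int]) -> tuple[str, int]:
    """Apply one candidate, then keep it only if the score strictly improves."""
    genome_before = genome_path.read_text()  # snapshot before mutating
    genome_path.write_text(candidate_source)
    new_score = evaluate()
    if new_score > score:
        return "commit", new_score
    genome_path.write_text(genome_before)    # restore the snapshot
    return "revert", score


# Demo: one improving candidate, then one that only ties the current score.
with tempfile.TemporaryDirectory() as tmp:
    genome = Path(tmp) / "clean_data.py"
    genome.write_text("# baseline\n")
    kept = run_round(genome, "# better\n", score=3, evaluate=lambda: 4)
    dropped = run_round(genome, "# worse\n", score=4, evaluate=lambda: 4)
    final_source = genome.read_text()
```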

The Evidence Trail

You can't trust a loop you can't inspect. The artifact trail logs every mutation decision:

Score deltas per round — was the loop improving or regressing?
Genome diffs — what exactly changed in clean_data.py?
Row-level decisions — which rows succeeded deterministically, which went through mutation, which stayed unresolved?
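
One way to capture all three kinds of evidence is a single JSON record per round. This is a sketch under assumptions: the field names, the round_record helper, and the row-count categories are illustrative, and the course's artifact format may differ. The diff comes from the standard library's difflib.

```python
import difflib
import json


def round_record(round_no: int, score_before: int, score_after: int,
                 genome_before: str, genome_after: str,
                 row_counts: dict) -> str:
    """Serialize one round's evidence as a single JSON line."""
    diff = "".join(difflib.unified_diff(
        genome_before.splitlines(keepends=True),
        genome_after.splitlines(keepends=True),
        fromfile="clean_data.py (before)",
        tofile="clean_data.py (after)",
    ))
    record = {
        "round": round_no,
        "score_delta": score_after - score_before,
        "action": "commit" if score_after > score_before else "revert",
        "genome_diff": diff,
        "rows": row_counts,  # e.g. deterministic / mutation / unresolved
    }
    return json.dumps(record, sort_keys=True)
```

Appending one such line per round to a log file gives a dashboard something it can read directly, with no parsing beyond json.loads.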

The Streamlit dashboard turns this evidence into operator-facing metrics. That's the difference between a system that learns and one that hallucinates.

Why Lesson 02 Locks the Genome

Lesson 01 proves the boundary. Lesson 02 locks the genome. You can't build a self-improving loop if you don't know what's allowed to change.

The genome becomes the first concrete artifact. From here, the course escalates to orchestration, observability, pressure testing, and safety.

Next Steps

This article covers the genome constraint behind Lesson 02 of the Self-Evolving Data Engineer course. The full lesson includes a live demo of the genome diff, a walkthrough of the mutation playbook, and a dashboard inspection of recovery rates.

Watch Lesson 02: [Link coming soon]

Previous in series: Lesson 01: The Mutation Engine

Next in series: Lesson 03: The Orchestrator. The loop learns to coordinate repair attempts across multiple failure modes.

The full course and example code are open source on GitHub.

This article is part of the Self-Evolving Data Engineer series. The loop only works when the boundary stays narrow and the enforcement stays fixed.
