The Machine That Reads Badly

A two-part machine learning debugging adventure: first train a model that doesn't work, then fix what the machine broke.

What happens when the machine reads badly? Finally, an honest answer.

Tutorials about machines that read love the happy path. Clean image in, perfect text out, and somewhere in the middle a pipeline behaves like a well-trained employee. Reality is different. In reality the passport is tilted, the hologram is doing exactly its job, the font was designed in 1968, and the OCR engine confidently reports that your surname contains a 5.

This two-part series builds a real, open-source ID-document scanner — and then builds the second system every real scanner needs: the one that fixes the first one's predictable mistakes. Fair warning for the machine-learning crowd: there is no machine learning in it. A lookup table and a specification document outperform the hype, and that is rather the point.

Status: Launches 2 September 2026. The two parts publish one week apart.


The episodes

  1. 🤖 Part I: Teaching Silicon to See — building the scanner: image preprocessing four ways, Tesseract diplomacy, MRZ formats, and why a pipeline beats a prayer
  2. 🔧 Part II: Fixing What the Machine Broke — position-aware error correction, the great filler-character conspiracy, the left-shift problem, and a confidence system that knows what it doesn't know

Each part gets its link in this list automatically once it is published.


What to expect

Two parts, one real codebase (it's on GitHub — clone it and break it). Honest engineering about unreliable input: if your data has structure, your errors can be fixed; if your errors can be characterised, your corrections can be systematic. Funny where it can be, precise where it must be.
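To make "structure lets you fix errors" concrete before the series starts, here is a minimal Python sketch of position-aware correction on a machine-readable zone (MRZ). It is not taken from the series' codebase: the function names and the confusion table are hypothetical, chosen only for illustration. The one borrowed fact is the check-digit rule from ICAO Doc 9303 (weights 7, 3, 1 repeating, sum modulo 10).

```python
# Hypothetical confusion table for illustration; the real project may use a different one.
DIGIT_TO_LETTER = {"0": "O", "1": "I", "2": "Z", "5": "S", "8": "B"}
LETTER_TO_DIGIT = {letter: digit for digit, letter in DIGIT_TO_LETTER.items()}


def fix_alpha(field: str) -> str:
    """Name-like field: only A-Z and '<' are legal, so any digit is an OCR slip."""
    return "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in field)


def fix_numeric(field: str) -> str:
    """Date-like field: only 0-9 are legal, so a look-alike letter is an OCR slip."""
    return "".join(LETTER_TO_DIGIT.get(ch, ch) for ch in field)


def char_value(ch: str) -> int:
    """ICAO 9303 character values: digits as-is, '<' is 0, A-Z are 10-35."""
    if ch in "0123456789":
        return int(ch)
    if ch == "<":
        return 0
    return ord(ch) - ord("A") + 10


def check_digit(field: str) -> int:
    """ICAO 9303 check digit: weights 7, 3, 1 repeating, sum modulo 10."""
    weights = (7, 3, 1)
    return sum(char_value(ch) * weights[i % 3] for i, ch in enumerate(field)) % 10


# The same misread is repaired differently depending on where it sits.
print(fix_alpha("5MITH<<J0HN"))   # a surname cannot contain a 5
print(fix_numeric("74O812"))      # a birth date cannot contain an O
print(check_digit(fix_numeric("74O812")))
```

The position of a character decides which table applies, and the check digit says whether the repair produced something the specification accepts. That is the whole idea in miniature: a lookup table plus a specification, no model required.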
