An OCR Pipeline for Machine Learning: the Good, the Bad, and the Ugly

Hypotenuse Labs
9 min read · Feb 20, 2020

Disclaimer: All material in this post has been used with permission. Certain details have been omitted for client confidentiality.

tl;dr Optical Character Recognition (OCR) can help turn your scanned documents into useful data, automate your manual processes, and save massive amounts of time and effort.

Sounds straightforward — but what actually goes into it? We illustrate with a case study: an OCR data pipeline we built for a legaltech company using Python, AWS Textract, and Puppeteer.

Introduction

We’ve met many companies struggling to manage huge libraries of paper documents. These libraries are a treasure trove of messy but important information for the business.

Some examples include:

  • Legal documents.
  • Financial statements.
  • Patient records.
  • Even forms filled out by customers.

The first step to managing these is digitization. Let’s scan them to a PDF file:

Figure 1: Example of a page from a scanned document.

Suppose you’re a company that needs to search through a collection of scanned loan applications. Applications could contain tables, free-form responses, and handwritten answers.

Search queries can be fairly complex, such as “find applications where the applicant has an income less than $40,000 and didn’t fully answer the fourth question.”

Typically, a team of employees does this by reading through the scanned files and documents, filtering them based on the search query. They might also manually input document data into a spreadsheet or database.

However, this can be hugely time-consuming. With a large library of files, it’s just not feasible.

What’s OCR? And what’s a pipeline?

Optical Character Recognition (OCR) is used to process images or scanned documents to produce raw text or other structured output.

Using OCR software, a company can process all of their scanned loan applications. They can then collate the resulting data in a database for easy querying.

This takes a fraction of the time it would take to perform manually.

To do this at scale, we need an automatic sequence of steps — a data pipeline. This pipeline transforms scanned documents into raw text data with OCR.

Businesses with a recurring need for searching and analyzing information from physical documents would benefit the most from one.

Here’s an example of output from running Figure 1 through a popular OCR package:

Figure 2: Raw text extracted from the scanned page in Figure 1.

The software outputs the text detected inside the document. This text can be stored in a database for search and analysis later on.

Some OCR programs also output additional information, including:

  • Confidence scores for the results.
  • The coordinates of each detected word on the scanned document.

Depending on your use case, this output may need some pre-processing before it is stored. Typically, results are converted into a normalized format, so that multiple use cases can share the same common data structures.
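As an example, here is a minimal Python sketch of such a normalization step, assuming word-level output shaped like AWS Textract's response; the flat record format below is illustrative rather than a standard:

```python
# A minimal sketch of normalizing OCR output, assuming a response shaped
# like AWS Textract's (Blocks with BlockType, Text, Confidence, Geometry).
# The flat record format below is illustrative, not a standard.

def normalize_ocr_output(ocr_response: dict) -> list:
    """Flatten word-level OCR results into simple records for storage."""
    records = []
    for block in ocr_response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue
        box = block["Geometry"]["BoundingBox"]
        records.append({
            "text": block["Text"],
            "confidence": block["Confidence"],  # 0-100 score from the OCR engine
            "page": block.get("Page", 1),
            "bbox": (box["Left"], box["Top"], box["Width"], box["Height"]),
        })
    return records
```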

With this raw data, we can do a lot of interesting things! Some ideas:

  • Automating/streamlining administrative processes.
  • Indexing and searching through the content of documents efficiently.
  • Integrating scanned documents with other existing datasets.
  • Classifying scanned pages/documents based on the text they contain.
  • Applying Natural Language Processing (NLP) techniques to analyze the textual data for key features/patterns.
  • Automatically summarizing documents.
  • And more!

Side note: Some PDF readers have built-in OCR to help when searching a scanned document. However, they usually aren’t capable of large-scale text extraction across many documents, or of interpreting structured data such as tables and forms. One exception is ABBYY, which offers the technology underlying ABBYY FineReader as a web API.

A Project

Our client approached us to build an intelligent search and recommendation engine for a specific legal domain.

A foundational component of this would be an OCR pipeline. This is responsible for automatically ingesting new legal cases on a recurring basis.

Figure 3: OCR pipeline architecture diagram.

The pipeline has four stages:
  1. Scrape for publicly available case files in the client’s domain. These are typically scanned PDF documents published online.
  2. Process these files using an OCR system, which outputs the document’s text and structure.
  3. Post-process OCR output to correct common errors and generate normalized documents.
  4. Extract important legal information (features) from the document text and structure, storing results in the case database.

At the end of the pipeline, features are stored in a database and spatially indexed for efficient clustering.

Prototyping

Before building the full pipeline, the first task was to test the feasibility of the OCR System and Feature Extractor components.

A company might be able to develop an OCR solution in-house, which may work well for a specific domain. However, developing one from scratch is a significant undertaking.

Based on the project’s budget, requirements, and timeline, it made more sense to go with an off-the-shelf OCR solution, and customize it to fit our needs.

We tested several OCR solutions, both cloud-based and open source, against our client’s documents:

  • Google Cloud Vision OCR.
  • Tesseract OCR.
  • Microsoft Computer Vision API.
  • Amazon Textract.

Many of these are based on machine learning models trained on millions of documents.

In the end, we found the best option to be Amazon Textract, for its accuracy and unique ability to read table contents in a structured manner. It is also relatively well-supported for production-sized applications, and is well integrated with the AWS ecosystem.
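For reference, here is a rough sketch of what driving Textract from Python looks like for a multi-page scanned PDF stored in S3. The bucket and key are placeholders, and the polling loop is a simplification; a production setup would typically use SNS notifications instead.

```python
# Hedged sketch: run Textract's asynchronous document analysis on a scanned
# PDF in S3 and collect the resulting blocks. Bucket/key are placeholders.
import time
import boto3

textract = boto3.client("textract")

def analyze_scanned_pdf(bucket: str, key: str) -> list:
    """Run document analysis (text + tables + forms) and return all blocks."""
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
    )
    job_id = job["JobId"]

    # Poll until the job finishes (simplified; SNS is the usual production route).
    while True:
        result = textract.get_document_analysis(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(5)

    if result["JobStatus"] == "FAILED":
        raise RuntimeError(f"Textract job {job_id} failed")

    # Results are paginated; follow NextToken to collect every block.
    blocks = result["Blocks"]
    while "NextToken" in result:
        result = textract.get_document_analysis(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result["Blocks"])
    return blocks
```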

However, no OCR program is perfect. It’s important to audit the results for common mistakes that could harm your product’s results.

Challenge: When comparing the scanned legal documents against Textract’s OCR output, we found that the output contained spelling mistakes, typos, and misread characters.

Consider the simple task of reading a number. When reading numbers that contain commas, Textract sometimes labels the commas as periods.

Figure 4: Low-quality scan of a table. AWS Textract misclassified commas as periods in values such as “1,371”.

When reading crucial numbers from financial statements, this could cause results to be incorrect by orders of magnitude!

Mistaking $1,065 for $1.065 would seriously damage our recommendation engine’s performance — not acceptable.

In modern ML applications, obtaining high quality and clean training data is often the biggest challenge.

We noticed errors were more frequent in lower-DPI scans, especially when the scanner had applied a threshold filter.

Solution: Errors such as the ambiguous commas and decimals in Figure 4 are corrected via a dedicated post-processing component in our pipeline.

For a human, determining whether that black dot is a period or a comma is easy.

But this human reasoning is made up of many considerations:

  • Is the number inside a financial statement? How will our Feature Extractor program know if it’s looking at a financial statement?
  • Is the value a subtotal? Should we check it against the sum of the numbers directly above it?
  • Is the value abbreviated? Financial statement values are sometimes written in thousands or millions of dollars, and this fact may not even be present on the page.
  • What part of the world was the document written in? There are differing conventions used for commas vs. periods in numbers.
  • How many digits are after the punctuation? If this is a dollar value, there are conventionally two digits after the decimal point to represent cents.

Software that corrects these errors needs to replicate all of the above reasoning. We do this by including contextual features in our model, such as surrounding text, document headings, and the structure of the table.

Models in this domain are typically either rule matchers or, when there is enough high-quality training data, statistical models. Rule matchers explicitly encode the logic above, while statistical models learn it by example.

In our client’s domain, financial figures almost never contained decimals. The legal cases were also all from the same country. This made number correction a trivial pattern match.
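Here is a minimal sketch of that kind of pattern match, under the domain assumption above (financial figures essentially never contain decimals). The real correction logic also weighed surrounding context, as described earlier; this is only the core idea.

```python
# Minimal sketch of the comma/period correction, assuming (as in our
# client's domain) that financial figures essentially never contain decimals,
# so "1.371" in a table of dollar amounts is almost certainly "1,371".
import re

# A period preceded by a digit and followed by exactly three digits (a thousands group).
MISREAD_COMMA = re.compile(r"(?<=\d)\.(?=\d{3}\b)")

def fix_misread_commas(token: str) -> str:
    """Replace periods that sit where thousands separators should be."""
    return MISREAD_COMMA.sub(",", token)

assert fix_misread_commas("1.371") == "1,371"
assert fix_misread_commas("2.405.112") == "2,405,112"
assert fix_misread_commas("0.5") == "0.5"  # left alone: not a thousands group
```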

For this and other common OCR errors, we were able to consider context and domain knowledge to automatically correct them.

Side note: We previously noted that tilted pages, words split across lines (via hyphens), and stamps/handwriting/cosmetic damage to the physical document caused OCR errors. However, these errors did not have any material impact in our particular dataset.

Challenge: Textract was excellent at identifying and reading tables! But even seemingly standardized documents (like financial statements) come in various shapes and forms, with differing terminology.

Figure 5: Examples of varying formats for the same type of table — a cash flow statement.

Tables contain a lot of implied structural content, often requiring advanced reasoning to understand.

It’s a much more difficult task than differentiating commas vs. decimals, and we also need to replicate this reasoning programmatically.

Solution: Since our main goal was to automatically find specific features, we needed to review examples of these features being extracted by a legal expert.

Our client annotated a small set of legal cases to use as a reference/training dataset. We generalized these examples into heuristics, collected more annotated data, and improved them over time.
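To make "heuristic" concrete, here is an illustrative example of the kind of rule we would generalize from annotated cases: mapping differently worded row labels in a cash flow statement onto one canonical feature name. The synonym lists below are invented for illustration, not taken from the real dataset.

```python
# Illustrative heuristic: map differently worded table row labels onto one
# canonical feature name. Synonym lists are made up for illustration.

CANONICAL_LINE_ITEMS = {
    "net_cash_from_operations": [
        "net cash provided by operating activities",
        "cash flows from operating activities",
        "net cash generated from operations",
    ],
    "net_income": [
        "net income",
        "net earnings",
        "profit for the year",
    ],
}

def match_line_item(row_label: str):
    """Return the canonical feature name for a table row label, or None."""
    label = " ".join(row_label.lower().split())  # normalize case and whitespace
    for feature, synonyms in CANONICAL_LINE_ITEMS.items():
        if any(s in label for s in synonyms):
            return feature
    return None
```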

In general, successful OCR solutions are tailored to your feature extraction use cases.

Just as we would not expect an untrained layperson to read a medical chart, we would not expect our programs to read a financial statement without an impractical amount of training.

Moving to Production

Productionization generally entails a lot of work on robustness, performance, and polish. From the very first prototype, we gradually fortified our code, so that the final prototype was already nearly production-ready.

File Storage

We had a lot of files to contend with: downloaded legal case files, highly granular OCR outputs, and documents in normalized form. This justified dedicated blob storage, for which we chose S3.

Web Scrapers

We wrote web scrapers for each legal firm website using Puppeteer, a powerful headless browser automation library.

Although it is far more resource-intensive than most scraping frameworks, Puppeteer also most faithfully reproduces the behaviour of a real user.

The scrapers are run weekly via cron. This was more than sufficient for our use case based on the expected frequency of data being posted and updated.

The scraped results were stored in S3 and organized in a database.

Textract Considerations

AWS Textract is useful but expensive, especially if your documents contain forms and tables. If we were to reprocess documents every time we scraped them, costs would explode.

To prevent this, we use content-addressing all the way through the pipeline. Changed documents are automatically reprocessed, and unchanged documents are left as-is.
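In practice, content-addressing can be as simple as keying every stored artifact by a hash of its input, then checking for an existing result before calling Textract. A sketch, with placeholder bucket and key names:

```python
# Sketch of the content-addressing idea: key Textract output by the SHA-256
# hash of the source PDF, so unchanged documents are never reprocessed.
# Bucket name and key prefix are placeholders.
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "example-ocr-pipeline"  # placeholder

def content_key(pdf_bytes: bytes) -> str:
    return "textract-output/" + hashlib.sha256(pdf_bytes).hexdigest() + ".json"

def already_processed(pdf_bytes: bytes) -> bool:
    """True if Textract output for this exact document already exists in S3."""
    try:
        s3.head_object(Bucket=BUCKET, Key=content_key(pdf_bytes))
        return True
    except s3.exceptions.ClientError:
        return False
```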

Keeping it robust

Web scrapers can be brittle. During the prototyping phase, frequent changes to the legal firm websites would affect the entire extraction process. This is unacceptable in a production-grade application.

We first separated the pipeline’s components so that each could run in isolation. Each component was then written as a fully re-entrant, pure function of the previous component’s output.

With this in place, we can safely interrupt and restart any pipeline stage, and content-addressable caching ensures we only recompute what’s absolutely necessary.
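Conceptually, each stage looks something like the sketch below. The in-memory store stands in for our S3-backed cache, and the names are illustrative stand-ins rather than our actual code.

```python
# Sketch of one re-entrant pipeline stage: a pure function of the previous
# stage's output, cached under a key derived from a hash of that input.
import hashlib

class DictStore:
    """Tiny in-memory stand-in for a content-addressed blob store."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value

def run_stage(stage_name: str, compute, input_blob: bytes, store) -> bytes:
    """Run one stage, reusing the cached result when the input is unchanged."""
    key = f"{stage_name}/{hashlib.sha256(input_blob).hexdigest()}"
    cached = store.get(key)
    if cached is not None:
        return cached              # interrupted or repeated runs pick up here
    result = compute(input_blob)   # pure: depends only on input_blob
    store.put(key, result)
    return result
```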

Looking Ahead

Building a machine learning (ML) model sounds glamorous. In practice, there’s a lot of unsexy work that underlies it. We haven’t even gotten to the recommendation engine yet!

Our toughest task in this project was ensuring that the data extracted by our pipeline was accurate. It wasn’t glamorous or pretty.

Much of it included manually reviewing OCR outputs, building out different feature-finding heuristics, verifying the correctness of our results, and making incremental improvements.

Fortunately, this work pays off. High quality data is a prerequisite to a properly trained ML model. If you base your model on low-quality input data, its outputs will be useless, and in some cases, disastrous.

Or as they say, “garbage in, garbage out.”

Hypotenuse Labs is an elite team of software consultants. Hailing from Facebook, Amazon, Uber, and Snap, we specialize in delivering web and AI software products for startups and SMBs.

If you’re scared of dirty data doing you dirty, contact us at hello@hypotenuselabs.com.

