An OCR Pipeline for Machine Learning: the Good, the Bad, and the Ugly

Introduction

We’ve met many companies struggling to manage huge libraries of paper documents. These libraries are a treasure trove of messy but important business information:

  • Financial statements.
  • Patient records.
  • Even forms filled out by customers.
Figure 1: Example of a page from a scanned document.

What’s OCR? And what’s a pipeline?

Optical Character Recognition (OCR) is the process of converting images or scanned documents into raw text or other structured output.
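As a quick illustration (not our production code), here is a minimal text-extraction sketch using tesseract.js, a JavaScript port of the Tesseract engine; the file name and language code are placeholders:

```typescript
import Tesseract from "tesseract.js";

// Run OCR on a scanned page image; "eng" selects the English model.
async function extractText(imagePath: string): Promise<string> {
  const { data } = await Tesseract.recognize(imagePath, "eng");
  return data.text; // the raw extracted text
}

// "scan.png" is a placeholder file name.
extractText("scan.png").then(console.log);
```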

Figure 2: Raw text extracted from the scanned page in Figure 1.

Beyond raw text, an OCR system’s structured output can include the coordinates of each detected word on the scanned document (see the sketch after this list). With that output in hand, you can:

  • Index and search through the content of documents efficiently.
  • Integrate scanned documents with other existing datasets.
  • Classify scanned pages/documents based on the text they contain.
  • Apply Natural Language Processing (NLP) techniques to analyze the textual data for key features/patterns.
  • Automatically summarize documents.
  • And more!
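Here is a sketch of the word-coordinate output mentioned above, again with tesseract.js. Note that the exact shape of the result (data.words here) varies between library versions, so treat this as illustrative:

```typescript
import Tesseract from "tesseract.js";

async function wordBoxes(imagePath: string): Promise<void> {
  const { data } = await Tesseract.recognize(imagePath, "eng");
  // Each recognized word carries a pixel-space bounding box. The
  // data.words shape follows older tesseract.js releases; newer
  // versions expose the same information under data.blocks.
  for (const word of data.words) {
    const { x0, y0, x1, y1 } = word.bbox;
    console.log(`"${word.text}" at (${x0}, ${y0})-(${x1}, ${y1})`);
  }
}
```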

A Project

Our client approached us to build an intelligent search and recommendation engine for a specific legal domain.

Figure 3: OCR pipeline architecture diagram.
  1. Download case files from each legal firm’s website (using the web scrapers described below).
  2. Process these files using an OCR system, which outputs the document’s text and structure.
  3. Post-process the OCR output to correct common errors and generate normalized documents.
  4. Extract important legal information (features) from the document text and structure, storing the results in the case database. (A glue-code sketch follows this list.)
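Wired together, the pipeline is little more than glue code. The sketch below is our own simplification; the type shapes and stub functions are placeholders, not the project’s real API:

```typescript
interface OcrResult { text: string; blocks: unknown[] }  // OCR step output
interface NormalizedDoc { text: string }                 // post-processing output

// Placeholder implementations; each would wrap a real service in practice.
async function runOcr(pdf: Buffer): Promise<OcrResult> {
  return { text: "", blocks: [] };
}
function postProcess(raw: OcrResult): NormalizedDoc {
  return { text: raw.text };
}
function extractFeatures(doc: NormalizedDoc): Record<string, string> {
  return {};
}

async function processCaseFile(pdf: Buffer, caseId: string): Promise<void> {
  const raw = await runOcr(pdf);          // OCR: text + structure
  const doc = postProcess(raw);           // correct common errors, normalize
  const features = extractFeatures(doc);  // pull out legal features
  console.log(caseId, features);          // stand-in for the case database write
}
```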

Prototyping

Before building the full pipeline, the first task was to test the feasibility of the OCR System and Feature Extractor components. We trialed three OCR engines:

  • Tesseract OCR.
  • Microsoft Computer Vision API.
  • Amazon Textract.
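For example, a first smoke test against Textract’s synchronous text-detection API might look roughly like this (the region and file name are placeholders, and the synchronous call only handles single-page inputs):

```typescript
import { TextractClient, DetectDocumentTextCommand } from "@aws-sdk/client-textract";
import { readFile } from "node:fs/promises";

const textract = new TextractClient({ region: "us-east-1" }); // placeholder region

async function detectLines(imagePath: string): Promise<string[]> {
  const bytes = await readFile(imagePath);
  const result = await textract.send(
    new DetectDocumentTextCommand({ Document: { Bytes: bytes } })
  );
  // LINE blocks carry the detected text, one entry per printed line.
  return (result.Blocks ?? [])
    .filter((block) => block.BlockType === "LINE")
    .map((block) => block.Text ?? "");
}
```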
Figure 4: Low-quality scan of a table. AWS Textract misclassified commas as periods in values such as “1,371”.
Interpreting a value like that correctly means answering questions such as these (a toy heuristic follows Figure 5):

  • Is the value a subtotal? If so, we could check it against the sum of the numbers directly above it.
  • Is the value abbreviated? Financial statement values are sometimes written in thousands or millions of dollars, and this fact may not even be present on the page.
  • What part of the world was the document written in? Conventions for commas vs. periods in numbers differ by region.
  • How many digits are after the punctuation? If this is a dollar value, there are conventionally two digits after the decimal point to represent cents.
Figure 5: Examples of varying formats for the same type of table — a cash flow statement.
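To make the last point concrete, here is a toy heuristic of our own (not the project’s actual rules) that uses only the digits-after-punctuation signal to decide whether Textract’s “1.371” should really be “1,371”:

```typescript
// Decide whether a period in an OCR'd number is a decimal point or a
// misread thousands separator, using only the trailing digit count.
function disambiguateNumber(token: string): string {
  const match = token.match(/^(\d{1,3})\.(\d+)$/);
  if (!match) return token;
  const fraction = match[2];
  if (fraction.length === 2) return token;                  // looks like cents: keep it
  if (fraction.length === 3) return token.replace(".", ","); // likely a misread comma
  return token; // ambiguous: leave it for human review
}

console.log(disambiguateNumber("1.371")); // -> "1,371"
console.log(disambiguateNumber("12.50")); // -> "12.50"
```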

Moving to Production

Productionization generally entails a lot of work on robustness, performance, and polish. We gradually fortified our code from the very first prototype, so the final prototype was already nearly production-ready.

File Storage

We had a lot of files to contend with: downloaded legal case files, highly granular OCR outputs, and documents in normalized form. This justified dedicated blob storage, for which we chose S3.
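Storage itself is straightforward. In the sketch below, the bucket name and key scheme are illustrative, not our client’s actual layout:

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" }); // placeholder region

// Persist one page's OCR output as JSON, one object per page.
async function storeOcrOutput(caseId: string, page: number, json: string): Promise<void> {
  await s3.send(
    new PutObjectCommand({
      Bucket: "legal-ocr-artifacts",          // placeholder bucket name
      Key: `ocr/${caseId}/page-${page}.json`, // placeholder key scheme
      Body: json,
      ContentType: "application/json",
    })
  );
}
```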

Web Scrapers

We wrote web scrapers for each legal firm website using Puppeteer, a powerful headless browser automation library.
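Each site needed its own logic, but the skeleton of such a scraper is short. In this sketch the listing URL and CSS selector are invented for illustration:

```typescript
import puppeteer from "puppeteer";

// Collect document links from a (hypothetical) case-listing page.
async function scrapeCaseFileLinks(listingUrl: string): Promise<string[]> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(listingUrl, { waitUntil: "networkidle2" });
    // "a.case-file" is a made-up selector; every firm's markup differed.
    return await page.$$eval("a.case-file", (links) =>
      links.map((link) => (link as HTMLAnchorElement).href)
    );
  } finally {
    await browser.close();
  }
}
```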

Textract Considerations

AWS Textract is useful but expensive, especially if your documents contain forms and tables. If we were to reprocess documents every time we scraped them, costs would explode.
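One way to keep that bill in check (sketched here as a general pattern, not our exact implementation) is to key processed results by a hash of the file’s contents, so an unchanged document is never sent to Textract twice:

```typescript
import { createHash } from "node:crypto";

// In-memory cache for brevity; in practice this mapping would live in S3
// or a database so it survives between pipeline runs.
const processedDocs = new Map<string, string>();

async function analyzeOnce(
  bytes: Buffer,
  analyze: (bytes: Buffer) => Promise<string> // wraps the Textract call
): Promise<string> {
  const key = createHash("sha256").update(bytes).digest("hex");
  const cached = processedDocs.get(key);
  if (cached !== undefined) return cached; // unchanged file: skip Textract
  const result = await analyze(bytes);     // new or changed file: pay once
  processedDocs.set(key, result);
  return result;
}
```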

Keeping It Robust

Web scrapers can be brittle. During the prototyping phase, frequent changes to the legal firm websites would break the entire extraction process. That is unacceptable in a production-grade application.
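One defensive tactic (sketched here as a general pattern, not our exact implementation) is to make each scraper assert the page shape it depends on, so a site redesign fails loudly instead of silently feeding garbage downstream:

```typescript
import type { Page } from "puppeteer";

// Throw immediately if an expected element is missing from the page.
async function requireSelector(page: Page, selector: string): Promise<void> {
  const handle = await page.$(selector);
  if (handle === null) {
    throw new Error(`Scraper contract broken: "${selector}" not found on ${page.url()}`);
  }
}
```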

Looking Ahead

Building a machine learning (ML) model sounds glamorous. In practice, there’s a lot of unsexy work that underlies it. We haven’t even gotten to the recommendation engine yet!
