An OCR Pipeline for Machine Learning: the Good, the Bad, and the Ugly

Introduction

  • Legal documents.
  • Financial statements.
  • Patient records.
  • Even forms filled out by customers.
Figure 1: Example of a page from a scanned document.

What’s OCR? And what’s a pipeline?

Figure 2: Raw text extracted from the scanned page in Figure 1.
  • Confidence scores for the results.
  • The coordinates of each detected word on the scanned document.
  • Automating/streamlining administrative processes.
  • Indexing and searching through the content of documents efficiently.
  • Integrating scanned documents with other existing datasets.
  • Classifying scanned pages/documents based on the text they contain.
  • Applying Natural Language Processing (NLP) techniques to analyze the textual data for key features/patterns.
  • Automatically summarizing documents.
  • And more!
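To make the OCR output described above more concrete (text plus a confidence score and page coordinates for each word), here is a minimal sketch in Python. The `OcrWord` structure, its field names, the 0.0 to 1.0 confidence range, and both helper functions are assumptions made for illustration; they do not mirror any particular OCR engine's schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrWord:
    """One detected word (hypothetical structure, not a real engine's schema)."""
    text: str
    confidence: float  # assumed to lie between 0.0 and 1.0
    left: float        # bounding box, as fractions of the page width/height
    top: float
    width: float
    height: float


def low_confidence_words(words: List[OcrWord], threshold: float = 0.8) -> List[OcrWord]:
    """Return the words whose confidence falls below the threshold."""
    return [w for w in words if w.confidence < threshold]


def reading_order(words: List[OcrWord], line_tolerance: float = 0.01) -> List[OcrWord]:
    """Sort words top-to-bottom, then left-to-right.

    Words whose `top` values fall in the same bucket of size `line_tolerance`
    are treated as one line. This is a crude heuristic: two words on the same
    line can still straddle a bucket boundary.
    """
    return sorted(words, key=lambda w: (round(w.top / line_tolerance), w.left))
```

Low-confidence words are natural candidates for manual review or post-processing, while the coordinates let downstream code rebuild lines, columns, and tables from what would otherwise be a flat list of words.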

A Project

Figure 3: OCR pipeline architecture diagram.
  1. Scrape for publicly available case files in the client’s domain. These are typically scanned PDF documents published online.
  2. Process these files using an OCR system, which outputs the document’s text and structure.
  3. Post-process OCR output to correct common errors and generate normalized documents.
  4. Extract important legal information (features) from the document text and structure, storing results in the case database.
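The four steps above can be sketched as a small driver function. This is an illustration under stated assumptions, not the project's actual code: `CaseDocument`, `run_pipeline`, and the three stage callables are hypothetical names, the scraper (step 1) is assumed to have already produced a list of URLs, and a plain Python list stands in for the case database.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class CaseDocument:
    """Hypothetical record for one scraped case file."""
    source_url: str
    raw_text: str = ""
    normalized_text: str = ""
    features: Dict[str, str] = field(default_factory=dict)


def run_pipeline(
    urls: Iterable[str],                               # step 1: output of the scraper
    ocr: Callable[[str], str],                         # step 2: URL -> raw OCR text
    post_process: Callable[[str], str],                # step 3: raw text -> normalized text
    extract_features: Callable[[str], Dict[str, str]]  # step 4: text -> features
) -> List[CaseDocument]:
    """Run every scraped file through OCR, post-processing, and feature extraction."""
    database: List[CaseDocument] = []  # stand-in for the case database
    for url in urls:
        doc = CaseDocument(source_url=url)
        doc.raw_text = ocr(url)
        doc.normalized_text = post_process(doc.raw_text)
        doc.features = extract_features(doc.normalized_text)
        database.append(doc)
    return database
```

Passing each stage in as a callable keeps the stages independent, so an OCR engine or a post-processing rule can be swapped out or tested in isolation without touching the rest of the pipeline.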

Prototyping

  • Google Cloud Vision OCR.
  • Tesseract OCR.
  • Microsoft Computer Vision API.
  • Amazon Textract.
Figure 4: Low-quality scan of a table. Amazon Textract misread commas as periods in values such as “1,371”.
  • Is the number inside a financial statement? How will our Feature Extractor program know if it’s looking at a financial statement?
  • Is the value a subtotal? Should we look at the sum of the numbers directly above it?
  • Is the value abbreviated? Financial statement values are sometimes written in thousands or millions of dollars, and this fact may not even be present on the page.
  • What part of the world was the document written in? There are differing conventions used for commas vs. periods in numbers.
  • How many digits are after the punctuation? If this is a dollar value, there are conventionally two digits after the decimal point to represent cents.
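A couple of the questions above can be turned into a simple rule of thumb. The sketch below covers only the last one (digits after the punctuation) and assumes a locale where commas separate thousands and periods mark the decimal point; `parse_amount` is a hypothetical helper, not part of any OCR library. It returns `None` rather than guessing when a token stays ambiguous.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_amount(token: str, decimal_places: int = 2) -> Optional[Decimal]:
    """Interpret an OCR'd numeric token, tolerating comma/period confusion.

    Assumes comma = thousands separator and period = decimal point, and that
    amounts with a fractional part always show `decimal_places` digits.
    """
    cleaned = token.strip().replace(" ", "")
    if not re.fullmatch(r"\d+([.,]\d+)*", cleaned):
        return None  # not a plain number

    parts = re.split(r"[.,]", cleaned)
    if len(parts) == 1:
        return Decimal(parts[0])

    last = parts[-1]
    if len(last) == decimal_places:
        # Final group looks like cents: everything before it is the whole part.
        return Decimal("".join(parts[:-1]) + "." + last)

    if 1 <= len(parts[0]) <= 3 and all(len(p) == 3 for p in parts[1:]):
        # Every group after the first has three digits, so each separator is
        # treated as a thousands separator, even if OCR read it as a period.
        return Decimal("".join(parts))

    return None  # ambiguous: leave it for context-aware logic or manual review
```

Under these assumptions, both `"1,371"` and the misread `"1.371"` parse to 1371, while `"12.5"` is flagged as ambiguous. A real feature extractor would also need the document-level context from the other questions, such as the region of origin or whether values are stated in thousands.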
Figure 5: Examples of varying formats for the same type of table — a cash flow statement.

Moving to Production

File Storage

Web Scrapers

Textract Considerations

Keeping It Robust

Looking Ahead

Hypotenuse Labs

Building incredible web, AI, and blockchain solutions since 2018.