OCR-GPT pipeline

The OCR-GPT pipeline was developed to improve text data preparation for the IUROPA project by combining Adobe’s OCR for better paragraph recognition and GPT-4 for correcting character recognition errors in older EU court judgments.


Abstract

The OCR-GPT pipeline was developed to assist researchers in the IUROPA project, at the University of Gothenburg and at the ARENA Centre for European Studies at the Faculty of Social Sciences, in the text data preparation process. In the context of the IUROPA project, we were tasked with cleaning a large corpus of text data consisting of older judgments from the Court of Justice of the EU, which were only available in scanned PDF format. The original OCR (Optical Character Recognition) processing solution using Google Vision was unreliable in parsing paragraph order, especially in the case of unstructured or multi-column text. The researchers initially wanted to investigate the application of a Large Language Model (LLM) for fixing this problem. However, after several rounds of tests, we realized that the LLM, despite being effective at fixing character and word recognition issues, failed to correct the paragraph parsing errors introduced by the OCR process. After investigating many OCR tools, we discovered that the Adobe OCR engine performs better at layout and paragraph recognition and solves that problem, albeit at the expense of character recognition accuracy. Those character recognition errors could, however, be corrected efficiently afterwards with a GPT model. Hence, the final solution for this project was a pipeline that combines OCR using the Adobe engine with LLM-based text correction using GPT-4.


Background

For more background information on this project, visit this website.


Methodology

One of the initial solutions at IUROPA to the paragraph layout recognition problem was to clean the data manually using a web interface. This produced a considerable amount of corrected text, but it proved too slow and costly to apply to the entire dataset. As an alternative, the researchers in IUROPA wanted to investigate using a GPT model to fix the paragraph layout recognition problem. To test this idea, we ran 50 cases of the Google Vision OCR output through the GPT-4 model and compared both outputs with the manually corrected text (the ground truth). As shown in figure 1, although the GPT-4 process reduced the edit distance in many cases, for a considerable number of cases the edit distance remained significantly high.

Figure 1. Normalized edit distance of the Google Vision output before (red) and after GPT-4 treatment (blue), compared with the manually corrected text (ground truth).
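For reference, the normalized edit distance used in this comparison can be computed as in the sketch below. It assumes the metric is Levenshtein distance divided by the length of the longer string; the function names are illustrative, not taken from the project code.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein (edit) distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(ocr_text: str, ground_truth: str) -> float:
    """0.0 means the texts are identical; values near 1.0 mean very different."""
    if not ocr_text and not ground_truth:
        return 0.0
    return levenshtein(ocr_text, ground_truth) / max(len(ocr_text), len(ground_truth))
```

Computing this score once per document, before and after the GPT-4 treatment, gives the kind of per-case distances plotted in figure 1.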

 

After a deeper inquiry into the results of the first test, we realized that applying the GPT-4 model works well for correcting character recognition issues but fails to solve the primary problem, i.e. the paragraph ordering and layout in cases of unstructured or multi-column text. Upon studying several alternatives, we found that the Adobe OCR engine is superior at detecting paragraph order and layout, while Google Vision’s character recognition is more accurate. This discovery led us to the final solution, where we replaced Google Vision with Adobe OCR. The task of recovering the correct spelling from the text distorted by Adobe OCR was then given to a GPT-4 model. As shown in figure 2, the edit distance between the output of the Adobe OCR + GPT-4 pipeline and the ground truth is significantly lower than for the Google Vision output.

Figure 2. Comparison of edit distance to the ground truth for the Google Vision pipeline (blue) and the Adobe OCR + GPT-4 pipeline (green).
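To give a concrete picture of this correction step, a single-document call might look like the sketch below. It assumes the OpenAI Python SDK (v1); the prompt wording is an illustrative guess, not the exact prompt used in the project.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt; the project's actual instructions may differ.
SYSTEM_PROMPT = (
    "You are a proofreader. The user message is raw OCR output from a court "
    "judgment. Fix character and word recognition errors only; do not "
    "rephrase, reorder, or summarize the text."
)


def correct_ocr_text(ocr_text: str) -> str:
    """Ask GPT-4 to repair character/word recognition errors in one document."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output suits a correction task
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ocr_text},
        ],
    )
    return response.choices[0].message.content
```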

The OCR-GPT pipeline consists of an automatic OCR process using Adobe’s “Actions”, followed by GPT-4 processing of the text to correct character and word recognition errors. The GPT-4 processing was carried out through the OpenAI API, set up in collaboration with the IT Department at the University of Oslo. By fully utilizing the API’s capacity for parallel calls, we processed 50,000 pages of documents within 12 days (5 days for OCR and 7 days for GPT-4 processing). This pipeline yielded data of significantly higher quality for further analysis by researchers in the IUROPA project.
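The parallel processing mentioned above could be organized along the lines of the following sketch, which uses the SDK’s async client with a semaphore to cap concurrency. The concurrency limit and the prompt are placeholders, and long judgments would in practice need to be split into chunks that fit the model’s context window.

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(20)  # placeholder cap on concurrent requests

# Placeholder prompt, as in the earlier sketch.
SYSTEM_PROMPT = (
    "Fix character and word recognition errors in the OCR text; "
    "do not rephrase or reorder it."
)


async def correct_one(ocr_text: str) -> str:
    """One GPT-4 correction call, rate-limited by the semaphore."""
    async with semaphore:
        response = await client.chat.completions.create(
            model="gpt-4",
            temperature=0,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": ocr_text},
            ],
        )
        return response.choices[0].message.content


async def correct_corpus(pages: list[str]) -> list[str]:
    """Correct a batch of OCR pages concurrently."""
    return await asyncio.gather(*(correct_one(p) for p in pages))

# Example usage: corrected = asyncio.run(correct_corpus(ocr_pages))
```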

Published Aug. 29, 2024 2:54 PM - Last modified Aug. 29, 2024 3:05 PM