1. Introduction
Building a structured OCR system for newspapers is not a simple task. Unlike books or plain documents, newspaper scans are messy: often noisy, skewed, and low resolution.
Traditional OCR tools struggle with such complex layouts.
Newspapers also do not follow a standard layout. They use multiple columns, captions, mixed fonts, and articles that jump across pages.
As a result, tools like Tesseract often return jumbled, unstructured text. These tools read line by line, without understanding the context.
But what if you need structured data such as titles, authors, dates, or page numbers? Raw text is simply not enough.
To solve this, we will combine YOLOX for detecting layout blocks with a vision LLM for intelligent text extraction.
This modern OCR pipeline turns digitized pages into clean, structured JSON, with each block labeled and correctly ordered.
This blog post walks you through building a structured OCR system for newspapers using modern AI tools.
Let’s dive in.
2. Project overview: structured OCR for newspapers
This project extracts structured content from digitized newspaper pages. The system detects layout blocks, such as titles, captions, and article bodies, then reads the text using AI.
Here’s how it works:
- A user uploads a newspaper image.
- The system detects blocks like titles, subheadings, text, and captions using YOLOX.
- Each block is sent to an OCR engine:
  - EasyOCR for simpler content
  - Vision LLM for dense or complex regions
- The extracted text is grouped and labeled.
- A clean, structured JSON file is returned.
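The steps above can be sketched as a single orchestration function. This is a minimal sketch with stubbed detection and OCR calls; `detect_blocks` and the two engine functions stand in for the real YOLOX and OCR integrations described later in the post:

```python
def detect_blocks(image):
    """Stub for YOLOX inference: returns (box, label, confidence) tuples."""
    return [((10, 5, 300, 40), "title", 0.96),
            ((10, 50, 300, 400), "text", 0.91)]

def easyocr_read(image, box):
    """Stub for EasyOCR on a cropped region."""
    return f"easyocr text from {box}"

def vision_llm_read(image, box, label):
    """Stub for a vision-LLM call with a block-specific prompt."""
    return f"llm text for {label}"

def run_pipeline(image):
    """Detect layout blocks, route each to an engine, return labeled output."""
    result = {}
    for box, label, conf in detect_blocks(image):
        if label in ("title", "caption"):      # stylized or complex regions
            text = vision_llm_read(image, box, label)
        else:                                  # clean body text
            text = easyocr_read(image, box)
        result[label] = text
    return result
```

In the real system, `detect_blocks` wraps the trained YOLOX model and the two readers call EasyOCR and the vision LLM.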
This JSON can be used for search, digital archiving, or queryable databases. It is both machine-readable and easy to understand.
Key components
- YOLOX – for object detection and layout analysis
- EasyOCR / vision LLM – for flexible text extraction
- Python 3.10 – with a .env file for managing API keys
This system can run locally or on a small server. A GPU helps, but it is not strictly required for testing.
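Loading the `.env` file can be done with a few lines of standard-library Python; this is a minimal stand-in for libraries like `python-dotenv`, and the key name `VISION_LLM_API_KEY` is an illustrative placeholder:

```python
import os

def load_env(path=".env"):
    """Read KEY=VALUE lines into os.environ, skipping comments and blanks.

    Existing environment variables are not overwritten.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Usage (key name is a placeholder):
# load_env()
# api_key = os.environ["VISION_LLM_API_KEY"]
```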
3. Training YOLOX for structured OCR on newspapers
Before running the pipeline, you will need to train a custom YOLOX model that can detect newspaper block types.
3.1 Create a virtual environment
Use Python 3.10.13:
python3.10 -m venv .venv
source .venv/bin/activate    # macOS/Linux
# .venv\Scripts\activate     # Windows
3.2 Install dependencies
First, upgrade pip and install all the required packages:
pip install --upgrade pip
pip install -r requirements.txt
3.3 Creating a newspaper-specific dataset for OCR
Make sure your dataset is annotated in COCO format with relevant classes like:
title, subheading, text_block, caption, author, page_number
The file structure should look like this:
datasets/
├── train2017/
├── val2017/
└── annotations/
├── instances_train2017.json
└── instances_val2017.json
3.4 Configure the YOLOX experiment
Create an experiment file at:
exps/example/custom/newspaper_yolox.py
Define the training parameters such as the number of classes, data paths, and batch size:
self.num_classes = 6
self.data_dir = "datasets"
self.train_ann = "annotations/instances_train2017.json"
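A complete experiment file typically subclasses YOLOX’s `Exp` base class. The sketch below assumes the standard YOLOX repository layout; the extra fields (`val_ann`, `max_epoch`) are illustrative defaults, not values taken from this project:

```python
# exps/example/custom/newspaper_yolox.py (config fragment, requires YOLOX)
import os
from yolox.exp import Exp as MyExp

class Exp(MyExp):
    def __init__(self):
        super().__init__()
        # Six block classes in COCO format:
        # title, subheading, text_block, caption, author, page_number
        self.num_classes = 6
        self.data_dir = "datasets"
        self.train_ann = "annotations/instances_train2017.json"
        self.val_ann = "annotations/instances_val2017.json"
        # Illustrative training default; tune for your dataset and hardware.
        self.max_epoch = 100
        self.exp_name = os.path.split(os.path.realpath(__file__))[1].split(".")[0]
```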
3.5 Start training
Run this command to start training:
python tools/train.py -expn newspaper_yolox -d 1 -b 8 --fp16
-expn: name of your experiment
-d: number of GPUs
-b: batch size
--fp16: enables mixed precision (faster on GPU)
3.6 Save the best model
Once training is complete, use the best checkpoint saved at:
YOLOX_outputs/newspaper_yolox/best_ckpt.pth
4. How the structured OCR for newspapers works
Let’s break down the complete pipeline, from layout detection to structured output.
4.1 Detection of layout blocks with YOLOX
First, the image goes through the trained YOLOX model. It detects different layout components such as:
- Titles and subheadings
- Body text blocks
- Captions and bylines
- Illustrations and page numbers
For each block, YOLOX returns bounding boxes, labels, and confidence scores. These boxes are then cropped to isolate individual regions.
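Post-processing the raw detections can be as simple as filtering by confidence and clipping boxes to the image bounds. A minimal sketch, where the flat `(x1, y1, x2, y2, score, label)` tuple format is an assumption rather than YOLOX’s exact output tensor:

```python
def to_crop_boxes(detections, img_w, img_h, min_conf=0.5):
    """Filter low-confidence detections and clip boxes to image bounds."""
    crops = []
    for x1, y1, x2, y2, score, label in detections:
        if score < min_conf:
            continue  # drop uncertain detections
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(img_w, int(x2)), min(img_h, int(y2))
        crops.append({"box": (x1, y1, x2, y2), "label": label, "score": score})
    return crops
```

Each returned entry can then be used to crop the corresponding region from the page image.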
4.2 Choose the right OCR engine
Next, each cropped block is passed to an OCR engine. Depending on the block’s type and size, we choose:
- EasyOCR: fast and accurate for clean text
- Vision LLM: more powerful for noisy, wrapped, or stylized blocks
This decision can be made automatically with simple logic in your code.
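That routing logic can be a small function. The label groups and the area threshold below are illustrative assumptions, not values from the project:

```python
# Labels routed to the vision LLM because they tend to be stylized or noisy.
COMPLEX_LABELS = {"title", "subheading", "caption"}

def choose_engine(label, box, min_llm_area=40_000):
    """Route a block to 'easyocr' or 'vision_llm' by type and size."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    if label in COMPLEX_LABELS or area >= min_llm_area:
        return "vision_llm"   # accurate but slower and costlier
    return "easyocr"          # fast path for clean body text
```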
4.3 Prompt engineering for better OCR output
To get the most out of the vision-language model, use custom prompts for each block type.
For example:
“Extract the full title of this image. Do not include captions or author names.”
These prompts help the LLM focus on what matters. You can customize the prompts in functions.py for each content type.
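One simple way to organize this is a per-label prompt table with a fallback. The prompt wording below is illustrative:

```python
# Block-specific prompts for the vision LLM (wording is illustrative).
BLOCK_PROMPTS = {
    "title": ("Extract the full title of this image. "
              "Do not include captions or author names."),
    "caption": "Extract only the caption text under the illustration.",
    "author": "Extract the author byline exactly as printed.",
}
DEFAULT_PROMPT = "Extract all readable text from this image."

def prompt_for(label):
    """Pick the prompt matching the detected block type."""
    return BLOCK_PROMPTS.get(label, DEFAULT_PROMPT)
```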
4.4 Structure the output
Once the text is extracted, we group and label each block. This step includes:
- Sorting blocks from top to bottom and left to right
- Matching captions with illustrations
- Linking authors with nearby titles
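The reading-order sort can be approximated by bucketing boxes into rows, then sorting left to right within each row. A minimal sketch, where the pixel row tolerance is an assumed heuristic:

```python
def reading_order(blocks, row_tol=20):
    """Sort blocks top-to-bottom, then left-to-right within a row.

    Each block is a dict with a 'box' of (x1, y1, x2, y2); boxes whose
    top edges fall within row_tol pixels land in the same row bucket.
    """
    def key(b):
        x1, y1, _, _ = b["box"]
        return (y1 // row_tol, x1)
    return sorted(blocks, key=key)
```

For multi-column pages, a column-first grouping works better; this row-first version suits single-column reading order.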
Finally, we create structured JSON:
{
"title": "New Discovery in AI",
"author": "Jane Doe",
"text": "Researchers at XYZ University...",
"caption": "Illustration of the AI model."
}
With YOLOX and a vision LLM, you can finally build a structured OCR system for newspapers that delivers clean, labeled output.
5. Challenges in building structured OCR for newspapers
Building this system was not easy. Here are some real challenges we faced, and how we solved them.
5.1 Complex layouts
Newspapers do not play by the rules. Articles wrap around advertisements. Headlines sit next to unrelated images. To train YOLOX well, we needed many diverse examples.
The key lesson: annotate a wide range of layouts and fonts to get consistent results.
5.2 OCR struggles with noisy scans
Low-quality scans are a real problem. Blurred text and ink spots confused EasyOCR.
Switching to the vision LLM for key blocks (such as titles or captions) considerably improved the results, but it added cost and latency.
5.3 Balancing speed and accuracy
The vision LLM was accurate but slow and costly. So we added a toggle to choose between EasyOCR (fast) and the vision LLM (accurate) depending on the use case.
This way, users could balance performance and quality.
5.4 Annotating the dataset
Manually labeling layout blocks took time, but it was essential. We used tools like Label Studio to speed up annotation.
In the future, pre-trained layout models could help reduce this workload.
5.5 Matching related regions
It was not always easy to connect authors to their articles, or captions to illustrations. We used proximity rules to group nearby blocks, but they were not perfect.
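A proximity rule of this kind can be sketched as a nearest-center match; the squared box-center distance is an illustrative choice of metric:

```python
def match_caption(caption, illustrations):
    """Pair a caption block with the nearest illustration by center distance.

    Each block is a dict with a 'box' of (x1, y1, x2, y2).
    """
    def center(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2, (y1 + y2) / 2)

    cx, cy = center(caption["box"])

    def dist(ill):
        ix, iy = center(ill["box"])
        return (ix - cx) ** 2 + (iy - cy) ** 2

    return min(illustrations, key=dist)
```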
A potential improvement could be to use layout graphs or document-analysis models.
6. Conclusion
OCR for newspapers is difficult, but not impossible. Standard tools alone don’t cut it. You need layout awareness, intelligent extraction, and structured output.
By training YOLOX on newspaper-specific data, we detected meaningful regions such as titles, captions, and bylines. With EasyOCR and a vision LLM, we extracted clean text, even from difficult scans.
The end result? Structured, labeled JSON ready for indexing, search, or digital archives.
Whether you are digitizing archives or automating editorial tasks, this structured OCR pipeline for newspapers is powerful, scalable, and open source.
Thank you for reading! Try the pipeline, improve it, and share your results. We would love to see what you build.