1. Introduction
Building a structured OCR system for newspapers is not a simple task. Unlike books or plain documents, newspaper scans are messy: often noisy, skewed, and low resolution.
Traditional OCR tools struggle with such complex layouts.
Newspapers also do not follow a standard layout. They use multiple columns, captions, mixed fonts, and articles that jump across pages.
As a result, tools like Tesseract often return jumbled, unstructured text. These tools read line by line, without understanding the context.
But what if you need structured data such as titles, authors, dates, or page numbers? Raw text is simply not enough.
To solve this, we will combine YOLOX for detecting layout blocks with a vision LLM for intelligent text extraction.
This modern OCR pipeline turns digitized pages into clean, structured JSON, with each block labeled and correctly ordered.
This blog post walks you through building a structured OCR system for newspapers using modern AI tools.
Let’s dive in.
2. Project overview: structured OCR for newspapers
This project extracts structured content from digitized newspaper pages. The system detects layout blocks, such as titles, captions, and article bodies, then reads the text using AI.
Here’s how it works:
- A user uploads a newspaper image.
- The system detects blocks like titles, subheadings, text, and captions using YOLOX.
- Each block is sent to an OCR engine:
  - EasyOCR for simpler content
  - Vision LLM for dense or complex regions
- The extracted text is grouped and labeled.
- A clean, structured JSON file is returned.
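The steps above can be sketched as a single orchestration function. This is a minimal sketch with stubbed detection and OCR calls; `detect_blocks` and the two engine functions stand in for the real YOLOX and OCR integrations described later in the post:

```python
def detect_blocks(image):
    """Stub for YOLOX inference: returns (box, label, confidence) tuples."""
    return [((10, 5, 300, 40), "title", 0.96),
            ((10, 50, 300, 400), "text", 0.91)]

def easyocr_read(image, box):
    """Stub for EasyOCR on a cropped region."""
    return f"easyocr text from {box}"

def vision_llm_read(image, box, label):
    """Stub for a vision-LLM call with a block-specific prompt."""
    return f"llm text for {label}"

def run_pipeline(image):
    """Detect layout blocks, route each to an engine, return labeled output."""
    result = {}
    for box, label, conf in detect_blocks(image):
        if label in ("title", "caption"):      # stylized or complex regions
            text = vision_llm_read(image, box, label)
        else:                                  # clean body text
            text = easyocr_read(image, box)
        result[label] = text
    return result
```

In the real system, `detect_blocks` wraps the trained YOLOX model and the two readers call EasyOCR and the vision LLM.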
This JSON can be used for search, digital archiving, or queryable databases. It is both machine-readable and easy to understand.
Key components
- YOLOX – for object detection and layout analysis
- EasyOCR / vision LLM – for flexible text extraction
- Python 3.10 – with a .env file for managing API keys
This system can run locally or on a small server. A GPU helps, but it is not strictly required for testing.
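Loading the `.env` file can be done with a few lines of standard-library Python; this is a minimal stand-in for libraries like `python-dotenv`, and the key name `VISION_LLM_API_KEY` is an illustrative placeholder:

```python
import os

def load_env(path=".env"):
    """Read KEY=VALUE lines into os.environ, skipping comments and blanks.

    Existing environment variables are not overwritten.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Usage (key name is a placeholder):
# load_env()
# api_key = os.environ["VISION_LLM_API_KEY"]
```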
3. Training YOLOX for structured OCR on newspapers
Before running the pipeline, you will need to train a custom YOLOX model that can detect newspaper block types.
3.1 Create a virtual environment
Use Python 3.10.13:
python3.10 -m venv .venv
source .venv/bin/activate    # macOS/Linux
# .venv\Scripts\activate     # Windows
3.2 Install dependencies
First, upgrade pip and install all the required packages:
pip install --upgrade pip
pip install -r requirements.txt
3.3 Creating a newspaper-specific dataset for OCR
Make sure your dataset is annotated in COCO format with relevant classes like:
title, subheading, text_block, caption, author, page_number
The file structure should look like this:
datasets/
├── train2017/
├── val2017/
└── annotations/
├── instances_train2017.json
└── instances_val2017.json
3.4 Configure the YOLOX experiment
Create an experiment file at:
exps/example/custom/newspaper_yolox.py
Define the training parameters such as the number of classes, data paths, and batch size:
self.num_classes = 6
self.data_dir = "datasets"
self.train_ann = "annotations/instances_train2017.json"
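A complete experiment file typically subclasses YOLOX’s `Exp` base class. The sketch below assumes the standard YOLOX repository layout; the extra fields (`val_ann`, `max_epoch`) are illustrative defaults, not values taken from this project:

```python
# exps/example/custom/newspaper_yolox.py (config fragment, requires YOLOX)
import os
from yolox.exp import Exp as MyExp

class Exp(MyExp):
    def __init__(self):
        super().__init__()
        # Six block classes in COCO format:
        # title, subheading, text_block, caption, author, page_number
        self.num_classes = 6
        self.data_dir = "datasets"
        self.train_ann = "annotations/instances_train2017.json"
        self.val_ann = "annotations/instances_val2017.json"
        # Illustrative training default; tune for your dataset and hardware.
        self.max_epoch = 100
        self.exp_name = os.path.split(os.path.realpath(__file__))[1].split(".")[0]
```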
3.5 Start training
Run this command to start training:
python tools/train.py -expn newspaper_yolox -d 1 -b 8 --fp16
-expn: name of your experiment
-d: number of GPUs
-b: batch size
--fp16: enables mixed precision (faster on GPU)
3.6 Save the best model
Once training is complete, use the best checkpoint saved at:
YOLOX_outputs/newspaper_yolox/best_ckpt.pth
4. How the structured OCR for newspapers works
Let’s break down the complete pipeline, from layout detection to structured output.
4.1 Detection of layout blocks with YOLOX
First, the image goes through the trained YOLOX model. It detects different layout components such as:
- Titles and subheadings
- Body text blocks
- Captions and bylines
- Illustrations and page numbers
For each block, YOLOX returns bounding boxes, labels, and confidence scores. These boxes are then cropped to isolate individual regions.
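Post-processing the raw detections can be as simple as filtering by confidence and clipping boxes to the image bounds. A minimal sketch, where the flat `(x1, y1, x2, y2, score, label)` tuple format is an assumption rather than YOLOX’s exact output tensor:

```python
def to_crop_boxes(detections, img_w, img_h, min_conf=0.5):
    """Filter low-confidence detections and clip boxes to image bounds."""
    crops = []
    for x1, y1, x2, y2, score, label in detections:
        if score < min_conf:
            continue  # drop uncertain detections
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(img_w, int(x2)), min(img_h, int(y2))
        crops.append({"box": (x1, y1, x2, y2), "label": label, "score": score})
    return crops
```

Each returned entry can then be used to crop the corresponding region from the page image.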
4.2 Choose the right OCR engine
Next, each cropped block is passed to an OCR engine. Depending on the block’s type and size, we choose:
- EasyOCR: fast and accurate for clean text
- Vision LLM: more powerful for noisy, wrapped, or stylized blocks
This decision can be made automatically with simple logic in your code.
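That routing logic can be a small function. The label groups and the area threshold below are illustrative assumptions, not values from the project:

```python
# Labels routed to the vision LLM because they tend to be stylized or noisy.
COMPLEX_LABELS = {"title", "subheading", "caption"}

def choose_engine(label, box, min_llm_area=40_000):
    """Route a block to 'easyocr' or 'vision_llm' by type and size."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    if label in COMPLEX_LABELS or area >= min_llm_area:
        return "vision_llm"   # accurate but slower and costlier
    return "easyocr"          # fast path for clean body text
```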
4.3 Prompt engineering for better OCR output
To get the most out of the vision-language model, use custom prompts for each block type.
For example:
“Extract the full title of this image. Do not include captions or author names.”
These prompts help the LLM focus on what matters. You can customize the prompts in functions.py for each content type.
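One simple way to organize this is a per-label prompt table with a fallback. The prompt wording below is illustrative:

```python
# Block-specific prompts for the vision LLM (wording is illustrative).
BLOCK_PROMPTS = {
    "title": ("Extract the full title of this image. "
              "Do not include captions or author names."),
    "caption": "Extract only the caption text under the illustration.",
    "author": "Extract the author byline exactly as printed.",
}
DEFAULT_PROMPT = "Extract all readable text from this image."

def prompt_for(label):
    """Pick the prompt matching the detected block type."""
    return BLOCK_PROMPTS.get(label, DEFAULT_PROMPT)
```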
4.4 Structure the output
Once the text is extracted, we group and label each block. This step includes:
- Sorting blocks from top to bottom and left to right
- Matching captions with illustrations
- Linking authors with nearby titles
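The reading-order sort can be approximated by bucketing boxes into rows, then sorting left to right within each row. A minimal sketch, where the pixel row tolerance is an assumed heuristic:

```python
def reading_order(blocks, row_tol=20):
    """Sort blocks top-to-bottom, then left-to-right within a row.

    Each block is a dict with a 'box' of (x1, y1, x2, y2); boxes whose
    top edges fall within row_tol pixels land in the same row bucket.
    """
    def key(b):
        x1, y1, _, _ = b["box"]
        return (y1 // row_tol, x1)
    return sorted(blocks, key=key)
```

For multi-column pages, a column-first grouping works better; this row-first version suits single-column reading order.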
Finally, we create structured JSON:
{
"title": "New Discovery in AI",
"author": "Jane Doe",
"text": "Researchers at XYZ University...",
"caption": "Illustration of the AI model."
}
With YOLOX and a vision LLM, you can finally build a structured OCR system for newspapers that delivers clean, labeled output.
5. Challenges in building structured OCR for newspapers
Building this system was not easy. Here are some real challenges we faced, and how we solved them.
5.1 Complex layouts
Newspapers do not play by the rules. Articles wrap around advertisements. Headlines sit next to unrelated images. To train YOLOX well, we needed many diverse examples.
The key lesson: annotate a wide range of layouts and fonts to get consistent results.
5.2 OCR struggles with noisy scans
Low-quality scans are a real problem. Blurred text and ink spots confused EasyOCR.
Switching to the vision LLM for key blocks (such as titles or captions) considerably improved the results, but it added cost and latency.
5.3 Balancing speed and accuracy
The vision LLM was accurate but slow and costly. So we added a toggle to choose between EasyOCR (fast) and the vision LLM (accurate) depending on the use case.
This way, users could balance performance and quality.
5.4 Annotating the dataset
Manually labeling layout blocks took time, but it was essential. We used tools like Label Studio to speed up annotation.
In the future, pre-trained layout models could help reduce this workload.
5.5 Matching related regions
It was not always easy to connect authors to their articles, or captions to illustrations. We used proximity rules to group nearby blocks, but they were not perfect.
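A proximity rule of this kind can be sketched as a nearest-center match; the squared box-center distance is an illustrative choice of metric:

```python
def match_caption(caption, illustrations):
    """Pair a caption block with the nearest illustration by center distance.

    Each block is a dict with a 'box' of (x1, y1, x2, y2).
    """
    def center(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2, (y1 + y2) / 2)

    cx, cy = center(caption["box"])

    def dist(ill):
        ix, iy = center(ill["box"])
        return (ix - cx) ** 2 + (iy - cy) ** 2

    return min(illustrations, key=dist)
```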
A potential improvement could be to use layout graphs or document-analysis models.
6. Conclusion
OCR for newspapers is difficult, but not impossible. Standard tools alone don’t cut it. You need layout awareness, intelligent extraction, and structured output.
By training YOLOX on newspaper-specific data, we detected meaningful regions such as titles, captions, and bylines. With EasyOCR and a vision LLM, we extracted clean text, even from difficult scans.
The end result? Structured, labeled JSON ready for indexing, search, or digital archives.
Whether you are digitizing archives or automating editorial tasks, this structured OCR pipeline for newspapers is powerful, scalable, and open source.
Thank you for reading! Try the pipeline, improve it, and share your results. We would love to see what you build.