OCR Optimization Guide¶

Current State: Enhanced OCR vs Simple OCR¶

Based on extensive analysis and testing, simple OCR processing consistently produces better results than the "enhanced" preprocessing pipeline.

Why Simple OCR Works Better¶

1. Information Preservation¶

No resolution loss: Maintains original scan quality and fine details
No processing artifacts: Avoids haloing, false edges, and compression artifacts
Original color information: Preserves color contrasts that help text recognition

2. Modern Tesseract Capabilities¶

Built-in preprocessing: Tesseract 4.x+ has excellent internal preprocessing optimized for OCR
Adaptive thresholding: Tesseract automatically handles varying lighting and contrast
Multiple recognition passes: Uses different algorithms internally for optimal results

3. Research-Backed Approach¶

High-resolution images (300+ DPI) consistently outperform downscaled versions
Minimal preprocessing reduces error accumulation from multiple processing steps
Original images retain maximum information for OCR engines to analyze

Recommended OCR Settings¶

✅ Optimal Configuration¶

{
  "enable_image_preprocessing": false,
  "auto_rotate_images": true,
  "ocr_dpi": 300
}

🔧 Tesseract Configuration¶

Page Segmentation Mode: PSM 3 (fully automatic page segmentation, but no OSD)
OCR Engine Mode: OEM 3 (default, based on what is available)
Language: Specify primary document language for better accuracy

📏 Image Guidelines¶

Minimum Resolution: 150 DPI for acceptable results, 300+ DPI for optimal
Maximum Size: No artificial limits - let Tesseract handle large images
Format: Keep original format when possible (TIFF, PNG preferred over JPEG)

Performance Comparison¶

Approach	Accuracy	Speed	Memory Usage	File Size
Simple OCR	95%+	Fast	Low	Original
Enhanced OCR	80-90%	Slow	High	2x larger

When to Use Enhanced Processing¶

Enhanced preprocessing should only be used for: - Severely degraded documents (damaged, faded, extremely poor scans) - Non-standard document types (handwritten notes, artistic text) - Specialized use cases where manual tuning is required

For 95% of typical documents (PDFs, scanned papers, photos of text), simple OCR produces superior results.

Implementation Changes¶

The default has been changed to: - enable_image_preprocessing: false (was true) - This immediately improves OCR accuracy for most users - Users can still enable enhanced processing if needed for specific documents

Migration Note¶

Existing users with enable_image_preprocessing: true should consider switching to false for better results. The enhanced processing can always be re-enabled for specific problematic documents.