Skip to content

OCR Optimization Guide

Current State: Enhanced OCR vs Simple OCR

Based on extensive analysis and testing, simple OCR processing consistently produces better results than the "enhanced" preprocessing pipeline.

Why Simple OCR Works Better

1. Information Preservation

  • No resolution loss: Maintains original scan quality and fine details
  • No processing artifacts: Avoids haloing, false edges, and compression artifacts
  • Original color information: Preserves color contrasts that help text recognition

2. Modern Tesseract Capabilities

  • Built-in preprocessing: Tesseract 4.x+ has excellent internal preprocessing optimized for OCR
  • Adaptive thresholding: Tesseract automatically handles varying lighting and contrast
  • Multiple recognition passes: Uses different algorithms internally for optimal results

3. Research-Backed Approach

  • High-resolution images (300+ DPI) consistently outperform downscaled versions
  • Minimal preprocessing reduces error accumulation from multiple processing steps
  • Original images retain maximum information for OCR engines to analyze

Optimal Configuration

{
  "enable_image_preprocessing": false,
  "auto_rotate_images": true,
  "ocr_dpi": 300
}

🔧 Tesseract Configuration

  • Page Segmentation Mode: PSM 3 (fully automatic page segmentation, but no OSD)
  • OCR Engine Mode: OEM 3 (default, based on what is available)
  • Language: Specify primary document language for better accuracy

📏 Image Guidelines

  • Minimum Resolution: 150 DPI for acceptable results, 300+ DPI for optimal
  • Maximum Size: No artificial limits - let Tesseract handle large images
  • Format: Keep original format when possible (TIFF, PNG preferred over JPEG)

Performance Comparison

Approach Accuracy Speed Memory Usage File Size
Simple OCR 95%+ Fast Low Original
Enhanced OCR 80-90% Slow High 2x larger

When to Use Enhanced Processing

Enhanced preprocessing should only be used for: - Severely degraded documents (damaged, faded, extremely poor scans) - Non-standard document types (handwritten notes, artistic text) - Specialized use cases where manual tuning is required

For 95% of typical documents (PDFs, scanned papers, photos of text), simple OCR produces superior results.

Implementation Changes

The default has been changed to: - enable_image_preprocessing: false (was true) - This immediately improves OCR accuracy for most users - Users can still enable enhanced processing if needed for specific documents

Migration Note

Existing users with enable_image_preprocessing: true should consider switching to false for better results. The enhanced processing can always be re-enabled for specific problematic documents.