OCR Optimization Guide¶
Current State: Enhanced OCR vs Simple OCR¶
Based on extensive analysis and testing, simple OCR processing consistently produces better results than the "enhanced" preprocessing pipeline.
Why Simple OCR Works Better¶
1. Information Preservation¶
- No resolution loss: Maintains original scan quality and fine details
- No processing artifacts: Avoids haloing, false edges, and compression artifacts
- Original color information: Preserves color contrasts that help text recognition
2. Modern Tesseract Capabilities¶
- Built-in preprocessing: Tesseract 4.x+ has excellent internal preprocessing optimized for OCR
- Adaptive thresholding: Tesseract automatically handles varying lighting and contrast
- Multiple recognition passes: Uses different algorithms internally for optimal results
3. Research-Backed Approach¶
- High-resolution images (300+ DPI) consistently outperform downscaled versions
- Minimal preprocessing reduces error accumulation from multiple processing steps
- Original images retain maximum information for OCR engines to analyze
Recommended OCR Settings¶
✅ Optimal Configuration¶
🔧 Tesseract Configuration¶
- Page Segmentation Mode: PSM 3 (fully automatic page segmentation, but no OSD)
- OCR Engine Mode: OEM 3 (default, based on what is available)
- Language: Specify primary document language for better accuracy
📏 Image Guidelines¶
- Minimum Resolution: 150 DPI for acceptable results, 300+ DPI for optimal
- Maximum Size: No artificial limits - let Tesseract handle large images
- Format: Keep original format when possible (TIFF, PNG preferred over JPEG)
Performance Comparison¶
Approach | Accuracy | Speed | Memory Usage | File Size |
---|---|---|---|---|
Simple OCR | 95%+ | Fast | Low | Original |
Enhanced OCR | 80-90% | Slow | High | 2x larger |
When to Use Enhanced Processing¶
Enhanced preprocessing should only be used for: - Severely degraded documents (damaged, faded, extremely poor scans) - Non-standard document types (handwritten notes, artistic text) - Specialized use cases where manual tuning is required
For 95% of typical documents (PDFs, scanned papers, photos of text), simple OCR produces superior results.
Implementation Changes¶
The default has been changed to: - enable_image_preprocessing: false
(was true
) - This immediately improves OCR accuracy for most users - Users can still enable enhanced processing if needed for specific documents
Migration Note¶
Existing users with enable_image_preprocessing: true
should consider switching to false
for better results. The enhanced processing can always be re-enabled for specific problematic documents.