OCR Queue System Improvements

This document describes the major improvements made to handle large-scale OCR processing of 100k+ files.

Key Improvements

1. Database-Backed Queue System

  • Replaced direct processing with a persistent queue table (job claiming is sketched below)
  • Added retry mechanisms and failure tracking
  • Implemented priority-based processing
  • Added recovery for crashed workers
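
Because many workers pull from the same table, each job must be claimed atomically so no two workers grab the same row. A minimal sketch, assuming the sqlx crate and the ocr_queue table shown under Database Schema below; claim_next_job is a hypothetical helper, not the project's actual code:

use sqlx::PgPool;
use uuid::Uuid;

// Atomically claim the highest-priority pending job; FOR UPDATE SKIP LOCKED
// lets concurrent workers skip rows another worker has already locked.
async fn claim_next_job(pool: &PgPool, worker_id: &str) -> sqlx::Result<Option<Uuid>> {
    let row: Option<(Uuid,)> = sqlx::query_as(
        "UPDATE ocr_queue
         SET status = 'processing', started_at = NOW(),
             worker_id = $1, attempts = attempts + 1
         WHERE id = (
             SELECT id FROM ocr_queue
             WHERE status = 'pending' AND attempts < max_attempts
             ORDER BY priority DESC, created_at
             FOR UPDATE SKIP LOCKED
             LIMIT 1
         )
         RETURNING id",
    )
    .bind(worker_id)
    .fetch_optional(pool)
    .await?;
    Ok(row.map(|(id,)| id))
}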

2. Worker Pool Architecture

  • Dedicated OCR worker processes with concurrency control (see the sketch after this list)
  • Configurable number of concurrent jobs
  • Graceful shutdown and error handling
  • Automatic stale job recovery
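
A minimal sketch of the concurrency control, assuming the tokio runtime; process_job and the job numbering are illustrative stand-ins for the real queue claim:

use std::sync::Arc;
use tokio::sync::Semaphore;

// Stand-in for the real handler, which would claim a row from ocr_queue and run OCR.
async fn process_job(job_id: u64) {
    println!("processing job {job_id}");
}

#[tokio::main]
async fn main() {
    let max_jobs = 4; // the real worker would read OCR_CONCURRENT_JOBS
    let permits = Arc::new(Semaphore::new(max_jobs));
    for job_id in 0..16 {
        // Blocks once max_jobs permits are held, capping jobs in flight.
        let permit = permits.clone().acquire_owned().await.unwrap();
        tokio::spawn(async move {
            process_job(job_id).await;
            drop(permit); // release the slot when the job finishes
        });
    }
    // A real worker would also wait for in-flight jobs during graceful shutdown.
}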

3. Batch Processing Support

  • Dedicated CLI tool for bulk ingestion
  • Processes files in configurable batches (default: 1000)
  • Concurrent file I/O with semaphore limiting, as shown below
  • Progress monitoring and statistics
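
The batching and I/O limiting can be sketched as follows, assuming the tokio and futures crates; ingest and ingest_all are illustrative names, not the tool's actual functions:

use futures::stream::{self, StreamExt};
use std::path::PathBuf;

// Hypothetical per-file step: read the file and hand it to the OCR queue.
async fn ingest(path: PathBuf) {
    if let Ok(bytes) = tokio::fs::read(&path).await {
        println!("queued {} ({} bytes)", path.display(), bytes.len());
    }
}

// Walk the file list in fixed-size batches, keeping at most max_io reads in flight.
async fn ingest_all(paths: Vec<PathBuf>, batch_size: usize, max_io: usize) {
    for batch in paths.chunks(batch_size) {
        stream::iter(batch.to_vec())
            .for_each_concurrent(max_io, ingest)
            .await;
        println!("finished a batch of {} files", batch.len());
    }
}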

4. Priority-Based Processing

Priority levels are based on file size (a sketch of the mapping follows):

  • Priority 10: ≤ 1MB files (highest)
  • Priority 8: 1-5MB files
  • Priority 6: 5-10MB files
  • Priority 4: 10-50MB files
  • Priority 2: > 50MB files (lowest)
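
A direct translation of these tiers into code might look like this; priority_for_size is a hypothetical name, and only the tiers themselves come from the table above:

// Map a file's size in bytes to its queue priority, mirroring the tiers above.
fn priority_for_size(bytes: u64) -> i32 {
    const MB: u64 = 1024 * 1024;
    match bytes {
        b if b <= MB => 10,     // <= 1MB: highest
        b if b <= 5 * MB => 8,  // 1-5MB
        b if b <= 10 * MB => 6, // 5-10MB
        b if b <= 50 * MB => 4, // 10-50MB
        _ => 2,                 // > 50MB: lowest
    }
}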

5. Monitoring & Observability

  • Real-time queue statistics API
  • Progress tracking and ETAs
  • Failed job requeuing
  • Automatic cleanup of old completed jobs

Database Schema

OCR Queue Table

CREATE TABLE ocr_queue (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
    status VARCHAR(20) DEFAULT 'pending',
    priority INT DEFAULT 5,
    attempts INT DEFAULT 0,
    max_attempts INT DEFAULT 3,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    error_message TEXT,
    worker_id VARCHAR(100),
    processing_time_ms INT,
    file_size BIGINT
);

Document Status Tracking

  • ocr_status: Current OCR processing status
  • ocr_error: Error message if OCR failed
  • ocr_completed_at: Timestamp when OCR completed

API Endpoints

Queue Status

GET /api/queue/stats
Returns:
{
    "pending": 1500,
    "processing": 8,
    "failed": 12,
    "completed_today": 5420,
    "avg_wait_time_minutes": 3.2,
    "oldest_pending_minutes": 15.7
}

Requeue Failed Jobs

POST /api/queue/requeue/failed
Requeues all failed jobs that haven't exceeded max attempts.

CLI Tools

Batch Ingestion

# Ingest all files from a directory
cargo run --bin batch_ingest /path/to/files --user-id 00000000-0000-0000-0000-000000000000

# Ingest and monitor progress
cargo run --bin batch_ingest /path/to/files --user-id USER_ID --monitor

Configuration

Environment Variables

  • OCR_CONCURRENT_JOBS: Number of concurrent OCR workers (default: 4)
  • OCR_TIMEOUT_SECONDS: OCR processing timeout (default: 300)
  • QUEUE_BATCH_SIZE: Batch size for processing (default: 1000)
  • MAX_CONCURRENT_IO: Max concurrent file operations (default: 50)
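
Reading these with their defaults needs only the standard library; env_or is an illustrative helper, not the project's actual code:

use std::env;
use std::str::FromStr;

// Parse an environment variable, falling back to a default when unset or invalid.
fn env_or<T: FromStr>(key: &str, default: T) -> T {
    env::var(key).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

fn main() {
    let concurrent_jobs: usize = env_or("OCR_CONCURRENT_JOBS", 4);
    let timeout_secs: u64 = env_or("OCR_TIMEOUT_SECONDS", 300);
    let batch_size: usize = env_or("QUEUE_BATCH_SIZE", 1000);
    let max_io: usize = env_or("MAX_CONCURRENT_IO", 50);
    println!("{concurrent_jobs} workers, {timeout_secs}s timeout, batch {batch_size}, {max_io} IO");
}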

User Settings

Users can configure:

  • concurrent_ocr_jobs: Max concurrent jobs for their documents
  • ocr_timeout_seconds: Processing timeout
  • enable_background_ocr: Enable/disable automatic OCR

Performance Optimizations

1. Memory Management

  • Streaming file reads for large files (example after this list)
  • Configurable memory limits per worker
  • Automatic cleanup of temporary data
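
Streaming keeps memory flat regardless of file size. A sketch assuming tokio; the 64 KiB chunk size and the OCR hand-off are assumptions:

use tokio::fs::File;
use tokio::io::AsyncReadExt;

// Read a large file in fixed-size chunks instead of loading it whole.
async fn stream_file(path: &str) -> std::io::Result<u64> {
    let mut file = File::open(path).await?;
    let mut buf = vec![0u8; 64 * 1024]; // one chunk at a time keeps memory bounded
    let mut total = 0u64;
    loop {
        let n = file.read(&mut buf).await?;
        if n == 0 {
            break; // end of file
        }
        // A real worker would hand &buf[..n] to the OCR pipeline here.
        total += n as u64;
    }
    Ok(total)
}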

2. I/O Optimization

  • Batch database operations (see the sketch below)
  • Connection pooling
  • Concurrent file processing with limits
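
Batching turns thousands of single-row inserts into one round trip. A sketch assuming sqlx and the ocr_queue table above; enqueue_batch is a hypothetical helper:

use sqlx::PgPool;
use uuid::Uuid;

// Insert a whole batch of documents into the queue with one statement;
// UNNEST expands the two parameter arrays into one row per document.
async fn enqueue_batch(pool: &PgPool, ids: &[Uuid], sizes: &[i64]) -> sqlx::Result<()> {
    sqlx::query(
        "INSERT INTO ocr_queue (document_id, file_size)
         SELECT * FROM UNNEST($1::uuid[], $2::bigint[])",
    )
    .bind(ids)
    .bind(sizes)
    .execute(pool)
    .await?;
    Ok(())
}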

3. Resource Control

  • CPU priority settings
  • Memory limit enforcement
  • Configurable worker counts

4. Failure Handling

  • Exponential backoff for retries, as sketched below
  • Separate failed job recovery
  • Automatic stale job detection
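
The document only specifies "exponential backoff"; one common curve doubles the delay per attempt with a cap, and the exact numbers here are assumptions:

use std::time::Duration;

// Delay before retry attempt n (1-based): 2^n seconds, capped at 5 minutes.
// Gives 2s, 4s, 8s, ... and never more than 300s.
fn backoff(attempt: u32) -> Duration {
    Duration::from_secs(2u64.saturating_pow(attempt).min(300))
}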

Monitoring & Maintenance

Automatic Tasks

  • Stale Recovery: Every 5 minutes, recover jobs stuck in processing (see the sketch after this list)
  • Cleanup: Daily cleanup of completed jobs older than 7 days
  • Health Checks: Worker health monitoring and restart
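
The stale-recovery pass reduces to one UPDATE on a timer. A sketch assuming tokio and sqlx; the 30-minute staleness threshold is an assumption, not a documented value:

use sqlx::PgPool;
use std::time::Duration;

// Every 5 minutes, return jobs stuck in 'processing' to the pending state
// so another worker can pick them up.
async fn recover_stale_jobs(pool: PgPool) -> sqlx::Result<()> {
    let mut tick = tokio::time::interval(Duration::from_secs(300));
    loop {
        tick.tick().await;
        sqlx::query(
            "UPDATE ocr_queue
             SET status = 'pending', worker_id = NULL
             WHERE status = 'processing'
               AND started_at < NOW() - INTERVAL '30 minutes'",
        )
        .execute(&pool)
        .await?;
    }
}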

Manual Operations

-- Check queue health
SELECT * FROM get_ocr_queue_stats();

-- Find problematic jobs
SELECT * FROM ocr_queue WHERE status = 'failed' ORDER BY created_at;

-- Requeue specific job
UPDATE ocr_queue SET status = 'pending', attempts = 0 WHERE id = 'job-id';

Scalability Improvements

For 100k+ Files:

  1. Horizontal Scaling: Multiple worker instances across servers
  2. Database Optimization: Partitioned queue tables by date
  3. Caching: Redis cache for frequently accessed metadata
  4. Load Balancing: Distribute workers across multiple machines

Performance Metrics:

  • Throughput: ~500-1000 files/hour per worker (depends on file size)
  • Memory Usage: ~100MB per worker, plus the size of the file currently being processed
  • Database Load: Optimized with proper indexing and batching

Migration Guide

From Old System:

  1. Run database migration: migrations/001_add_ocr_queue.sql
  2. Update application code to use queue endpoints
  3. Monitor existing processing and let the queue drain
  4. Start new workers with queue system

Zero-Downtime Migration:

  1. Deploy new code with feature flag disabled
  2. Run migration scripts
  3. Enable queue processing gradually
  4. Monitor and adjust worker counts as needed