# OCR Queue System Improvements
This document describes the major improvements made to handle large-scale OCR processing of 100k+ files.
## Key Improvements

### 1. Database-Backed Queue System

- Replaced direct processing with a persistent queue table
- Added retry mechanisms and failure tracking (see the sketch after this list)
- Implemented priority-based processing
- Added recovery for crashed workers
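A minimal sketch of the failure-tracking update a worker might run when a job fails (column names are from the schema below; the job id and error message are illustrative):

```sql
-- Record a failure; the job stays 'failed' until a retry path (automatic
-- backoff or the requeue endpoint) returns it to 'pending'.
UPDATE ocr_queue
SET attempts      = attempts + 1,
    status        = 'failed',
    error_message = 'OCR engine exited with an error',  -- illustrative
    worker_id     = NULL
WHERE id = '00000000-0000-0000-0000-000000000000';      -- failing job's id
```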
### 2. Worker Pool Architecture

- Dedicated OCR worker processes with concurrency control (claim sketch below)
- Configurable number of concurrent jobs
- Graceful shutdown and error handling
- Automatic stale job recovery
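In PostgreSQL the usual way to let many workers pull from one table without double-claiming is `FOR UPDATE SKIP LOCKED`; a sketch of a claim query under that assumption (not necessarily the exact query shipped):

```sql
-- Claim the best pending job atomically; SKIP LOCKED lets concurrent
-- workers each grab a different row instead of blocking on the same one.
UPDATE ocr_queue
SET status = 'processing', started_at = NOW(), worker_id = 'worker-01'
WHERE id = (
    SELECT id
    FROM ocr_queue
    WHERE status = 'pending' AND attempts < max_attempts
    ORDER BY priority DESC, created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, document_id;
```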
### 3. Batch Processing Support

- Dedicated CLI tool for bulk ingestion
- Processes files in configurable batches (default: 1000)
- Concurrent file I/O with semaphore limiting
- Progress monitoring and statistics
### 4. Priority-Based Processing

Priority levels are assigned by file size (see the sketch after this list):

- Priority 10: ≤ 1MB files (highest)
- Priority 8: 1-5MB files
- Priority 6: 5-10MB files
- Priority 4: 10-50MB files
- Priority 2: > 50MB files (lowest)
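Enqueue-time priority can then be a simple `CASE` over the byte size. A sketch, assuming the `documents` table carries a `file_size` column (the thresholds are the ones listed above):

```sql
-- Enqueue one document, deriving queue priority from its size in bytes.
INSERT INTO ocr_queue (document_id, file_size, priority)
SELECT d.id, d.file_size,
       CASE
           WHEN d.file_size <= 1  * 1024 * 1024 THEN 10  -- <= 1MB (highest)
           WHEN d.file_size <= 5  * 1024 * 1024 THEN 8   -- 1-5MB
           WHEN d.file_size <= 10 * 1024 * 1024 THEN 6   -- 5-10MB
           WHEN d.file_size <= 50 * 1024 * 1024 THEN 4   -- 10-50MB
           ELSE 2                                        -- > 50MB (lowest)
       END
FROM documents d
WHERE d.id = '00000000-0000-0000-0000-000000000000';  -- document to enqueue
```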
### 5. Monitoring & Observability

- Real-time queue statistics API
- Progress tracking and ETAs
- Failed job requeuing
- Automatic cleanup of old completed jobs
## Database Schema

### OCR Queue Table

```sql
CREATE TABLE ocr_queue (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
    status VARCHAR(20) DEFAULT 'pending',
    priority INT DEFAULT 5,
    attempts INT DEFAULT 0,
    max_attempts INT DEFAULT 3,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    error_message TEXT,
    worker_id VARCHAR(100),
    processing_time_ms INT,
    file_size BIGINT
);
```
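The claim query hits pending rows ordered by priority, and the stale sweep scans long-running `processing` rows, so partial indexes shaped for those two paths are a plausible addition (assumptions, not the shipped migration):

```sql
-- Serves the worker claim query: pending jobs, best priority first.
CREATE INDEX idx_ocr_queue_pending
    ON ocr_queue (priority DESC, created_at)
    WHERE status = 'pending';

-- Serves stale-job detection on long-running 'processing' rows.
CREATE INDEX idx_ocr_queue_processing
    ON ocr_queue (started_at)
    WHERE status = 'processing';
```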
### Document Status Tracking

- `ocr_status`: Current OCR processing status
- `ocr_error`: Error message if OCR failed
- `ocr_completed_at`: Timestamp when OCR completed
## API Endpoints

### Queue Status

Returns:

```json
{
  "pending": 1500,
  "processing": 8,
  "failed": 12,
  "completed_today": 5420,
  "avg_wait_time_minutes": 3.2,
  "oldest_pending_minutes": 15.7
}
```
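A sketch of the aggregation such an endpoint could run; the column list mirrors the JSON above, but the real query may differ:

```sql
-- Wait time is measured from enqueue (created_at) to claim (started_at).
SELECT
    COUNT(*) FILTER (WHERE status = 'pending')    AS pending,
    COUNT(*) FILTER (WHERE status = 'processing') AS processing,
    COUNT(*) FILTER (WHERE status = 'failed')     AS failed,
    COUNT(*) FILTER (WHERE status = 'completed'
                       AND completed_at >= date_trunc('day', NOW()))
                                                  AS completed_today,
    EXTRACT(EPOCH FROM AVG(started_at - created_at)) / 60
                                                  AS avg_wait_time_minutes,
    EXTRACT(EPOCH FROM (NOW() - MIN(created_at)
                          FILTER (WHERE status = 'pending'))) / 60
                                                  AS oldest_pending_minutes
FROM ocr_queue;
```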
### Requeue Failed Jobs

Requeues all failed jobs that haven't exceeded max attempts.
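The endpoint plausibly wraps a single statement like this (a sketch):

```sql
-- Return failed jobs that still have attempts left to the queue.
UPDATE ocr_queue
SET status = 'pending', worker_id = NULL, error_message = NULL
WHERE status = 'failed'
  AND attempts < max_attempts;
```

## CLI Tools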
### Batch Ingestion

```bash
# Ingest all files from a directory
cargo run --bin batch_ingest /path/to/files --user-id 00000000-0000-0000-0000-000000000000

# Ingest and monitor progress
cargo run --bin batch_ingest /path/to/files --user-id USER_ID --monitor
```
## Configuration

### Environment Variables

- `OCR_CONCURRENT_JOBS`: Number of concurrent OCR workers (default: 4)
- `OCR_TIMEOUT_SECONDS`: OCR processing timeout in seconds (default: 300)
- `QUEUE_BATCH_SIZE`: Batch size for queue processing (default: 1000)
- `MAX_CONCURRENT_IO`: Maximum concurrent file operations (default: 50)
### User Settings

Users can configure:

- `concurrent_ocr_jobs`: Maximum concurrent jobs for their documents
- `ocr_timeout_seconds`: Processing timeout in seconds
- `enable_background_ocr`: Enable/disable automatic OCR
## Performance Optimizations

### 1. Memory Management

- Streaming file reads for large files
- Configurable memory limits per worker
- Automatic cleanup of temporary data
### 2. I/O Optimization

- Batched database operations
- Connection pooling
- Concurrent file processing with limits
### 3. Resource Control

- CPU priority settings
- Memory limit enforcement
- Configurable worker counts
### 4. Failure Handling

- Exponential backoff for retries (see the sketch after this list)
- Separate failed job recovery
- Automatic stale job detection
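The schema above has no next-retry column, so as a hedged sketch: add a hypothetical `scheduled_at` column, push it out exponentially when requeuing, and have the claim query additionally filter on `scheduled_at <= NOW()`:

```sql
-- Hypothetical exponential backoff: the Nth retry waits ~30s * 2^N,
-- capped at one hour. 'scheduled_at' is NOT in the schema above.
UPDATE ocr_queue
SET status       = 'pending',
    scheduled_at = NOW() + LEAST(interval '1 hour',
                                 interval '30 seconds' * power(2, attempts))
WHERE status = 'failed'
  AND attempts < max_attempts;
```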
## Monitoring & Maintenance

### Automatic Tasks

- Stale Recovery: Every 5 minutes, recover jobs stuck in processing (sketched below)
- Cleanup: Daily cleanup of completed jobs older than 7 days (sketched below)
- Health Checks: Worker health monitoring and restart
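Sketches of the two sweeps; the retention window matches the schedule above, while the 10-minute stale threshold is illustrative:

```sql
-- Stale recovery: a job stuck in 'processing' past its timeout most
-- likely belongs to a crashed worker; hand it back to the queue.
UPDATE ocr_queue
SET status = 'pending', worker_id = NULL
WHERE status = 'processing'
  AND started_at < NOW() - interval '10 minutes';

-- Cleanup: enforce the 7-day retention window on completed jobs.
DELETE FROM ocr_queue
WHERE status = 'completed'
  AND completed_at < NOW() - interval '7 days';
```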
### Manual Operations

```sql
-- Check queue health
SELECT * FROM get_ocr_queue_stats();

-- Find problematic jobs
SELECT * FROM ocr_queue WHERE status = 'failed' ORDER BY created_at;

-- Requeue a specific job
UPDATE ocr_queue SET status = 'pending', attempts = 0 WHERE id = 'job-id';
```
## Scalability Improvements

### For 100k+ Files

- Horizontal Scaling: Multiple worker instances across servers
- Database Optimization: Queue tables partitioned by date (sketched after this list)
- Caching: Redis cache for frequently accessed metadata
- Load Balancing: Distribute workers across multiple machines
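A sketch of date partitioning for the queue table (declarative partitioning, PostgreSQL 10+; table names and ranges are illustrative):

```sql
-- Monthly partitions keep indexes small, and cleanup becomes a cheap
-- DROP of an old partition instead of a large DELETE.
CREATE TABLE ocr_queue_part (
    LIKE ocr_queue INCLUDING DEFAULTS
) PARTITION BY RANGE (created_at);

CREATE TABLE ocr_queue_2025_01 PARTITION OF ocr_queue_part
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
```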
### Performance Metrics

- Throughput: ~500-1000 files/hour per worker (depends on file size)
- Memory Usage: ~100MB per worker, plus the size of the file being processed
- Database Load: Optimized with proper indexing and batching
## Migration Guide

### From Old System

1. Run the database migration: `migrations/001_add_ocr_queue.sql` (a backfill sketch follows this list)
2. Update application code to use the queue endpoints
3. Monitor existing processing and let the queue drain
4. Start new workers with the queue system
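A hypothetical backfill for step 1, enqueuing every document that has not completed OCR (assumes `documents` carries a `file_size` column and the `ocr_status` column described above):

```sql
-- Enqueue documents missing OCR, skipping any already queued;
-- priority falls back to the column default (5).
INSERT INTO ocr_queue (document_id, file_size)
SELECT d.id, d.file_size
FROM documents d
WHERE d.ocr_status IS DISTINCT FROM 'completed'
  AND NOT EXISTS (
      SELECT 1 FROM ocr_queue q WHERE q.document_id = d.id
  );
```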
### Zero-Downtime Migration

1. Deploy new code with the feature flag disabled
2. Run migration scripts
3. Enable queue processing gradually
4. Monitor and adjust worker counts as needed