Architecture Overview¶
This document provides a comprehensive overview of Readur's architecture, design decisions, and technical implementation details.
Table of Contents¶
- System Architecture
- Technology Stack
- Component Overview
  - Backend (Rust/Axum)
  - Frontend (React)
  - Database (PostgreSQL)
  - OCR Engine
- Data Flow
- Security Architecture
- Performance Considerations
- Scalability
- Design Patterns
System Architecture¶
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ React Frontend  │────│  Rust Backend   │────│  PostgreSQL DB  │
│   (Port 8000)   │    │   (Axum API)    │    │   (Port 5433)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         │             ┌─────────────────┐             │
         └─────────────│  File Storage   │─────────────┘
                       │  + OCR Engine   │
                       └─────────────────┘
High-Level Components¶
- Web Interface: Modern React SPA with Material-UI
- API Server: High-performance Rust backend using Axum
- Database: PostgreSQL with full-text search capabilities
- File Storage: Local or network-mounted filesystem
- OCR Processing: Tesseract integration for text extraction
- Background Jobs: Async task processing for OCR and file watching
Technology Stack¶
Backend¶
- Language: Rust (for performance and memory safety)
- Web Framework: Axum (async, fast, type-safe)
- Database Access: SQLx (async SQL toolkit with compile-time checked queries; not a full ORM)
- Authentication: JWT tokens with bcrypt password hashing
- Async Runtime: Tokio
- Serialization: Serde
Frontend¶
- Framework: React 18 with TypeScript
- UI Library: Material-UI (MUI)
- State Management: React Context + Hooks
- Build Tool: Vite
- HTTP Client: Axios
- Routing: React Router
Infrastructure¶
- Database: PostgreSQL 14+ with pgvector extension
- OCR: Tesseract 4.0+
- Container: Docker with multi-stage builds
- Reverse Proxy: Nginx/Traefik compatible
Component Overview¶
Backend (Rust/Axum)¶
The backend is structured following clean architecture principles:
src/
├── main.rs # Application entry and server setup
├── config.rs # Configuration management
├── models.rs # Domain models and DTOs
├── error.rs # Error handling
├── auth.rs # Authentication middleware
├── routes/ # HTTP route handlers
│ ├── auth.rs # Authentication endpoints
│ ├── documents.rs # Document CRUD operations
│ ├── search.rs # Search functionality
│ └── ...
├── db/ # Database operations
│ ├── documents.rs # Document queries
│ ├── users.rs # User queries
│ └── ...
├── services/ # Business logic
│ ├── ocr.rs # OCR processing
│ ├── file_service.rs # File management
│ └── watcher.rs # Folder watching
└── tests/ # Integration tests
Key design decisions:
- Async-first: All I/O operations are async
- Type safety: Leverages Rust's type system
- Error handling: Comprehensive error types
- Dependency injection: Clean separation of concerns
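The "comprehensive error types" decision can be illustrated with a single application-level error enum that every handler returns. This is a minimal sketch using only the standard library; `AppError`, its variants, and `parse_document_id` are hypothetical names chosen for illustration and are not taken from Readur's `error.rs`.

```rust
use std::fmt;

// Hypothetical application error type; the variants are illustrative only.
#[derive(Debug, PartialEq)]
pub enum AppError {
    NotFound(String),
    Unauthorized,
    Validation(String),
    Internal(String),
}

impl AppError {
    /// Maps each error to the HTTP status code a handler would respond with.
    pub fn status_code(&self) -> u16 {
        match self {
            AppError::NotFound(_) => 404,
            AppError::Unauthorized => 401,
            AppError::Validation(_) => 400,
            AppError::Internal(_) => 500,
        }
    }
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppError::NotFound(what) => write!(f, "{what} not found"),
            AppError::Unauthorized => write!(f, "authentication required"),
            AppError::Validation(msg) => write!(f, "invalid input: {msg}"),
            // Internal details are deliberately not exposed in the message.
            AppError::Internal(_) => write!(f, "internal server error"),
        }
    }
}

impl std::error::Error for AppError {}

// Converting lower-level errors lets handlers use the `?` operator.
impl From<std::num::ParseIntError> for AppError {
    fn from(err: std::num::ParseIntError) -> Self {
        AppError::Validation(err.to_string())
    }
}

/// Example helper: parses a document ID from a URL path segment.
pub fn parse_document_id(raw: &str) -> Result<u64, AppError> {
    Ok(raw.parse::<u64>()?)
}
```

Keeping the status mapping on the error type means route handlers do not each have to decide which status code a failure deserves.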
Frontend (React)¶
The frontend follows a component-based architecture:
frontend/src/
├── components/ # Reusable UI components
│ ├── DocumentList/
│ ├── SearchBar/
│ └── ...
├── pages/ # Page-level components
│ ├── Dashboard/
│ ├── Documents/
│ └── ...
├── services/ # API integration
│ ├── api.ts # Base API client
│ ├── auth.ts # Auth service
│ └── documents.ts # Document service
├── hooks/ # Custom React hooks
├── contexts/ # React contexts
└── utils/ # Utility functions
Database (PostgreSQL)¶
Schema design optimized for document management:
-- Core tables
users                  -- User accounts
documents              -- Document metadata
document_content       -- Extracted text content
document_tags          -- Many-to-many tags
sources                -- File sources (folders, S3, etc.)
ocr_queue              -- OCR processing queue
-- Search optimization
document_search_index  -- Full-text search index
Key features:
- Full-text search: PostgreSQL's powerful search capabilities
- JSONB fields: Flexible metadata storage
- Triggers: Automatic search index updates
- Views: Optimized query patterns
OCR Engine¶
OCR processing pipeline:
- File Detection: New files detected via upload or folder watch
- Queue Management: Files added to processing queue
- Pre-processing: Image enhancement and optimization
- Text Extraction: Tesseract OCR with language detection
- Post-processing: Text cleaning and formatting
- Database Storage: Indexed for search
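The pipeline above is a fixed sequence of stages, which can be modeled as a small state machine. The sketch below is only an illustration of that ordering: `OcrStage`, `remaining_stages`, and `clean_text` are hypothetical names, and `clean_text` is a simple whitespace normalizer standing in for the real post-processing step.

```rust
/// Stages of the OCR pipeline, in the order the list above gives them.
/// The type and function names here are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OcrStage {
    Detected,
    Queued,
    PreProcessing,
    Extracting,
    PostProcessing,
    Stored,
}

impl OcrStage {
    /// Returns the stage that follows `self`, or `None` once the text is stored.
    pub fn next(self) -> Option<OcrStage> {
        match self {
            OcrStage::Detected => Some(OcrStage::Queued),
            OcrStage::Queued => Some(OcrStage::PreProcessing),
            OcrStage::PreProcessing => Some(OcrStage::Extracting),
            OcrStage::Extracting => Some(OcrStage::PostProcessing),
            OcrStage::PostProcessing => Some(OcrStage::Stored),
            OcrStage::Stored => None,
        }
    }
}

/// Lists `start` and every stage a job still has to pass through after it.
pub fn remaining_stages(start: OcrStage) -> Vec<OcrStage> {
    let mut path = vec![start];
    let mut current = start;
    while let Some(next) = current.next() {
        path.push(next);
        current = next;
    }
    path
}

/// Stand-in for post-processing: collapses every run of whitespace
/// (including line breaks) into a single space and trims both ends.
pub fn clean_text(raw: &str) -> String {
    raw.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Encoding the order in `next` makes it impossible for a job to skip a stage by accident, at the cost of needing a code change whenever a stage is added.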
Data Flow¶
Document Upload Flow¶
sequenceDiagram
User->>Frontend: Upload Document
Frontend->>API: POST /api/documents
API->>FileStorage: Save File
API->>Database: Create Document Record
API->>OCRQueue: Add to Queue
API-->>Frontend: Document Created
OCRWorker->>OCRQueue: Poll for Jobs
OCRWorker->>FileStorage: Read File
OCRWorker->>Tesseract: Extract Text
OCRWorker->>Database: Update with Content
OCRWorker->>Frontend: WebSocket Update
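The upload sequence splits into a synchronous request path and an asynchronous worker path. The following sketch mirrors the diagram with in-memory stand-ins; `UploadState`, `StoredDocument`, `handle_upload`, and `process_next` are hypothetical names, the `uploads/` path layout is an assumption, and the `extract` closure stands in for the Tesseract call. The WebSocket notification is omitted.

```rust
use std::collections::VecDeque;

/// Hypothetical in-memory stand-ins for file storage, the documents table,
/// and the OCR queue.
#[derive(Default)]
pub struct UploadState {
    pub files: Vec<(String, Vec<u8>)>, // (path, bytes)
    pub documents: Vec<StoredDocument>,
    pub ocr_queue: VecDeque<u64>, // document IDs awaiting OCR
}

#[derive(Debug, Clone, PartialEq)]
pub struct StoredDocument {
    pub id: u64,
    pub filename: String,
    pub path: String,
    pub ocr_text: Option<String>,
}

/// Request path: save the file, create the record, enqueue, respond.
pub fn handle_upload(state: &mut UploadState, filename: &str, bytes: Vec<u8>) -> StoredDocument {
    let id = state.documents.len() as u64 + 1;
    let path = format!("uploads/{id}-{filename}"); // assumed layout
    state.files.push((path.clone(), bytes)); // API->>FileStorage
    let doc = StoredDocument { id, filename: filename.to_string(), path, ocr_text: None };
    state.documents.push(doc.clone()); // API->>Database
    state.ocr_queue.push_back(id); // API->>OCRQueue
    doc // API-->>Frontend
}

/// Worker path: poll the queue, read the file, extract text, update the record.
/// Returns the processed document ID, or `None` if the queue is empty or the
/// queued document or its file cannot be found (the job is then dropped).
pub fn process_next(state: &mut UploadState, extract: impl Fn(&[u8]) -> String) -> Option<u64> {
    let id = state.ocr_queue.pop_front()?;
    let path = state.documents.iter().find(|d| d.id == id)?.path.clone();
    let text = {
        let (_, bytes) = state.files.iter().find(|(p, _)| *p == path)?;
        extract(bytes)
    };
    let doc = state.documents.iter_mut().find(|d| d.id == id)?;
    doc.ocr_text = Some(text);
    Some(id)
}
```

Because `handle_upload` returns before any OCR runs, the client gets a response immediately and the extracted text appears on the record later.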
Search Flow¶
sequenceDiagram
User->>Frontend: Enter Search Query
Frontend->>API: GET /api/search
API->>Database: Full-text Search
Database->>API: Ranked Results
API->>Frontend: Search Results
Frontend->>User: Display Results
Security Architecture¶
Authentication & Authorization¶
- JWT Tokens: Stateless authentication
- Role-Based Access: Admin, User roles
- Token Refresh: Automatic token renewal
- Password Security: bcrypt hashing with a per-password salt and a configurable cost factor
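A role check on already-decoded token claims might look like the sketch below. It does not parse or verify a JWT; `Claims`, `Action`, and `authorize` are hypothetical names, and the policy shown (admins may do anything, users may only touch their own documents) is an assumption for illustration rather than Readur's documented rule set.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Admin,
    User,
}

/// Subset of the claims a decoded token might carry (hypothetical fields).
pub struct Claims {
    pub user_id: u64,
    pub role: Role,
    pub expires_at: u64, // Unix timestamp in seconds
}

/// Operations a request may attempt (illustrative, not exhaustive).
pub enum Action {
    ReadDocument { owner_id: u64 },
    DeleteDocument { owner_id: u64 },
    ManageUsers,
}

/// Returns true when the token is unexpired and the role permits the action.
/// Assumed policy: admins may do anything; users may only act on their own
/// documents and may not manage users.
pub fn authorize(claims: &Claims, action: &Action, now: u64) -> bool {
    if now >= claims.expires_at {
        return false;
    }
    match (claims.role, action) {
        (Role::Admin, _) => true,
        (Role::User, Action::ReadDocument { owner_id })
        | (Role::User, Action::DeleteDocument { owner_id }) => *owner_id == claims.user_id,
        (Role::User, Action::ManageUsers) => false,
    }
}
```

Passing `now` in explicitly keeps the function pure, which makes expiry behavior easy to test.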
API Security¶
- CORS: Configurable allowed origins
- Rate Limiting: Prevent abuse
- Input Validation: Comprehensive validation
- SQL Injection Prevention: Parameterized queries via SQLx
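Rate limiting is commonly implemented with a token bucket; the source does not say which algorithm Readur uses, so the following is a generic sketch of that technique. `TokenBucket` and its parameters are hypothetical, and the caller supplies the current time so the behavior is deterministic.

```rust
use std::time::Instant;

/// Token bucket: holds up to `capacity` tokens, refilled continuously at
/// `refill_per_sec`. Each allowed request consumes one token.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Creates a full bucket.
    pub fn new(capacity: u32, refill_per_sec: f64, now: Instant) -> Self {
        Self {
            capacity: capacity as f64,
            tokens: capacity as f64,
            refill_per_sec,
            last_refill: now,
        }
    }

    /// Returns true if the request is allowed, false if it should be rejected
    /// (for example with HTTP 429).
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Add the tokens earned since the last call, capped at capacity.
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

In a server, one bucket would typically be kept per client key (user ID or IP address); the capacity sets the allowed burst and the refill rate sets the sustained request rate.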
File Security¶
- Upload Validation: File type and size checks
- Virus Scanning: Optional ClamAV integration
- Access Control: Document-level permissions
- Secure Storage: Filesystem permissions
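Upload validation can check size and sniff the file type from its leading bytes instead of trusting the filename. This is a minimal sketch: the set of accepted types (PDF, PNG, JPEG) and the names `FileKind`, `UploadError`, `sniff_kind`, and `validate_upload` are assumptions for illustration, and the size limit is a parameter rather than a known Readur setting.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FileKind {
    Pdf,
    Png,
    Jpeg,
}

#[derive(Debug, PartialEq)]
pub enum UploadError {
    Empty,
    TooLarge { size: usize, max: usize },
    UnsupportedType,
}

/// Detects the file type from its magic bytes.
pub fn sniff_kind(bytes: &[u8]) -> Option<FileKind> {
    if bytes.starts_with(b"%PDF-") {
        Some(FileKind::Pdf)
    } else if bytes.starts_with(&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]) {
        Some(FileKind::Png)
    } else if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some(FileKind::Jpeg)
    } else {
        None
    }
}

/// Rejects empty, oversized, or unrecognized uploads before they are stored.
pub fn validate_upload(bytes: &[u8], max_bytes: usize) -> Result<FileKind, UploadError> {
    if bytes.is_empty() {
        return Err(UploadError::Empty);
    }
    if bytes.len() > max_bytes {
        return Err(UploadError::TooLarge { size: bytes.len(), max: max_bytes });
    }
    sniff_kind(bytes).ok_or(UploadError::UnsupportedType)
}
```

Magic-byte checks only confirm the header; they complement, rather than replace, a malware scan of the full contents.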
Performance Considerations¶
Backend Optimization¶
- Connection Pooling: Database connection reuse
- Async I/O: Non-blocking operations
- Caching: In-memory caching for hot data
- Query Optimization: Indexed searches
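In-memory caching of hot data usually needs an expiry policy so stale entries do not linger. The sketch below shows one simple option, a time-to-live cache; `TtlCache` is a hypothetical type, and nothing in the source states that Readur's cache works this way.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Cache whose entries expire a fixed time after insertion.
pub struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (V, Instant)>, // value and its insertion time
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Stores `value` under `key`, replacing any previous entry.
    pub fn insert(&mut self, key: &str, value: V, now: Instant) {
        self.entries.insert(key.to_string(), (value, now));
    }

    /// Returns a clone of the cached value, or `None` if the key is absent.
    /// An entry whose TTL has elapsed is evicted and treated as absent.
    pub fn get(&mut self, key: &str, now: Instant) -> Option<V> {
        let expired = match self.entries.get(key) {
            Some((_, stored_at)) => now.saturating_duration_since(*stored_at) >= self.ttl,
            None => return None,
        };
        if expired {
            self.entries.remove(key);
            return None;
        }
        self.entries.get(key).map(|(value, _)| value.clone())
    }
}
```

Shared between request handlers, such a cache would need to sit behind a lock or be replaced with a concurrent map; that wiring is left out here.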
Frontend Optimization¶
- Code Splitting: Lazy loading of routes
- Virtual Scrolling: Large document lists
- Memoization: Prevent unnecessary re-renders
- Service Workers: Offline capability
OCR Optimization¶
- Parallel Processing: Multiple concurrent jobs
- Image Pre-processing: Enhance OCR accuracy
- Resource Limits: Memory and CPU constraints
- Queue Priority: Smart job scheduling
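Queue priority and bounded parallelism can be sketched with a binary heap: higher-priority jobs come out first, ties go to the job that was enqueued earlier, and a worker pool takes at most a fixed number of jobs at a time. `OcrJob`, `OcrQueue`, and the priority scale are hypothetical; the source does not describe Readur's actual scheduling rules.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

#[derive(Debug, PartialEq, Eq)]
pub struct OcrJob {
    pub document_id: u64,
    pub priority: u8,      // higher values run first
    pub enqueued_seq: u64, // assigned on push; lower means older
}

impl Ord for OcrJob {
    fn cmp(&self, other: &Self) -> Ordering {
        // BinaryHeap is a max-heap: highest priority wins, then the oldest job.
        self.priority
            .cmp(&other.priority)
            .then_with(|| other.enqueued_seq.cmp(&self.enqueued_seq))
            .then_with(|| other.document_id.cmp(&self.document_id))
    }
}

impl PartialOrd for OcrJob {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

pub struct OcrQueue {
    heap: BinaryHeap<OcrJob>,
    next_seq: u64,
}

impl OcrQueue {
    pub fn new() -> Self {
        Self { heap: BinaryHeap::new(), next_seq: 0 }
    }

    pub fn push(&mut self, document_id: u64, priority: u8) {
        let job = OcrJob { document_id, priority, enqueued_seq: self.next_seq };
        self.next_seq += 1;
        self.heap.push(job);
    }

    pub fn pop(&mut self) -> Option<OcrJob> {
        self.heap.pop()
    }

    /// Takes up to `max_concurrent` jobs, in scheduling order, for a worker pool.
    pub fn take_batch(&mut self, max_concurrent: usize) -> Vec<OcrJob> {
        let mut batch = Vec::new();
        while batch.len() < max_concurrent {
            match self.heap.pop() {
                Some(job) => batch.push(job),
                None => break,
            }
        }
        batch
    }
}
```

Capping the batch size is one way to enforce the memory and CPU constraints mentioned above, since each OCR job can be resource-intensive.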
Scalability¶
Horizontal Scaling¶
# docker-compose.yml: multiple backend instances
services:
  backend-1:
    image: readur:latest
    environment:
      - INSTANCE_ID=1
  backend-2:
    image: readur:latest
    environment:
      - INSTANCE_ID=2

# nginx.conf: load balancer in front of the backend instances
upstream backend {
    server backend-1:8000;
    server backend-2:8000;
}
Database Scaling¶
- Read Replicas: Distribute read load
- Connection Pooling: PgBouncer
- Partitioning: Time-based partitions
- Archival: Move old documents
Storage Scaling¶
- S3 Compatible: Object storage support
- CDN Integration: Static file delivery
- Distributed Storage: GlusterFS/Ceph
- Archive Tiering: Hot/cold storage
Design Patterns¶
Backend Patterns¶
- Repository Pattern: Database abstraction
- Service Layer: Business logic separation
- Middleware Chain: Request processing
- Error Boundaries: Graceful error handling
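The repository and service-layer patterns fit together as follows: a trait hides how documents are stored, and the service holds business rules while staying generic over that trait. This is a sketch under assumptions: `DocumentRecord`, `DocumentRepository`, `InMemoryDocuments`, and `DocumentService` are hypothetical names, and the in-memory map stands in for SQL-backed queries.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct DocumentRecord {
    pub id: u64,
    pub title: String,
}

/// Repository: the only layer that knows how documents are stored.
pub trait DocumentRepository {
    fn insert(&mut self, title: &str) -> DocumentRecord;
    fn find_by_id(&self, id: u64) -> Option<DocumentRecord>;
}

/// In-memory implementation, handy for tests. A database-backed type would
/// implement the same trait.
#[derive(Default)]
pub struct InMemoryDocuments {
    rows: HashMap<u64, DocumentRecord>,
    next_id: u64,
}

impl DocumentRepository for InMemoryDocuments {
    fn insert(&mut self, title: &str) -> DocumentRecord {
        self.next_id += 1;
        let doc = DocumentRecord { id: self.next_id, title: title.to_string() };
        self.rows.insert(doc.id, doc.clone());
        doc
    }

    fn find_by_id(&self, id: u64) -> Option<DocumentRecord> {
        self.rows.get(&id).cloned()
    }
}

/// Service layer: business rules, generic over any repository.
pub struct DocumentService<R: DocumentRepository> {
    repo: R,
}

impl<R: DocumentRepository> DocumentService<R> {
    pub fn new(repo: R) -> Self {
        Self { repo }
    }

    /// Trims the title and rejects blank ones before touching storage.
    pub fn create(&mut self, title: &str) -> Result<DocumentRecord, String> {
        let title = title.trim();
        if title.is_empty() {
            return Err("title must not be empty".to_string());
        }
        Ok(self.repo.insert(title))
    }

    pub fn get(&self, id: u64) -> Option<DocumentRecord> {
        self.repo.find_by_id(id)
    }
}
```

Because the service depends only on the trait, the validation logic can be tested without a running database.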
Frontend Patterns¶
- Container/Presenter: Component separation
- Custom Hooks: Logic reuse
- Context Provider: State management
- HOCs: Cross-cutting concerns
Database Patterns¶
- Soft Deletes: Data preservation
- Audit Trails: Change tracking
- Materialized Views: Performance
- Event Sourcing: Optional audit log
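Soft deletes and audit trails are easiest to see together: a delete sets a timestamp instead of removing the row, and every change appends an audit entry. The sketch below models this in memory; `Row`, `AuditEntry`, `SoftDeleteTable`, and the `deleted_at` field name are hypothetical stand-ins for what would be table columns and a separate audit table.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: u64,
    pub title: String,
    pub deleted_at: Option<u64>, // Unix seconds; None means the row is active
}

#[derive(Debug, PartialEq)]
pub struct AuditEntry {
    pub row_id: u64,
    pub actor: String,
    pub action: &'static str,
    pub at: u64,
}

#[derive(Default)]
pub struct SoftDeleteTable {
    rows: Vec<Row>,
    audit: Vec<AuditEntry>,
}

impl SoftDeleteTable {
    fn record(&mut self, row_id: u64, actor: &str, action: &'static str, at: u64) {
        self.audit.push(AuditEntry { row_id, actor: actor.to_string(), action, at });
    }

    pub fn insert(&mut self, id: u64, title: &str, actor: &str, now: u64) {
        self.rows.push(Row { id, title: title.to_string(), deleted_at: None });
        self.record(id, actor, "insert", now);
    }

    /// Marks an active row as deleted. Returns false if no active row matches.
    pub fn soft_delete(&mut self, id: u64, actor: &str, now: u64) -> bool {
        let found = match self.rows.iter_mut().find(|r| r.id == id && r.deleted_at.is_none()) {
            Some(row) => {
                row.deleted_at = Some(now);
                true
            }
            None => false,
        };
        if found {
            self.record(id, actor, "soft_delete", now);
        }
        found
    }

    /// Equivalent of filtering on `deleted_at IS NULL`.
    pub fn active(&self) -> Vec<&Row> {
        self.rows.iter().filter(|r| r.deleted_at.is_none()).collect()
    }

    pub fn audit_log(&self) -> &[AuditEntry] {
        &self.audit
    }
}
```

The deleted row stays available for recovery or archival, and the audit log records who removed it and when.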
Future Architecture Considerations¶
Microservices Migration¶
Potential service boundaries:
- Authentication Service
- Document Service
- OCR Service
- Search Service
- Notification Service
Event-Driven Architecture¶
- Message Queue (RabbitMQ/Kafka)
- Event Sourcing
- CQRS Pattern
- Async communication
Cloud-Native Features¶
- Kubernetes deployment
- Service mesh (Istio)
- Distributed tracing
- Cloud storage integration
Monitoring and Observability¶
Metrics¶
- Prometheus metrics endpoint
- Custom business metrics
- Performance counters
- Resource utilization
Logging¶
- Structured logging (JSON)
- Log aggregation ready
- Correlation IDs
- Debug levels
Tracing¶
- OpenTelemetry support
- Distributed tracing
- Performance profiling
- Request tracking
Next Steps¶
- Review deployment options
- Explore performance tuning
- Understand database design
- Learn about testing strategy