Monitoring and Observability Guide

Overview

This guide covers setting up comprehensive monitoring for Readur, including metrics collection, log aggregation, alerting, and dashboard creation.

Monitoring Stack Components

Core Components

  1. Metrics Collection: Prometheus + Node Exporter
  2. Visualization: Grafana
  3. Log Aggregation: Loki or ELK Stack
  4. Alerting: AlertManager
  5. Application Monitoring: Custom metrics and health checks
  6. Uptime Monitoring: Uptime Kuma or Pingdom

Health Monitoring

Built-in Health Endpoints

# Basic health check
curl http://localhost:8000/health

# Detailed health status
curl http://localhost:8000/health/detailed

# Response format
{
  "status": "healthy",
  "database": "connected",
  "redis": "connected",
  "storage": "accessible",
  "ocr_queue": 45,
  "version": "2.5.4",
  "uptime": 345600
}
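
For cron jobs or external watchdogs, a small script can poll the detailed endpoint and exit non-zero on problems. This is a minimal sketch; the URL and field names follow the example response above and may need adjusting for your deployment.

# check_health.py - minimal external probe (URL and field names assumed from the example above)
import json
import sys
import urllib.request

HEALTH_URL = "http://localhost:8000/health/detailed"

def main() -> int:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            payload = json.load(resp)
    except Exception as exc:
        print(f"health check failed: {exc}")
        return 2
    if payload.get("status") != "healthy":
        print(f"unhealthy: {json.dumps(payload)}")
        return 1
    print(f"healthy (ocr_queue={payload.get('ocr_queue')}, uptime={payload.get('uptime')}s)")
    return 0

if __name__ == "__main__":
    sys.exit(main())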

Custom Health Checks

# health_checks.py
# Assumes application-provided handles: `db` (database session), `storage`
# (file storage client), `celery` (Celery app), and `redis` (Redis client).
import time
from typing import Dict, Any

class HealthMonitor:
    @staticmethod
    def check_database() -> Dict[str, Any]:
        try:
            start = time.monotonic()
            db.session.execute("SELECT 1")
            return {"status": "healthy", "response_time": time.monotonic() - start}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    @staticmethod
    def check_storage() -> Dict[str, Any]:
        try:
            # Listing a single file confirms the storage backend is reachable
            storage.list_files(limit=1)
            return {"status": "healthy", "available_space": storage.get_free_space()}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    @staticmethod
    def check_ocr_workers() -> Dict[str, Any]:
        # Celery's inspect() returns None when no workers respond
        active = celery.control.inspect().active()
        return {
            "status": "healthy" if active else "degraded",
            "active_workers": len(active or {}),
            "queue_length": redis.llen("ocr_queue")
        }
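
The individual checks can then be combined into the detailed health response. The sketch below assumes a Flask-style app object alongside the HealthMonitor class above.

# Possible wiring for /health/detailed (assumes a Flask-style `app` and HealthMonitor above)
from flask import jsonify

@app.route('/health/detailed')
def health_detailed():
    checks = {
        "database": HealthMonitor.check_database(),
        "storage": HealthMonitor.check_storage(),
        "ocr_workers": HealthMonitor.check_ocr_workers(),
    }
    healthy = all(c["status"] == "healthy" for c in checks.values())
    body = {"status": "healthy" if healthy else "degraded", **checks}
    return jsonify(body), 200 if healthy else 503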

Prometheus Setup

Installation and Configuration

# docker-compose.monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - monitoring

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    container_name: postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://readur:password@postgres:5432/readur?sslmode=disable"
    ports:
      - "9187:9187"
    networks:
      - monitoring

  redis-exporter:
    image: oliver006/redis_exporter:latest
    container_name: redis-exporter
    environment:
      REDIS_ADDR: "redis://redis:6379"
    ports:
      - "9121:9121"
    networks:
      - monitoring

networks:
  monitoring:
    external: true

volumes:
  prometheus_data:
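
The monitoring network is declared as external, so create it once before starting the stack:

# Create the shared network and start the monitoring stack
docker network create monitoring
docker-compose -f docker-compose.monitoring.yml up -d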

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'readur-monitor'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - '/etc/prometheus/alerts/*.yml'

scrape_configs:
  - job_name: 'readur'
    static_configs:
      - targets: ['readur:8000']
    metrics_path: '/metrics'

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
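
Prometheus refuses to start on an invalid configuration, so validate the file after changes (promtool ships in the Prometheus image):

# Validate the configuration
docker-compose -f docker-compose.monitoring.yml exec prometheus \
  promtool check config /etc/prometheus/prometheus.yml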

Grafana Dashboards

Setup Grafana

# Add to docker-compose.monitoring.yml
grafana:
  image: grafana/grafana:latest
  container_name: grafana
  environment:
    - GF_SECURITY_ADMIN_USER=admin
    - GF_SECURITY_ADMIN_PASSWORD=changeme
    - GF_SERVER_ROOT_URL=https://grafana.readur.company.com
    - GF_INSTALL_PLUGINS=redis-datasource
  volumes:
    - grafana_data:/var/lib/grafana
    - ./grafana/provisioning:/etc/grafana/provisioning
  ports:
    - "3000:3000"
  networks:
    - monitoring
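
Also declare grafana_data under the top-level volumes: block of docker-compose.monitoring.yml. For the mounted provisioning directory to take effect, Grafana needs at least a datasource definition; a minimal sketch pointing at the Prometheus container (assumed path grafana/provisioning/datasources/prometheus.yml):

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true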

Dashboard Configuration

# grafana/provisioning/dashboards/readur.json
{
  "dashboard": {
    "title": "Readur Performance Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(readur_requests_total[5m])"
        }]
      },
      {
        "title": "Response Time",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(readur_request_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "OCR Queue",
        "targets": [{
          "expr": "readur_ocr_queue_length"
        }]
      },
      {
        "title": "Database Connections",
        "targets": [{
          "expr": "pg_stat_database_numbackends{datname='readur'}"
        }]
      }
    ]
  }
}
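
Dashboard JSON files are only picked up if a dashboard provider points at their directory; a minimal provider sketch (assumed path grafana/provisioning/dashboards/dashboards.yml):

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: 'readur'
    orgId: 1
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards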

Application Metrics

Custom Metrics Implementation

# metrics.py
# Assumes an existing Flask-style `app`, a `redis` client, and the application
# helpers get_active_user_count() and get_document_count().
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
)

# Define metrics
request_count = Counter('readur_requests_total', 'Total requests', ['method', 'endpoint'])
request_duration = Histogram('readur_request_duration_seconds', 'Request duration')
ocr_queue_length = Gauge('readur_ocr_queue_length', 'OCR queue length')
active_users = Gauge('readur_active_users', 'Active users in last 5 minutes')
document_count = Gauge('readur_documents_total', 'Total documents', ['status'])

# WSGI middleware to track request counts and durations
class MetricsMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        path = environ.get('PATH_INFO', '/')
        method = environ.get('REQUEST_METHOD', 'GET')

        with request_duration.time():
            request_count.labels(method=method, endpoint=path).inc()
            return self.app(environ, start_response)

# Metrics endpoint scraped by Prometheus
@app.route('/metrics')
def metrics():
    # Update gauges at scrape time
    ocr_queue_length.set(redis.llen('ocr_queue'))
    active_users.set(get_active_user_count())
    document_count.labels(status='processed').set(get_document_count('processed'))

    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
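
The middleware only takes effect once it wraps the WSGI callable; with a Flask-style app this is a single assignment (assumed setup):

# Wrap the WSGI callable so every request passes through MetricsMiddleware
app.wsgi_app = MetricsMiddleware(app.wsgi_app)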

Log Aggregation

Loki Setup

# Add to docker-compose.monitoring.yml
loki:
  image: grafana/loki:latest
  container_name: loki
  ports:
    - "3100:3100"
  volumes:
    - ./loki/loki-config.yml:/etc/loki/loki-config.yml
    - loki_data:/loki
  command: -config.file=/etc/loki/loki-config.yml
  networks:
    - monitoring

promtail:
  image: grafana/promtail:latest
  container_name: promtail
  volumes:
    - /var/log:/var/log:ro
    - /var/lib/docker/containers:/var/lib/docker/containers:ro
    - /var/run/docker.sock:/var/run/docker.sock:ro   # required for docker_sd_configs
    - ./promtail/promtail-config.yml:/etc/promtail/promtail-config.yml
  command: -config.file=/etc/promtail/promtail-config.yml
  networks:
    - monitoring

Log Configuration

# promtail/promtail-config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: readur
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        filters:
          - name: label
            values: ["com.docker.compose.project=readur"]
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'logstream'
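
As with Grafana, declare loki_data under the top-level volumes: block. Once Promtail is shipping logs, a quick query against the Loki HTTP API confirms they are arriving (adjust the container label to match your actual container name):

# Query recent logs for the readur container
curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={container="readur"}' \
  --data-urlencode 'limit=10'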

Alerting

AlertManager Configuration

# alertmanager/config.yml
global:
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.company.com:587'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-admins'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    - match:
        severity: warning
      receiver: 'team-admins'

receivers:
  - name: 'team-admins'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'Readur Alert: {{ .GroupLabels.alertname }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
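
prometheus.yml above references alertmanager:9093 and loads rules from /etc/prometheus/alerts, so the AlertManager container still has to be added to docker-compose.monitoring.yml, and the prometheus service needs - ./prometheus/alerts:/etc/prometheus/alerts in its volumes list. A minimal sketch of the service:

# Add to docker-compose.monitoring.yml
alertmanager:
  image: prom/alertmanager:latest
  container_name: alertmanager
  volumes:
    - ./alertmanager/config.yml:/etc/alertmanager/alertmanager.yml
  command:
    - '--config.file=/etc/alertmanager/alertmanager.yml'
  ports:
    - "9093:9093"
  networks:
    - monitoring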

Alert Rules

# prometheus/alerts/readur.yml
groups:
  - name: readur
    rules:
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(readur_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time on {{ $labels.instance }}"
          description: "95th percentile response time is {{ $value }}s"

      - alert: DatabaseDown
        expr: up{job="postgres"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database is down"
          description: "PostgreSQL database is not responding"

      - alert: HighOCRQueue
        expr: readur_ocr_queue_length > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OCR queue backlog"
          description: "OCR queue has {{ $value }} pending items"

      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

Performance Monitoring

APM Integration

# apm_config.py
from elasticapm import Client

# Configure APM
apm_client = Client({
    'SERVICE_NAME': 'readur',
    'SERVER_URL': 'http://apm-server:8200',
    'ENVIRONMENT': 'production',
    'SECRET_TOKEN': 'your-secret-token',
})

# Instrument Flask app
from elasticapm.contrib.flask import ElasticAPM
apm = ElasticAPM(app, client=apm_client)

Custom Performance Metrics

# performance_metrics.py
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def track_performance(operation_name):
    start_time = time.monotonic()
    try:
        yield
    finally:
        duration = time.monotonic() - start_time
        metrics.record_operation_time(operation_name, duration)  # application metrics helper

        if duration > 1.0:  # Log slow operations
            logger.warning(f"Slow operation: {operation_name} took {duration:.2f}s")

# Usage
with track_performance("document_processing"):
    process_document(doc_id)

Uptime Monitoring

External Monitoring

# uptime-kuma/docker-compose.yml
version: '3.8'

services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    container_name: uptime-kuma
    volumes:
      - uptime-kuma_data:/app/data
    ports:
      - "3001:3001"
    restart: unless-stopped

volumes:
  uptime-kuma_data:

Status Page Configuration

# Public status page (reverse proxy for Uptime Kuma)
server {
    listen 443 ssl;
    server_name status.readur.company.com;

    # Adjust certificate paths for your environment
    ssl_certificate     /etc/ssl/certs/status.readur.company.com.crt;
    ssl_certificate_key /etc/ssl/private/status.readur.company.com.key;

    location / {
        proxy_pass http://localhost:3001;
        proxy_set_header Host $host;

        # Uptime Kuma uses WebSockets, so the upgrade headers are required
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

Dashboard Examples

Key Metrics Dashboard

-- Query for document processing stats
SELECT 
    DATE(created_at) as date,
    COUNT(*) as documents_processed,
    AVG(processing_time) as avg_processing_time,
    MAX(processing_time) as max_processing_time
FROM documents
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;

Real-time Monitoring

// WebSocket monitoring dashboard
const ws = new WebSocket('wss://readur.company.com/ws/metrics');

ws.onmessage = (event) => {
    const metrics = JSON.parse(event.data);
    updateDashboard({
        activeUsers: metrics.active_users,
        queueLength: metrics.queue_length,
        responseTime: metrics.response_time,
        errorRate: metrics.error_rate
    });
};

Troubleshooting Monitoring Issues

Prometheus Not Scraping

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Verify metrics endpoint
curl http://localhost:8000/metrics

# Check network connectivity
docker network inspect monitoring

Missing Metrics

# Debug metric collection
docker-compose exec readur python -c "
from prometheus_client import REGISTRY
for collector in REGISTRY._collector_to_names:
    print(collector)
"

High Memory Usage

# Check Prometheus storage usage (the prometheus_data volume is mounted at /prometheus)
docker-compose exec prometheus du -sh /prometheus

# Analyze which series consume the most space
docker-compose exec prometheus promtool tsdb analyze /prometheus

# Reduce retention by lowering --storage.tsdb.retention.time in
# docker-compose.monitoring.yml, then recreate the container

# Remove data marked for deletion (requires --web.enable-admin-api)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

Best Practices

Monitoring Strategy

  1. Start Simple: Begin with basic health checks and expand
  2. Avoid Alert Fatigue: Only alert on actionable issues
  3. SLI/SLO Definition: Define and track service level indicators
  4. Dashboard Organization: Create role-specific dashboards
  5. Log Retention: Balance storage costs with debugging needs
  6. Security: Protect monitoring endpoints and dashboards
  7. Documentation: Document alert runbooks and response procedures

Maintenance

#!/bin/bash
# Weekly maintenance tasks

# Rotate application logs (assumes logrotate is available in the readur container)
docker-compose exec readur logrotate -f /etc/logrotate.conf

# Clean up metrics marked for deletion (requires --web.enable-admin-api on Prometheus)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

# Back up a Grafana dashboard via the HTTP API
# (replace <dashboard-uid> and use a service account token)
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  http://localhost:3000/api/dashboards/uid/<dashboard-uid> > dashboard-backup.json

# Update monitoring stack
docker-compose -f docker-compose.monitoring.yml pull
docker-compose -f docker-compose.monitoring.yml up -d