Monitoring and Observability Guide¶
Overview¶
This guide covers setting up comprehensive monitoring for Readur, including metrics collection, log aggregation, alerting, and dashboard creation.
Monitoring Stack Components¶
Core Components¶
- Metrics Collection: Prometheus + Node Exporter
- Visualization: Grafana
- Log Aggregation: Loki or ELK Stack
- Alerting: AlertManager
- Application Monitoring: Custom metrics and health checks
- Uptime Monitoring: Uptime Kuma or Pingdom
Health Monitoring¶
Built-in Health Endpoints¶
# Basic health check
curl http://localhost:8000/health

# Detailed health status
curl http://localhost:8000/health/detailed

# Response format
{
  "status": "healthy",
  "database": "connected",
  "redis": "connected",
  "storage": "accessible",
  "ocr_queue": 45,
  "version": "2.5.4",
  "uptime": 345600
}
Custom Health Checks¶
# health_checks.py
import time
from typing import Dict, Any

# db, storage, celery, and redis are the application's existing handles
# (database session, storage backend, Celery app, and Redis client)


class HealthMonitor:
    @staticmethod
    def check_database() -> Dict[str, Any]:
        try:
            start = time.time()
            db.session.execute("SELECT 1")
            return {"status": "healthy", "response_time": round(time.time() - start, 3)}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    @staticmethod
    def check_storage() -> Dict[str, Any]:
        try:
            # Check if storage is accessible
            storage.list_files(limit=1)
            return {"status": "healthy", "available_space": storage.get_free_space()}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    @staticmethod
    def check_ocr_workers() -> Dict[str, Any]:
        # Ask Celery for active workers and Redis for the queue backlog
        active = celery.control.inspect().active()
        return {
            "status": "healthy" if active else "degraded",
            "active_workers": len(active or {}),
            "queue_length": redis.llen("ocr_queue"),
        }
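These checks can be exposed through the detailed health endpoint shown above. A minimal sketch, assuming a Flask application object named app; the aggregation logic and status codes are illustrative, not Readur's actual implementation:

# health_endpoint.py (illustrative sketch)
from flask import jsonify

@app.route('/health/detailed')
def detailed_health():
    checks = {
        "database": HealthMonitor.check_database(),
        "storage": HealthMonitor.check_storage(),
        "ocr_workers": HealthMonitor.check_ocr_workers(),
    }
    healthy = all(c["status"] == "healthy" for c in checks.values())
    return jsonify({"status": "healthy" if healthy else "degraded", **checks}), 200 if healthy else 503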
Prometheus Setup¶
Installation and Configuration¶
# docker-compose.monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts:/etc/prometheus/alerts   # alert rule files (see Alert Rules below)
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - monitoring

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    container_name: postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://readur:password@postgres:5432/readur?sslmode=disable"
    ports:
      - "9187:9187"
    networks:
      - monitoring

  redis-exporter:
    image: oliver006/redis_exporter:latest
    container_name: redis-exporter
    environment:
      REDIS_ADDR: "redis://redis:6379"
    ports:
      - "9121:9121"
    networks:
      - monitoring

networks:
  monitoring:
    external: true

volumes:
  prometheus_data:
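The compose file marks the monitoring network as external, so create it once before bringing the stack up:

# Create the shared network, then start the monitoring stack
docker network create monitoring
docker-compose -f docker-compose.monitoring.yml up -d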
Prometheus Configuration¶
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'readur-monitor'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - '/etc/prometheus/alerts/*.yml'

scrape_configs:
  - job_name: 'readur'
    static_configs:
      - targets: ['readur:8000']
    metrics_path: '/metrics'

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
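Validate the configuration before (re)starting Prometheus; promtool ships inside the Prometheus image:

# Check the main config and the referenced rule files
docker-compose -f docker-compose.monitoring.yml exec prometheus \
  promtool check config /etc/prometheus/prometheus.yml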
Grafana Dashboards¶
Set Up Grafana¶
# Add to docker-compose.monitoring.yml
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_SERVER_ROOT_URL=https://grafana.readur.company.com
      - GF_INSTALL_PLUGINS=redis-datasource
    volumes:
      - grafana_data:/var/lib/grafana              # declare grafana_data under the top-level volumes: key
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
    networks:
      - monitoring
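With the provisioning directory mounted, Grafana can auto-load its datasources at startup. A minimal provisioning file sketch; the file name and datasource names are illustrative:

# grafana/provisioning/datasources/datasources.yml (illustrative)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100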
Dashboard Configuration¶
# grafana/provisioning/dashboards/readur.json
{
  "dashboard": {
    "title": "Readur Performance Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(readur_requests_total[5m])"
        }]
      },
      {
        "title": "Response Time",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(readur_request_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "OCR Queue",
        "targets": [{
          "expr": "readur_ocr_queue_length"
        }]
      },
      {
        "title": "Database Connections",
        "targets": [{
          "expr": "pg_stat_database_numbackends{datname='readur'}"
        }]
      }
    ]
  }
}
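Grafana only loads JSON dashboards from that directory if a dashboard provider is also provisioned. A minimal provider sketch; the provider name and folder are illustrative:

# grafana/provisioning/dashboards/provider.yml (illustrative)
apiVersion: 1
providers:
  - name: 'readur'
    folder: 'Readur'
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards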
Application Metrics¶
Custom Metrics Implementation¶
# metrics.py
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST

# Define metrics
request_count = Counter('readur_requests_total', 'Total requests', ['method', 'endpoint'])
request_duration = Histogram('readur_request_duration_seconds', 'Request duration')
ocr_queue_length = Gauge('readur_ocr_queue_length', 'OCR queue length')
active_users = Gauge('readur_active_users', 'Active users in last 5 minutes')
document_count = Gauge('readur_documents_total', 'Total documents', ['status'])


# WSGI middleware to track requests
class MetricsMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        path = environ.get('PATH_INFO', '/')
        method = environ.get('REQUEST_METHOD', 'GET')
        request_count.labels(method=method, endpoint=path).inc()
        # Time the wrapped application call
        with request_duration.time():
            return self.app(environ, start_response)


# Metrics endpoint (app, redis, and the helper functions are the application's existing objects)
@app.route('/metrics')
def metrics():
    # Update gauges just before serving the scrape
    ocr_queue_length.set(redis.llen('ocr_queue'))
    active_users.set(get_active_user_count())
    document_count.labels(status='processed').set(get_document_count('processed'))
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
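To activate the middleware, wrap the Flask application's WSGI callable once at startup (assuming the Flask app object is named app):

# Register the middleware so every request is counted and timed
app.wsgi_app = MetricsMiddleware(app.wsgi_app)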
Log Aggregation¶
Loki Setup¶
# Add to docker-compose.monitoring.yml
  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml
      - loki_data:/loki                              # declare loki_data under the top-level volumes: key
    command: -config.file=/etc/loki/loki-config.yml
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro   # required by docker_sd_configs below
      - ./promtail/promtail-config.yml:/etc/promtail/promtail-config.yml
    command: -config.file=/etc/promtail/promtail-config.yml
    networks:
      - monitoring
Log Configuration¶
# promtail/promtail-config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: readur
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        filters:
          - name: label
            values: ["com.docker.compose.project=readur"]
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'logstream'
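Once Promtail is shipping logs, they can be queried from Grafana's Explore view with LogQL; for example, to pull error lines from the application container (adjust the container value to the actual container name produced by the relabeling above):

# LogQL query in Grafana Explore (Loki datasource)
{container="readur"} |= "ERROR"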
Alerting¶
AlertManager Configuration¶
# alertmanager/config.yml
global:
  # Placeholder SMTP settings - replace with your mail server and credentials
  smtp_from: 'alertmanager@company.com'
  smtp_smarthost: 'smtp.company.com:587'
  smtp_auth_username: 'alertmanager@company.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-admins'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'team-admins'

receivers:
  - name: 'team-admins'
    email_configs:
      - to: 'team-admins@company.com'
        headers:
          Subject: 'Readur Alert: {{ .GroupLabels.alertname }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
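The Prometheus configuration above points at alertmanager:9093, so an Alertmanager service must also run on the monitoring network. A minimal service definition to add to docker-compose.monitoring.yml; the mount path mirrors the config file above:

# Add to docker-compose.monitoring.yml
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager/config.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    ports:
      - "9093:9093"
    networks:
      - monitoring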
Alert Rules¶
# prometheus/alerts/readur.yml
groups:
  - name: readur
    rules:
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(readur_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time on {{ $labels.instance }}"
          description: "95th percentile response time is {{ $value }}s"

      - alert: DatabaseDown
        expr: up{job="postgres"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database is down"
          description: "PostgreSQL database is not responding"

      - alert: HighOCRQueue
        expr: readur_ocr_queue_length > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OCR queue backlog"
          description: "OCR queue has {{ $value }} pending items"

      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
Performance Monitoring¶
APM Integration¶
# apm_config.py
from elasticapm import Client

# Configure APM
apm_client = Client({
    'SERVICE_NAME': 'readur',
    'SERVER_URL': 'http://apm-server:8200',
    'ENVIRONMENT': 'production',
    'SECRET_TOKEN': 'your-secret-token',
})

# Instrument Flask app
from elasticapm.contrib.flask import ElasticAPM
apm = ElasticAPM(app, client=apm_client)
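With the agent attached, individual operations can be traced as spans inside the automatically captured request transactions. A small sketch using the agent's capture_span context manager; run_ocr and document are placeholders for your own code:

# Trace a specific operation inside the current APM transaction
import elasticapm

with elasticapm.capture_span('ocr_processing'):
    run_ocr(document)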
Custom Performance Metrics¶
# performance_metrics.py
import time
from contextlib import contextmanager

# metrics and logger are the application's existing metrics helper and logger


@contextmanager
def track_performance(operation_name):
    start_time = time.time()
    try:
        yield
    finally:
        duration = time.time() - start_time
        metrics.record_operation_time(operation_name, duration)
        if duration > 1.0:  # Log slow operations
            logger.warning(f"Slow operation: {operation_name} took {duration:.2f}s")


# Usage
with track_performance("document_processing"):
    process_document(doc_id)
Uptime Monitoring¶
External Monitoring¶
# uptime-kuma/docker-compose.yml
version: '3.8'

services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    container_name: uptime-kuma
    volumes:
      - uptime-kuma_data:/app/data
    ports:
      - "3001:3001"
    restart: unless-stopped

volumes:
  uptime-kuma_data:
Status Page Configuration¶
# Public status page
server {
    listen 443 ssl;
    server_name status.readur.company.com;

    # Adjust certificate paths to your environment
    ssl_certificate     /etc/nginx/ssl/status.readur.company.com.crt;
    ssl_certificate_key /etc/nginx/ssl/status.readur.company.com.key;

    location / {
        proxy_pass http://localhost:3001;
        proxy_set_header Host $host;
    }
}
Dashboard Examples¶
Key Metrics Dashboard¶
-- Query for document processing stats
SELECT
    DATE(created_at) AS date,
    COUNT(*) AS documents_processed,
    AVG(processing_time) AS avg_processing_time,
    MAX(processing_time) AS max_processing_time
FROM documents
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;
Real-time Monitoring¶
// WebSocket monitoring dashboard
const ws = new WebSocket('wss://readur.company.com/ws/metrics');

ws.onmessage = (event) => {
  const metrics = JSON.parse(event.data);
  updateDashboard({
    activeUsers: metrics.active_users,
    queueLength: metrics.queue_length,
    responseTime: metrics.response_time,
    errorRate: metrics.error_rate
  });
};
Troubleshooting Monitoring Issues¶
Prometheus Not Scraping¶
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets
# Verify metrics endpoint
curl http://localhost:8000/metrics
# Check network connectivity
docker network inspect monitoring
Missing Metrics¶
# Debug metric collection
docker-compose exec readur python -c "
from prometheus_client import REGISTRY
for collector in REGISTRY._collector_to_names:
    print(collector)
"
High Memory Usage¶
# Check Prometheus storage usage inside the container
docker-compose -f docker-compose.monitoring.yml exec prometheus du -sh /prometheus

# Analyze which series and labels consume the most space
docker-compose -f docker-compose.monitoring.yml exec prometheus promtool tsdb analyze /prometheus

# Reduce retention by lowering --storage.tsdb.retention.time in the compose file, then recreate the container
docker-compose -f docker-compose.monitoring.yml up -d prometheus
Best Practices¶
Monitoring Strategy¶
- Start Simple: Begin with basic health checks and expand
- Alert Fatigue: Only alert on actionable issues
- SLI/SLO Definition: Define and track service level indicators (a recording-rule sketch follows this list)
- Dashboard Organization: Create role-specific dashboards
- Log Retention: Balance storage costs with debugging needs
- Security: Protect monitoring endpoints and dashboards
- Documentation: Document alert runbooks and response procedures
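As a starting point for a latency SLI, a Prometheus recording rule can track the fraction of requests served within one second, using the histogram defined in the Application Metrics section. A minimal sketch; the file name, rule name, and one-second threshold are illustrative choices:

# prometheus/alerts/slo.yml (illustrative; matched by the existing rule_files glob)
groups:
  - name: readur-slo
    rules:
      - record: readur:requests_under_1s:ratio_5m
        expr: |
          sum(rate(readur_request_duration_seconds_bucket{le="1.0"}[5m]))
          /
          sum(rate(readur_request_duration_seconds_count[5m]))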
Maintenance¶
#!/bin/bash
# Weekly maintenance tasks

# Rotate logs
docker-compose exec readur logrotate -f /etc/logrotate.conf

# Clean up old metrics (requires Prometheus to be started with --web.enable-admin-api)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

# Back up Grafana data (dashboards, users, datasources) by archiving its volume;
# adjust the volume name to match your compose project prefix
docker run --rm -v grafana_data:/var/lib/grafana -v "$(pwd)":/backup alpine \
  tar czf /backup/grafana-backup.tar.gz -C /var/lib/grafana .

# Update monitoring stack
docker-compose -f docker-compose.monitoring.yml pull
docker-compose -f docker-compose.monitoring.yml up -d