talk2me/PRODUCTION_DEPLOYMENT.md
Adolfo Delorenzo 92fd390866 Add production WSGI server - Flask dev server unsuitable for production load
This adds a complete production deployment setup using Gunicorn as the WSGI server, replacing Flask's development server.

Key components:
- Gunicorn configuration with optimized worker settings
- Support for sync, threaded, and async (gevent) workers
- Automatic worker recycling to prevent memory leaks
- Increased timeouts for audio processing
- Production-ready logging and monitoring

Deployment options:
1. Docker/Docker Compose for containerized deployment
2. Systemd service for traditional deployment
3. Nginx reverse proxy configuration
4. SSL/TLS support

Production features:
- wsgi.py entry point for WSGI servers
- gunicorn_config.py with production settings
- Dockerfile with multi-stage build
- docker-compose.yml with full stack (Redis, PostgreSQL)
- nginx.conf with caching and security headers
- systemd service with security hardening
- deploy.sh automated deployment script

Configuration:
- .env.production template with all settings
- Support for environment-based configuration
- Separate requirements-prod.txt
- Prometheus metrics endpoint (/metrics)

Monitoring:
- Health check endpoints for liveness/readiness
- Prometheus-compatible metrics
- Structured logging
- Memory usage tracking
- Request counting

Security:
- Non-root user in Docker
- Systemd security restrictions
- Nginx security headers
- File permission hardening
- Resource limits

Documentation:
- Comprehensive PRODUCTION_DEPLOYMENT.md
- Scaling strategies
- Performance tuning guide
- Troubleshooting section

Also fixes a GC stats collection error in memory_manager.py.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-03 08:49:32 -06:00


Production Deployment Guide

This guide covers deploying Talk2Me in a production environment using a proper WSGI server.

Overview

The Flask development server is not suitable for production use. This guide covers:

  • Gunicorn as the WSGI server
  • Nginx as a reverse proxy
  • Docker for containerization
  • Systemd for process management
  • Security best practices

Quick Start with Docker

1. Using Docker Compose

# Clone the repository
git clone https://github.com/your-repo/talk2me.git
cd talk2me

# Create .env file with production settings
cat > .env <<EOF
TTS_API_KEY=your-api-key
ADMIN_TOKEN=your-secure-admin-token
SECRET_KEY=your-secure-secret-key
POSTGRES_PASSWORD=your-secure-db-password
EOF

# Build and start services
docker-compose up -d

# Check status
docker-compose ps
docker-compose logs -f talk2me
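
The repository's docker-compose.yml wires up the full stack (app, Redis, PostgreSQL). As a rough sketch of what such a file contains — service names, images, and mounts here are illustrative, not the actual file:

```yaml
services:
  talk2me:
    build: .
    ports:
      - "5005:5005"
    env_file: .env
    depends_on:
      - redis
      - postgres
    volumes:
      - ./logs:/app/logs
  redis:
    image: redis:7-alpine
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: talk2me
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
```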

2. Using Docker (standalone)

# Build the image
docker build -t talk2me .

# Run the container
docker run -d \
  --name talk2me \
  -p 5005:5005 \
  -e TTS_API_KEY=your-api-key \
  -e ADMIN_TOKEN=your-secure-token \
  -e SECRET_KEY=your-secure-key \
  -v $(pwd)/logs:/app/logs \
  talk2me

Manual Deployment

1. System Requirements

  • Ubuntu 20.04+ or similar Linux distribution
  • Python 3.8+
  • Nginx
  • Systemd
  • 4GB+ RAM recommended
  • GPU (optional, for faster transcription)

2. Installation

Run the deployment script as root:

sudo ./deploy.sh

Or manually:

# Install system dependencies
sudo apt-get update
sudo apt-get install -y python3-pip python3-venv nginx

# Create application user
sudo useradd -m -s /bin/bash talk2me

# Create directories
sudo mkdir -p /opt/talk2me /var/log/talk2me
sudo chown talk2me:talk2me /opt/talk2me /var/log/talk2me

# Copy application files
sudo cp -r . /opt/talk2me/
sudo chown -R talk2me:talk2me /opt/talk2me

# Install Python dependencies
sudo -u talk2me python3 -m venv /opt/talk2me/venv
sudo -u talk2me /opt/talk2me/venv/bin/pip install -r requirements-prod.txt

# Configure and start services
sudo cp talk2me.service /etc/systemd/system/
sudo systemctl enable talk2me
sudo systemctl start talk2me
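
The talk2me.service unit ships with the repository. A unit with the kind of security hardening described in this guide might look roughly like the following — paths and directives are illustrative, not the shipped file:

```ini
[Unit]
Description=Talk2Me WSGI service
After=network.target

[Service]
User=talk2me
Group=talk2me
WorkingDirectory=/opt/talk2me
EnvironmentFile=/opt/talk2me/.env
ExecStart=/opt/talk2me/venv/bin/gunicorn -c gunicorn_config.py wsgi:application
Restart=on-failure

# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/talk2me

[Install]
WantedBy=multi-user.target
```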

Gunicorn Configuration

The gunicorn_config.py file contains production-ready settings:

Worker Configuration

# Number of worker processes
workers = multiprocessing.cpu_count() * 2 + 1

# Worker timeout (increased for audio processing)
timeout = 120

# Restart workers periodically to prevent memory leaks
max_requests = 1000
max_requests_jitter = 50
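
The GUNICORN_* environment variables used in the tuning examples below only take effect if gunicorn_config.py reads them. A sketch of how that wiring might look (variable names match the examples; the defaults are assumptions):

```python
import multiprocessing
import os

# Fall back to CPU-based defaults when the env vars are unset
workers = int(os.environ.get("GUNICORN_WORKERS",
                             multiprocessing.cpu_count() * 2 + 1))
threads = int(os.environ.get("GUNICORN_THREADS", 1))
worker_class = os.environ.get("GUNICORN_WORKER_CLASS", "sync")
worker_connections = int(os.environ.get("GUNICORN_WORKER_CONNECTIONS", 1000))
```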

Performance Tuning

For different workloads:

# CPU-bound (transcription heavy)
export GUNICORN_WORKERS=8
export GUNICORN_THREADS=1

# I/O-bound (many concurrent requests)
export GUNICORN_WORKERS=4
export GUNICORN_THREADS=4
export GUNICORN_WORKER_CLASS=gthread

# Async (best concurrency)
export GUNICORN_WORKER_CLASS=gevent
export GUNICORN_WORKER_CONNECTIONS=1000

Nginx Configuration

Basic Setup

The provided nginx.conf includes:

  • Reverse proxy to Gunicorn
  • Static file serving
  • WebSocket support
  • Security headers
  • Gzip compression

SSL/TLS Setup

server {
    listen 443 ssl http2;
    server_name your-domain.com;
    
    ssl_certificate /etc/letsencrypt/live/your-domain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/your-domain.com/privkey.pem;
    
    # Strong SSL configuration
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
    ssl_prefer_server_ciphers off;
    
    # HSTS
    add_header Strict-Transport-Security "max-age=63072000" always;
}

Environment Variables

Required

# Security
SECRET_KEY=your-very-secure-secret-key
ADMIN_TOKEN=your-admin-api-token

# TTS Configuration
TTS_API_KEY=your-tts-api-key
TTS_SERVER_URL=http://your-tts-server:5050/v1/audio/speech

# Flask
FLASK_ENV=production

Optional

# Performance
GUNICORN_WORKERS=4
GUNICORN_THREADS=2
MEMORY_THRESHOLD_MB=4096
GPU_MEMORY_THRESHOLD_MB=2048

# Database (for session storage)
DATABASE_URL=postgresql://user:pass@localhost/talk2me
REDIS_URL=redis://localhost:6379/0

# Monitoring
SENTRY_DSN=your-sentry-dsn

Monitoring

Health Checks

# Basic health check
curl http://localhost:5005/health

# Detailed health check
curl http://localhost:5005/health/detailed

# Memory usage
curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/memory

Logs

# Application logs
tail -f /var/log/talk2me/talk2me.log

# Error logs
tail -f /var/log/talk2me/errors.log

# Gunicorn logs
journalctl -u talk2me -f

# Nginx logs
tail -f /var/log/nginx/access.log
tail -f /var/log/nginx/error.log

Metrics

With the prometheus_client package installed:

# Prometheus metrics endpoint
curl http://localhost:5005/metrics
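
The /metrics endpoint returns the Prometheus plain-text exposition format. A stdlib-only sketch of what that format looks like (metric names are illustrative; the real endpoint is generated by prometheus_client):

```python
def render_metrics(metrics):
    """Render {name: (help, type, value)} as Prometheus text exposition format."""
    lines = []
    for name, (help_text, mtype, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

sample = {
    "talk2me_requests_total": ("Total HTTP requests.", "counter", 1042),
    "talk2me_memory_bytes": ("Resident memory in bytes.", "gauge", 268435456),
}
print(render_metrics(sample))
```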

Scaling

Horizontal Scaling

For multiple servers:

  1. Use Redis for session storage
  2. Use PostgreSQL for persistent data
  3. Load balance with Nginx:
upstream talk2me_backends {
    least_conn;
    server server1:5005 weight=1;
    server server2:5005 weight=1;
    server server3:5005 weight=1;
}
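
For step 1, server-side sessions in Redis might be configured along these lines — a fragment of the app setup, assuming the Flask-Session extension (key names follow its documented config):

```python
# Assumes: pip install flask-session redis
import redis
from flask_session import Session

app.config["SESSION_TYPE"] = "redis"
app.config["SESSION_REDIS"] = redis.from_url("redis://localhost:6379/0")
Session(app)
```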

Vertical Scaling

Adjust based on load:

# High memory usage
MEMORY_THRESHOLD_MB=8192
GPU_MEMORY_THRESHOLD_MB=4096

# More workers
GUNICORN_WORKERS=16
GUNICORN_THREADS=4

# Larger file limits (an nginx.conf directive, not an env var)
client_max_body_size 100M;

Security

Firewall

# Allow only necessary ports
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow 22/tcp
sudo ufw enable

File Permissions

# Secure file permissions
sudo chmod 750 /opt/talk2me
sudo chmod 640 /opt/talk2me/.env
sudo chmod 755 /opt/talk2me/static

AppArmor/SELinux

Create an AppArmor or SELinux profile that confines the service to its own directories and network ports.

Backup

Database Backup

# PostgreSQL
pg_dump talk2me > backup.sql

# Redis
redis-cli BGSAVE

Application Backup

# Backup application and logs
tar -czf talk2me-backup.tar.gz \
  /opt/talk2me \
  /var/log/talk2me \
  /etc/systemd/system/talk2me.service \
  /etc/nginx/sites-available/talk2me

Troubleshooting

Service Won't Start

# Check service status
systemctl status talk2me

# Check logs
journalctl -u talk2me -n 100

# Test configuration
sudo -u talk2me /opt/talk2me/venv/bin/gunicorn --check-config wsgi:application

High Memory Usage

# Trigger cleanup
curl -X POST -H "X-Admin-Token: token" http://localhost:5005/admin/memory/cleanup

# Restart workers
systemctl reload talk2me

Slow Response Times

  1. Check worker count
  2. Enable async workers
  3. Check GPU availability
  4. Review nginx buffering settings

Performance Optimization

1. Enable GPU

Ensure CUDA/ROCm is properly installed:

# Check GPU
nvidia-smi  # or rocm-smi

# Set in environment
export CUDA_VISIBLE_DEVICES=0

2. Optimize Workers

import multiprocessing

# For CPU-heavy workloads
workers = multiprocessing.cpu_count()
threads = 1

# For I/O-heavy workloads
workers = multiprocessing.cpu_count() * 2
threads = 4

3. Enable Caching

Use Redis for caching translations:

CACHE_TYPE = 'redis'
CACHE_REDIS_URL = 'redis://localhost:6379/0'
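
Underneath, translation caching is the cache-aside pattern. A minimal stdlib sketch with a dict standing in for Redis (key scheme and TTL are illustrative):

```python
import hashlib
import time

_cache = {}        # stands in for Redis
CACHE_TTL = 3600   # seconds

def cached_translate(text, target_lang, translate_fn):
    """Return a cached translation if still fresh, else compute and store it."""
    key = hashlib.sha256(f"{target_lang}:{text}".encode()).hexdigest()
    hit = _cache.get(key)
    if hit is not None and time.time() - hit[1] < CACHE_TTL:
        return hit[0]
    result = translate_fn(text, target_lang)
    _cache[key] = (result, time.time())
    return result

calls = []
fake = lambda text, lang: calls.append(text) or f"{text}-{lang}"
print(cached_translate("hello", "es", fake))  # computes and stores
print(cached_translate("hello", "es", fake))  # served from cache
```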

Maintenance

Regular Tasks

  1. Log Rotation: Configured automatically
  2. Database Cleanup: Run weekly
  3. Model Updates: Check for Whisper updates
  4. Security Updates: Keep dependencies updated
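
"Configured automatically" for log rotation typically means a logrotate drop-in such as /etc/logrotate.d/talk2me; one might look like this (rotation counts are illustrative):

```
/var/log/talk2me/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```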

Update Procedure

# Backup first
./backup.sh

# Update code
git pull

# Update dependencies
sudo -u talk2me /opt/talk2me/venv/bin/pip install -r requirements-prod.txt

# Restart service
sudo systemctl restart talk2me

Rollback

If deployment fails:

# Stop service
sudo systemctl stop talk2me

# Restore backup
tar -xzf talk2me-backup.tar.gz -C /

# Restart service
sudo systemctl start talk2me