talk2me/ERROR_LOGGING.md
Adolfo Delorenzo 92b7c41f61 Implement proper error logging - Critical for debugging production issues
This comprehensive error logging system provides structured logging, automatic rotation, and detailed tracking for production debugging.

Key features:
- Structured JSON logging for easy parsing and analysis
- Multiple log streams: app, errors, access, security, performance
- Automatic log rotation to prevent disk space exhaustion
- Request tracing with unique IDs for debugging
- Performance metrics collection with slow request tracking
- Security event logging for suspicious activities
- Error deduplication and frequency tracking
- Full exception details with stack traces

Implementation details:
- StructuredFormatter outputs JSON-formatted logs
- ErrorLogger manages multiple specialized loggers
- Rotating file handlers prevent disk space issues
- Request context automatically included in logs
- Performance decorator tracks execution times
- Security events logged for audit trails
- Admin endpoints for log analysis

Admin endpoints:
- GET /admin/logs/errors - View recent errors and frequencies
- GET /admin/logs/performance - View performance metrics
- GET /admin/logs/security - View security events

Log types:
- talk2me.log - General application logs
- errors.log - Dedicated error logging with stack traces
- access.log - HTTP request/response logs
- security.log - Security events and suspicious activities
- performance.log - Performance metrics and timing

This provides production-grade observability critical for debugging issues, monitoring performance, and maintaining security in production environments.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-03 08:11:26 -06:00

9.7 KiB

Error Logging Documentation

This document describes the comprehensive error logging system implemented in Talk2Me for debugging production issues.

Overview

Talk2Me implements a structured logging system that provides:

  • JSON-formatted structured logs for easy parsing
  • Multiple log streams (app, errors, access, security, performance)
  • Automatic log rotation to prevent disk space issues
  • Request tracing with unique IDs
  • Performance metrics collection
  • Security event tracking
  • Error deduplication and frequency tracking

Log Types

1. Application Logs (logs/talk2me.log)

General application logs including info, warnings, and debug messages.

{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "INFO",
  "logger": "talk2me",
  "message": "Whisper model loaded successfully",
  "app": "talk2me",
  "environment": "production",
  "hostname": "server-1",
  "thread": "MainThread",
  "process": 12345
}

2. Error Logs (logs/errors.log)

Dedicated error logging with full exception details and stack traces.

{
  "timestamp": "2024-01-15T10:31:00.456Z",
  "level": "ERROR",
  "logger": "talk2me.errors",
  "message": "Error in transcribe: File too large",
  "exception": {
    "type": "ValueError",
    "message": "Audio file exceeds maximum size",
    "traceback": ["...full stack trace..."]
  },
  "request_id": "1234567890-abcdef",
  "endpoint": "transcribe",
  "method": "POST",
  "path": "/transcribe",
  "ip": "192.168.1.100"
}

3. Access Logs (logs/access.log)

HTTP request/response logging for traffic analysis.

{
  "timestamp": "2024-01-15T10:32:00.789Z",
  "level": "INFO",
  "message": "request_complete",
  "request_id": "1234567890-abcdef",
  "method": "POST",
  "path": "/transcribe",
  "status": 200,
  "duration_ms": 1250,
  "content_length": 4096,
  "ip": "192.168.1.100",
  "user_agent": "Mozilla/5.0..."
}

4. Security Logs (logs/security.log)

Security-related events and suspicious activities.

{
  "timestamp": "2024-01-15T10:33:00.123Z",
  "level": "WARNING",
  "message": "Security event: rate_limit_exceeded",
  "event": "rate_limit_exceeded",
  "severity": "warning",
  "ip": "192.168.1.100",
  "endpoint": "/transcribe",
  "attempts": 15,
  "blocked": true
}

5. Performance Logs (logs/performance.log)

Performance metrics and slow request tracking.

{
  "timestamp": "2024-01-15T10:34:00.456Z",
  "level": "INFO",
  "message": "Performance metric: transcribe_audio",
  "metric": "transcribe_audio",
  "duration_ms": 2500,
  "function": "transcribe",
  "module": "app",
  "request_id": "1234567890-abcdef"
}

Configuration

Environment Variables

# Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
export LOG_LEVEL=INFO

# Log file paths
export LOG_FILE=logs/talk2me.log
export ERROR_LOG_FILE=logs/errors.log

# Log rotation settings
export LOG_MAX_BYTES=52428800      # 50MB
export LOG_BACKUP_COUNT=10         # Keep 10 backup files

# Environment
export FLASK_ENV=production

Flask Configuration

app.config.update({
    'LOG_LEVEL': 'INFO',
    'LOG_FILE': 'logs/talk2me.log',
    'ERROR_LOG_FILE': 'logs/errors.log',
    'LOG_MAX_BYTES': 50 * 1024 * 1024,
    'LOG_BACKUP_COUNT': 10
})

Admin API Endpoints

GET /admin/logs/errors

View recent error logs and error frequency statistics.

curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/errors

Response:

{
  "error_summary": {
    "abc123def456": {
      "count_last_hour": 5,
      "last_seen": 1705320000
    }
  },
  "recent_errors": [...],
  "total_errors_logged": 150
}

GET /admin/logs/performance

View performance metrics and slow requests.

curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/performance

Response:

{
  "performance_metrics": {
    "transcribe_audio": {
      "avg_ms": 850.5,
      "max_ms": 3200,
      "min_ms": 125,
      "count": 1024
    }
  },
  "slow_requests": [
    {
      "metric": "transcribe_audio",
      "duration_ms": 3200,
      "timestamp": "2024-01-15T10:35:00Z"
    }
  ]
}

GET /admin/logs/security

View security events and suspicious activities.

curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/security

Response:

{
  "security_events": [...],
  "event_summary": {
    "rate_limit_exceeded": 25,
    "suspicious_error": 3,
    "high_error_rate": 1
  },
  "total_events": 29
}

Usage Patterns

1. Logging Errors with Context

from error_logger import log_exception

try:
    # Some operation
    process_audio(file)
except Exception as e:
    log_exception(
        e,
        message="Failed to process audio",
        user_id=user.id,
        file_size=file.size,
        file_type=file.content_type
    )

2. Performance Monitoring

from error_logger import log_performance

@log_performance('expensive_operation')
def process_large_file(file):
    # This will automatically log execution time
    return processed_data

3. Security Event Logging

app.error_logger.log_security(
    'unauthorized_access',
    severity='warning',
    ip=request.remote_addr,
    attempted_resource='/admin',
    user_agent=request.headers.get('User-Agent')
)

4. Request Context

from error_logger import log_context

with log_context(user_id=user.id, feature='translation'):
    # All logs within this context will include user_id and feature
    translate_text(text)

Log Analysis

Finding Specific Errors

# Find all authentication errors
grep '"error_type":"AuthenticationError"' logs/errors.log | jq .

# Find errors from specific IP
grep '"ip":"192.168.1.100"' logs/errors.log | jq .

# Find errors in last hour
grep "$(date -u -d '1 hour ago' +%Y-%m-%dT%H)" logs/errors.log | jq .

Performance Analysis

# Find slow requests (>2000ms)
jq 'select(.extra_fields.duration_ms > 2000)' logs/performance.log

# Calculate average response time for endpoint
jq 'select(.extra_fields.metric == "transcribe_audio") | .extra_fields.duration_ms' logs/performance.log | awk '{sum+=$1; count++} END {print sum/count}'

Security Monitoring

# Count security events by type
jq '.extra_fields.event' logs/security.log | sort | uniq -c

# Find all blocked IPs
jq 'select(.extra_fields.blocked == true) | .extra_fields.ip' logs/security.log | sort -u

Log Rotation

Logs are automatically rotated based on size or time:

  • Application/Error logs: Rotate at 50MB, keep 10 backups
  • Access logs: Daily rotation, keep 30 days
  • Performance logs: Hourly rotation, keep 7 days
  • Security logs: Rotate at 50MB, keep 10 backups

Rotated logs are named with numeric suffixes:

  • talk2me.log (current)
  • talk2me.log.1 (most recent backup)
  • talk2me.log.2 (older backup)
  • etc.

Best Practices

1. Structured Logging

Always include relevant context:

logger.info("User action completed", extra={
    'extra_fields': {
        'user_id': user.id,
        'action': 'upload_audio',
        'file_size': file.size,
        'duration_ms': processing_time
    }
})

2. Error Handling

Log errors at appropriate levels:

try:
    result = risky_operation()
except ValidationError as e:
    logger.warning(f"Validation failed: {e}")  # Expected errors
except Exception as e:
    logger.error(f"Unexpected error: {e}", exc_info=True)  # Unexpected errors

3. Performance Tracking

Track key operations:

start = time.time()
result = expensive_operation()
duration = (time.time() - start) * 1000

app.error_logger.log_performance(
    'expensive_operation',
    value=duration,
    input_size=len(data),
    output_size=len(result)
)

4. Security Awareness

Log security-relevant events:

if failed_attempts > 3:
    app.error_logger.log_security(
        'multiple_failed_attempts',
        severity='warning',
        ip=request.remote_addr,
        attempts=failed_attempts
    )

Monitoring Integration

Prometheus Metrics

Export log metrics for Prometheus:

@app.route('/metrics')
def prometheus_metrics():
    error_summary = app.error_logger.get_error_summary()
    # Format as Prometheus metrics
    return format_prometheus_metrics(error_summary)

ELK Stack

Ship logs to Elasticsearch:

filebeat.inputs:
- type: log
  paths:
    - /app/logs/*.log
  json.keys_under_root: true
  json.add_error_key: true

CloudWatch

For AWS deployments:

# Install boto3 and watchtower
import watchtower
cloudwatch_handler = watchtower.CloudWatchLogHandler()
logger.addHandler(cloudwatch_handler)

Troubleshooting

Common Issues

1. Logs Not Being Written

Check permissions:

ls -la logs/
# Should show write permissions for app user

Create logs directory:

mkdir -p logs
chmod 755 logs

2. Disk Space Issues

Monitor log sizes:

du -sh logs/*

Force rotation:

# Manually rotate logs
mv logs/talk2me.log logs/talk2me.log.backup
# App will create new log file

3. Performance Impact

If logging impacts performance:

  • Increase LOG_LEVEL to WARNING or ERROR
  • Reduce backup count
  • Use asynchronous logging (future enhancement)

Security Considerations

  1. Log Sanitization: Sensitive data is automatically masked
  2. Access Control: Admin endpoints require authentication
  3. Log Retention: Old logs are automatically deleted
  4. Encryption: Consider encrypting logs at rest in production
  5. Audit Trail: All log access is itself logged

Future Enhancements

  1. Centralized Logging: Ship logs to centralized service
  2. Real-time Alerts: Trigger alerts on error patterns
  3. Log Analytics: Built-in log analysis dashboard
  4. Correlation IDs: Track requests across microservices
  5. Async Logging: Reduce performance impact