Adolfo Delorenzo 92b7c41f61 Implement proper error logging - Critical for debugging production issues

This comprehensive error logging system provides structured logging, automatic rotation, and detailed tracking for production debugging.

Key features:
- Structured JSON logging for easy parsing and analysis
- Multiple log streams: app, errors, access, security, performance
- Automatic log rotation to prevent disk space exhaustion
- Request tracing with unique IDs for debugging
- Performance metrics collection with slow request tracking
- Security event logging for suspicious activities
- Error deduplication and frequency tracking
- Full exception details with stack traces

Implementation details:
- StructuredFormatter outputs JSON-formatted logs
- ErrorLogger manages multiple specialized loggers
- Rotating file handlers prevent disk space issues
- Request context automatically included in logs
- Performance decorator tracks execution times
- Security events logged for audit trails
- Admin endpoints for log analysis

Admin endpoints:
- GET /admin/logs/errors - View recent errors and frequencies
- GET /admin/logs/performance - View performance metrics
- GET /admin/logs/security - View security events

Log types:
- talk2me.log - General application logs
- errors.log - Dedicated error logging with stack traces
- access.log - HTTP request/response logs
- security.log - Security events and suspicious activities
- performance.log - Performance metrics and timing

This provides production-grade observability critical for debugging issues, monitoring performance, and maintaining security in production environments.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-06-03 08:11:26 -06:00

9.7 KiB

Raw Blame History

Error Logging Documentation

This document describes the comprehensive error logging system implemented in Talk2Me for debugging production issues.

Overview

Talk2Me implements a structured logging system that provides:

JSON-formatted structured logs for easy parsing
Multiple log streams (app, errors, access, security, performance)
Automatic log rotation to prevent disk space issues
Request tracing with unique IDs
Performance metrics collection
Security event tracking
Error deduplication and frequency tracking

Log Types

1. Application Logs (`logs/talk2me.log`)

General application logs including info, warnings, and debug messages.

{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "INFO",
  "logger": "talk2me",
  "message": "Whisper model loaded successfully",
  "app": "talk2me",
  "environment": "production",
  "hostname": "server-1",
  "thread": "MainThread",
  "process": 12345
}

2. Error Logs (`logs/errors.log`)

Dedicated error logging with full exception details and stack traces.

{
  "timestamp": "2024-01-15T10:31:00.456Z",
  "level": "ERROR",
  "logger": "talk2me.errors",
  "message": "Error in transcribe: File too large",
  "exception": {
    "type": "ValueError",
    "message": "Audio file exceeds maximum size",
    "traceback": ["...full stack trace..."]
  },
  "request_id": "1234567890-abcdef",
  "endpoint": "transcribe",
  "method": "POST",
  "path": "/transcribe",
  "ip": "192.168.1.100"
}

3. Access Logs (`logs/access.log`)

HTTP request/response logging for traffic analysis.

{
  "timestamp": "2024-01-15T10:32:00.789Z",
  "level": "INFO",
  "message": "request_complete",
  "request_id": "1234567890-abcdef",
  "method": "POST",
  "path": "/transcribe",
  "status": 200,
  "duration_ms": 1250,
  "content_length": 4096,
  "ip": "192.168.1.100",
  "user_agent": "Mozilla/5.0..."
}

4. Security Logs (`logs/security.log`)

Security-related events and suspicious activities.

{
  "timestamp": "2024-01-15T10:33:00.123Z",
  "level": "WARNING",
  "message": "Security event: rate_limit_exceeded",
  "event": "rate_limit_exceeded",
  "severity": "warning",
  "ip": "192.168.1.100",
  "endpoint": "/transcribe",
  "attempts": 15,
  "blocked": true
}

5. Performance Logs (`logs/performance.log`)

Performance metrics and slow request tracking.

{
  "timestamp": "2024-01-15T10:34:00.456Z",
  "level": "INFO",
  "message": "Performance metric: transcribe_audio",
  "metric": "transcribe_audio",
  "duration_ms": 2500,
  "function": "transcribe",
  "module": "app",
  "request_id": "1234567890-abcdef"
}

Configuration

Environment Variables

# Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
export LOG_LEVEL=INFO

# Log file paths
export LOG_FILE=logs/talk2me.log
export ERROR_LOG_FILE=logs/errors.log

# Log rotation settings
export LOG_MAX_BYTES=52428800      # 50MB
export LOG_BACKUP_COUNT=10         # Keep 10 backup files

# Environment
export FLASK_ENV=production

Flask Configuration

app.config.update({
    'LOG_LEVEL': 'INFO',
    'LOG_FILE': 'logs/talk2me.log',
    'ERROR_LOG_FILE': 'logs/errors.log',
    'LOG_MAX_BYTES': 50 * 1024 * 1024,
    'LOG_BACKUP_COUNT': 10
})

Admin API Endpoints

GET /admin/logs/errors

View recent error logs and error frequency statistics.

curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/errors

Response:

{
  "error_summary": {
    "abc123def456": {
      "count_last_hour": 5,
      "last_seen": 1705320000
    }
  },
  "recent_errors": [...],
  "total_errors_logged": 150
}

GET /admin/logs/performance

View performance metrics and slow requests.

curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/performance

Response:

{
  "performance_metrics": {
    "transcribe_audio": {
      "avg_ms": 850.5,
      "max_ms": 3200,
      "min_ms": 125,
      "count": 1024
    }
  },
  "slow_requests": [
    {
      "metric": "transcribe_audio",
      "duration_ms": 3200,
      "timestamp": "2024-01-15T10:35:00Z"
    }
  ]
}

GET /admin/logs/security

View security events and suspicious activities.

curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/security

Response:

{
  "security_events": [...],
  "event_summary": {
    "rate_limit_exceeded": 25,
    "suspicious_error": 3,
    "high_error_rate": 1
  },
  "total_events": 29
}

Usage Patterns

1. Logging Errors with Context

from error_logger import log_exception

try:
    # Some operation
    process_audio(file)
except Exception as e:
    log_exception(
        e,
        message="Failed to process audio",
        user_id=user.id,
        file_size=file.size,
        file_type=file.content_type
    )

2. Performance Monitoring

from error_logger import log_performance

@log_performance('expensive_operation')
def process_large_file(file):
    # This will automatically log execution time
    return processed_data

3. Security Event Logging

app.error_logger.log_security(
    'unauthorized_access',
    severity='warning',
    ip=request.remote_addr,
    attempted_resource='/admin',
    user_agent=request.headers.get('User-Agent')
)

4. Request Context

from error_logger import log_context

with log_context(user_id=user.id, feature='translation'):
    # All logs within this context will include user_id and feature
    translate_text(text)

Log Analysis

Finding Specific Errors

# Find all authentication errors
grep '"error_type":"AuthenticationError"' logs/errors.log | jq .

# Find errors from specific IP
grep '"ip":"192.168.1.100"' logs/errors.log | jq .

# Find errors in last hour
grep "$(date -u -d '1 hour ago' +%Y-%m-%dT%H)" logs/errors.log | jq .

Performance Analysis

# Find slow requests (>2000ms)
jq 'select(.extra_fields.duration_ms > 2000)' logs/performance.log

# Calculate average response time for endpoint
jq 'select(.extra_fields.metric == "transcribe_audio") | .extra_fields.duration_ms' logs/performance.log | awk '{sum+=$1; count++} END {print sum/count}'

Security Monitoring

# Count security events by type
jq '.extra_fields.event' logs/security.log | sort | uniq -c

# Find all blocked IPs
jq 'select(.extra_fields.blocked == true) | .extra_fields.ip' logs/security.log | sort -u

Log Rotation

Logs are automatically rotated based on size or time:

Application/Error logs: Rotate at 50MB, keep 10 backups
Access logs: Daily rotation, keep 30 days
Performance logs: Hourly rotation, keep 7 days
Security logs: Rotate at 50MB, keep 10 backups

Rotated logs are named with numeric suffixes:

talk2me.log (current)
talk2me.log.1 (most recent backup)
talk2me.log.2 (older backup)
etc.

Best Practices

1. Structured Logging

Always include relevant context:

logger.info("User action completed", extra={
    'extra_fields': {
        'user_id': user.id,
        'action': 'upload_audio',
        'file_size': file.size,
        'duration_ms': processing_time
    }
})

2. Error Handling

Log errors at appropriate levels:

try:
    result = risky_operation()
except ValidationError as e:
    logger.warning(f"Validation failed: {e}")  # Expected errors
except Exception as e:
    logger.error(f"Unexpected error: {e}", exc_info=True)  # Unexpected errors

3. Performance Tracking

Track key operations:

start = time.time()
result = expensive_operation()
duration = (time.time() - start) * 1000

app.error_logger.log_performance(
    'expensive_operation',
    value=duration,
    input_size=len(data),
    output_size=len(result)
)

4. Security Awareness

Log security-relevant events:

if failed_attempts > 3:
    app.error_logger.log_security(
        'multiple_failed_attempts',
        severity='warning',
        ip=request.remote_addr,
        attempts=failed_attempts
    )

Monitoring Integration

Prometheus Metrics

Export log metrics for Prometheus:

@app.route('/metrics')
def prometheus_metrics():
    error_summary = app.error_logger.get_error_summary()
    # Format as Prometheus metrics
    return format_prometheus_metrics(error_summary)

ELK Stack

Ship logs to Elasticsearch:

filebeat.inputs:
- type: log
  paths:
    - /app/logs/*.log
  json.keys_under_root: true
  json.add_error_key: true

CloudWatch

For AWS deployments:

# Install boto3 and watchtower
import watchtower
cloudwatch_handler = watchtower.CloudWatchLogHandler()
logger.addHandler(cloudwatch_handler)

Troubleshooting

Common Issues

1. Logs Not Being Written

Check permissions:

ls -la logs/
# Should show write permissions for app user

Create logs directory:

mkdir -p logs
chmod 755 logs

2. Disk Space Issues

Monitor log sizes:

du -sh logs/*

Force rotation:

# Manually rotate logs
mv logs/talk2me.log logs/talk2me.log.backup
# App will create new log file

3. Performance Impact

If logging impacts performance:

Increase LOG_LEVEL to WARNING or ERROR
Reduce backup count
Use asynchronous logging (future enhancement)

Security Considerations

Log Sanitization: Sensitive data is automatically masked
Access Control: Admin endpoints require authentication
Log Retention: Old logs are automatically deleted
Encryption: Consider encrypting logs at rest in production
Audit Trail: All log access is itself logged

Future Enhancements

Centralized Logging: Ship logs to centralized service
Real-time Alerts: Trigger alerts on error patterns
Log Analytics: Built-in log analysis dashboard
Correlation IDs: Track requests across microservices
Async Logging: Reduce performance impact

9.7 KiB Raw Blame History

Error Logging Documentation

Overview

Log Types

1. Application Logs (logs/talk2me.log)

2. Error Logs (logs/errors.log)

3. Access Logs (logs/access.log)

4. Security Logs (logs/security.log)

5. Performance Logs (logs/performance.log)

Configuration

Environment Variables

Flask Configuration

Admin API Endpoints

GET /admin/logs/errors

GET /admin/logs/performance

GET /admin/logs/security

Usage Patterns

1. Logging Errors with Context

2. Performance Monitoring

3. Security Event Logging

4. Request Context

Log Analysis

Finding Specific Errors

Performance Analysis

Security Monitoring

Log Rotation

Best Practices

1. Structured Logging

2. Error Handling

3. Performance Tracking

4. Security Awareness

Monitoring Integration

Prometheus Metrics

ELK Stack

CloudWatch

Troubleshooting

Common Issues

1. Logs Not Being Written

2. Disk Space Issues

3. Performance Impact

Security Considerations

Future Enhancements

9.7 KiB

Raw Blame History

1. Application Logs (`logs/talk2me.log`)

2. Error Logs (`logs/errors.log`)

3. Access Logs (`logs/access.log`)

4. Security Logs (`logs/security.log`)

5. Performance Logs (`logs/performance.log`)