# Error Logging Documentation This document describes the comprehensive error logging system implemented in Talk2Me for debugging production issues. ## Overview Talk2Me implements a structured logging system that provides: - JSON-formatted structured logs for easy parsing - Multiple log streams (app, errors, access, security, performance) - Automatic log rotation to prevent disk space issues - Request tracing with unique IDs - Performance metrics collection - Security event tracking - Error deduplication and frequency tracking ## Log Types ### 1. Application Logs (`logs/talk2me.log`) General application logs including info, warnings, and debug messages. ```json { "timestamp": "2024-01-15T10:30:45.123Z", "level": "INFO", "logger": "talk2me", "message": "Whisper model loaded successfully", "app": "talk2me", "environment": "production", "hostname": "server-1", "thread": "MainThread", "process": 12345 } ``` ### 2. Error Logs (`logs/errors.log`) Dedicated error logging with full exception details and stack traces. ```json { "timestamp": "2024-01-15T10:31:00.456Z", "level": "ERROR", "logger": "talk2me.errors", "message": "Error in transcribe: File too large", "exception": { "type": "ValueError", "message": "Audio file exceeds maximum size", "traceback": ["...full stack trace..."] }, "request_id": "1234567890-abcdef", "endpoint": "transcribe", "method": "POST", "path": "/transcribe", "ip": "192.168.1.100" } ``` ### 3. Access Logs (`logs/access.log`) HTTP request/response logging for traffic analysis. ```json { "timestamp": "2024-01-15T10:32:00.789Z", "level": "INFO", "message": "request_complete", "request_id": "1234567890-abcdef", "method": "POST", "path": "/transcribe", "status": 200, "duration_ms": 1250, "content_length": 4096, "ip": "192.168.1.100", "user_agent": "Mozilla/5.0..." } ``` ### 4. Security Logs (`logs/security.log`) Security-related events and suspicious activities. ```json { "timestamp": "2024-01-15T10:33:00.123Z", "level": "WARNING", "message": "Security event: rate_limit_exceeded", "event": "rate_limit_exceeded", "severity": "warning", "ip": "192.168.1.100", "endpoint": "/transcribe", "attempts": 15, "blocked": true } ``` ### 5. Performance Logs (`logs/performance.log`) Performance metrics and slow request tracking. ```json { "timestamp": "2024-01-15T10:34:00.456Z", "level": "INFO", "message": "Performance metric: transcribe_audio", "metric": "transcribe_audio", "duration_ms": 2500, "function": "transcribe", "module": "app", "request_id": "1234567890-abcdef" } ``` ## Configuration ### Environment Variables ```bash # Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL) export LOG_LEVEL=INFO # Log file paths export LOG_FILE=logs/talk2me.log export ERROR_LOG_FILE=logs/errors.log # Log rotation settings export LOG_MAX_BYTES=52428800 # 50MB export LOG_BACKUP_COUNT=10 # Keep 10 backup files # Environment export FLASK_ENV=production ``` ### Flask Configuration ```python app.config.update({ 'LOG_LEVEL': 'INFO', 'LOG_FILE': 'logs/talk2me.log', 'ERROR_LOG_FILE': 'logs/errors.log', 'LOG_MAX_BYTES': 50 * 1024 * 1024, 'LOG_BACKUP_COUNT': 10 }) ``` ## Admin API Endpoints ### GET /admin/logs/errors View recent error logs and error frequency statistics. ```bash curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/errors ``` Response: ```json { "error_summary": { "abc123def456": { "count_last_hour": 5, "last_seen": 1705320000 } }, "recent_errors": [...], "total_errors_logged": 150 } ``` ### GET /admin/logs/performance View performance metrics and slow requests. ```bash curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/performance ``` Response: ```json { "performance_metrics": { "transcribe_audio": { "avg_ms": 850.5, "max_ms": 3200, "min_ms": 125, "count": 1024 } }, "slow_requests": [ { "metric": "transcribe_audio", "duration_ms": 3200, "timestamp": "2024-01-15T10:35:00Z" } ] } ``` ### GET /admin/logs/security View security events and suspicious activities. ```bash curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/security ``` Response: ```json { "security_events": [...], "event_summary": { "rate_limit_exceeded": 25, "suspicious_error": 3, "high_error_rate": 1 }, "total_events": 29 } ``` ## Usage Patterns ### 1. Logging Errors with Context ```python from error_logger import log_exception try: # Some operation process_audio(file) except Exception as e: log_exception( e, message="Failed to process audio", user_id=user.id, file_size=file.size, file_type=file.content_type ) ``` ### 2. Performance Monitoring ```python from error_logger import log_performance @log_performance('expensive_operation') def process_large_file(file): # This will automatically log execution time return processed_data ``` ### 3. Security Event Logging ```python app.error_logger.log_security( 'unauthorized_access', severity='warning', ip=request.remote_addr, attempted_resource='/admin', user_agent=request.headers.get('User-Agent') ) ``` ### 4. Request Context ```python from error_logger import log_context with log_context(user_id=user.id, feature='translation'): # All logs within this context will include user_id and feature translate_text(text) ``` ## Log Analysis ### Finding Specific Errors ```bash # Find all authentication errors grep '"error_type":"AuthenticationError"' logs/errors.log | jq . # Find errors from specific IP grep '"ip":"192.168.1.100"' logs/errors.log | jq . # Find errors in last hour grep "$(date -u -d '1 hour ago' +%Y-%m-%dT%H)" logs/errors.log | jq . ``` ### Performance Analysis ```bash # Find slow requests (>2000ms) jq 'select(.extra_fields.duration_ms > 2000)' logs/performance.log # Calculate average response time for endpoint jq 'select(.extra_fields.metric == "transcribe_audio") | .extra_fields.duration_ms' logs/performance.log | awk '{sum+=$1; count++} END {print sum/count}' ``` ### Security Monitoring ```bash # Count security events by type jq '.extra_fields.event' logs/security.log | sort | uniq -c # Find all blocked IPs jq 'select(.extra_fields.blocked == true) | .extra_fields.ip' logs/security.log | sort -u ``` ## Log Rotation Logs are automatically rotated based on size or time: - **Application/Error logs**: Rotate at 50MB, keep 10 backups - **Access logs**: Daily rotation, keep 30 days - **Performance logs**: Hourly rotation, keep 7 days - **Security logs**: Rotate at 50MB, keep 10 backups Rotated logs are named with numeric suffixes: - `talk2me.log` (current) - `talk2me.log.1` (most recent backup) - `talk2me.log.2` (older backup) - etc. ## Best Practices ### 1. Structured Logging Always include relevant context: ```python logger.info("User action completed", extra={ 'extra_fields': { 'user_id': user.id, 'action': 'upload_audio', 'file_size': file.size, 'duration_ms': processing_time } }) ``` ### 2. Error Handling Log errors at appropriate levels: ```python try: result = risky_operation() except ValidationError as e: logger.warning(f"Validation failed: {e}") # Expected errors except Exception as e: logger.error(f"Unexpected error: {e}", exc_info=True) # Unexpected errors ``` ### 3. Performance Tracking Track key operations: ```python start = time.time() result = expensive_operation() duration = (time.time() - start) * 1000 app.error_logger.log_performance( 'expensive_operation', value=duration, input_size=len(data), output_size=len(result) ) ``` ### 4. Security Awareness Log security-relevant events: ```python if failed_attempts > 3: app.error_logger.log_security( 'multiple_failed_attempts', severity='warning', ip=request.remote_addr, attempts=failed_attempts ) ``` ## Monitoring Integration ### Prometheus Metrics Export log metrics for Prometheus: ```python @app.route('/metrics') def prometheus_metrics(): error_summary = app.error_logger.get_error_summary() # Format as Prometheus metrics return format_prometheus_metrics(error_summary) ``` ### ELK Stack Ship logs to Elasticsearch: ```yaml filebeat.inputs: - type: log paths: - /app/logs/*.log json.keys_under_root: true json.add_error_key: true ``` ### CloudWatch For AWS deployments: ```python # Install boto3 and watchtower import watchtower cloudwatch_handler = watchtower.CloudWatchLogHandler() logger.addHandler(cloudwatch_handler) ``` ## Troubleshooting ### Common Issues #### 1. Logs Not Being Written Check permissions: ```bash ls -la logs/ # Should show write permissions for app user ``` Create logs directory: ```bash mkdir -p logs chmod 755 logs ``` #### 2. Disk Space Issues Monitor log sizes: ```bash du -sh logs/* ``` Force rotation: ```bash # Manually rotate logs mv logs/talk2me.log logs/talk2me.log.backup # App will create new log file ``` #### 3. Performance Impact If logging impacts performance: - Increase LOG_LEVEL to WARNING or ERROR - Reduce backup count - Use asynchronous logging (future enhancement) ## Security Considerations 1. **Log Sanitization**: Sensitive data is automatically masked 2. **Access Control**: Admin endpoints require authentication 3. **Log Retention**: Old logs are automatically deleted 4. **Encryption**: Consider encrypting logs at rest in production 5. **Audit Trail**: All log access is itself logged ## Future Enhancements 1. **Centralized Logging**: Ship logs to centralized service 2. **Real-time Alerts**: Trigger alerts on error patterns 3. **Log Analytics**: Built-in log analysis dashboard 4. **Correlation IDs**: Track requests across microservices 5. **Async Logging**: Reduce performance impact