This comprehensive error logging system provides structured logging, automatic rotation, and detailed tracking for production debugging. Key features: - Structured JSON logging for easy parsing and analysis - Multiple log streams: app, errors, access, security, performance - Automatic log rotation to prevent disk space exhaustion - Request tracing with unique IDs for debugging - Performance metrics collection with slow request tracking - Security event logging for suspicious activities - Error deduplication and frequency tracking - Full exception details with stack traces Implementation details: - StructuredFormatter outputs JSON-formatted logs - ErrorLogger manages multiple specialized loggers - Rotating file handlers prevent disk space issues - Request context automatically included in logs - Performance decorator tracks execution times - Security events logged for audit trails - Admin endpoints for log analysis Admin endpoints: - GET /admin/logs/errors - View recent errors and frequencies - GET /admin/logs/performance - View performance metrics - GET /admin/logs/security - View security events Log types: - talk2me.log - General application logs - errors.log - Dedicated error logging with stack traces - access.log - HTTP request/response logs - security.log - Security events and suspicious activities - performance.log - Performance metrics and timing This provides production-grade observability critical for debugging issues, monitoring performance, and maintaining security in production environments. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
9.7 KiB
Error Logging Documentation
This document describes the comprehensive error logging system implemented in Talk2Me for debugging production issues.
Overview
Talk2Me implements a structured logging system that provides:
- JSON-formatted structured logs for easy parsing
- Multiple log streams (app, errors, access, security, performance)
- Automatic log rotation to prevent disk space issues
- Request tracing with unique IDs
- Performance metrics collection
- Security event tracking
- Error deduplication and frequency tracking
Log Types
1. Application Logs (logs/talk2me.log
)
General application logs including info, warnings, and debug messages.
{
"timestamp": "2024-01-15T10:30:45.123Z",
"level": "INFO",
"logger": "talk2me",
"message": "Whisper model loaded successfully",
"app": "talk2me",
"environment": "production",
"hostname": "server-1",
"thread": "MainThread",
"process": 12345
}
2. Error Logs (logs/errors.log
)
Dedicated error logging with full exception details and stack traces.
{
"timestamp": "2024-01-15T10:31:00.456Z",
"level": "ERROR",
"logger": "talk2me.errors",
"message": "Error in transcribe: File too large",
"exception": {
"type": "ValueError",
"message": "Audio file exceeds maximum size",
"traceback": ["...full stack trace..."]
},
"request_id": "1234567890-abcdef",
"endpoint": "transcribe",
"method": "POST",
"path": "/transcribe",
"ip": "192.168.1.100"
}
3. Access Logs (logs/access.log
)
HTTP request/response logging for traffic analysis.
{
"timestamp": "2024-01-15T10:32:00.789Z",
"level": "INFO",
"message": "request_complete",
"request_id": "1234567890-abcdef",
"method": "POST",
"path": "/transcribe",
"status": 200,
"duration_ms": 1250,
"content_length": 4096,
"ip": "192.168.1.100",
"user_agent": "Mozilla/5.0..."
}
4. Security Logs (logs/security.log
)
Security-related events and suspicious activities.
{
"timestamp": "2024-01-15T10:33:00.123Z",
"level": "WARNING",
"message": "Security event: rate_limit_exceeded",
"event": "rate_limit_exceeded",
"severity": "warning",
"ip": "192.168.1.100",
"endpoint": "/transcribe",
"attempts": 15,
"blocked": true
}
5. Performance Logs (logs/performance.log
)
Performance metrics and slow request tracking.
{
"timestamp": "2024-01-15T10:34:00.456Z",
"level": "INFO",
"message": "Performance metric: transcribe_audio",
"metric": "transcribe_audio",
"duration_ms": 2500,
"function": "transcribe",
"module": "app",
"request_id": "1234567890-abcdef"
}
Configuration
Environment Variables
# Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
export LOG_LEVEL=INFO
# Log file paths
export LOG_FILE=logs/talk2me.log
export ERROR_LOG_FILE=logs/errors.log
# Log rotation settings
export LOG_MAX_BYTES=52428800 # 50MB
export LOG_BACKUP_COUNT=10 # Keep 10 backup files
# Environment
export FLASK_ENV=production
Flask Configuration
app.config.update({
'LOG_LEVEL': 'INFO',
'LOG_FILE': 'logs/talk2me.log',
'ERROR_LOG_FILE': 'logs/errors.log',
'LOG_MAX_BYTES': 50 * 1024 * 1024,
'LOG_BACKUP_COUNT': 10
})
Admin API Endpoints
GET /admin/logs/errors
View recent error logs and error frequency statistics.
curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/errors
Response:
{
"error_summary": {
"abc123def456": {
"count_last_hour": 5,
"last_seen": 1705320000
}
},
"recent_errors": [...],
"total_errors_logged": 150
}
GET /admin/logs/performance
View performance metrics and slow requests.
curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/performance
Response:
{
"performance_metrics": {
"transcribe_audio": {
"avg_ms": 850.5,
"max_ms": 3200,
"min_ms": 125,
"count": 1024
}
},
"slow_requests": [
{
"metric": "transcribe_audio",
"duration_ms": 3200,
"timestamp": "2024-01-15T10:35:00Z"
}
]
}
GET /admin/logs/security
View security events and suspicious activities.
curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/security
Response:
{
"security_events": [...],
"event_summary": {
"rate_limit_exceeded": 25,
"suspicious_error": 3,
"high_error_rate": 1
},
"total_events": 29
}
Usage Patterns
1. Logging Errors with Context
from error_logger import log_exception
try:
# Some operation
process_audio(file)
except Exception as e:
log_exception(
e,
message="Failed to process audio",
user_id=user.id,
file_size=file.size,
file_type=file.content_type
)
2. Performance Monitoring
from error_logger import log_performance
@log_performance('expensive_operation')
def process_large_file(file):
# This will automatically log execution time
return processed_data
3. Security Event Logging
app.error_logger.log_security(
'unauthorized_access',
severity='warning',
ip=request.remote_addr,
attempted_resource='/admin',
user_agent=request.headers.get('User-Agent')
)
4. Request Context
from error_logger import log_context
with log_context(user_id=user.id, feature='translation'):
# All logs within this context will include user_id and feature
translate_text(text)
Log Analysis
Finding Specific Errors
# Find all authentication errors
grep '"error_type":"AuthenticationError"' logs/errors.log | jq .
# Find errors from specific IP
grep '"ip":"192.168.1.100"' logs/errors.log | jq .
# Find errors in last hour
grep "$(date -u -d '1 hour ago' +%Y-%m-%dT%H)" logs/errors.log | jq .
Performance Analysis
# Find slow requests (>2000ms)
jq 'select(.extra_fields.duration_ms > 2000)' logs/performance.log
# Calculate average response time for endpoint
jq 'select(.extra_fields.metric == "transcribe_audio") | .extra_fields.duration_ms' logs/performance.log | awk '{sum+=$1; count++} END {print sum/count}'
Security Monitoring
# Count security events by type
jq '.extra_fields.event' logs/security.log | sort | uniq -c
# Find all blocked IPs
jq 'select(.extra_fields.blocked == true) | .extra_fields.ip' logs/security.log | sort -u
Log Rotation
Logs are automatically rotated based on size or time:
- Application/Error logs: Rotate at 50MB, keep 10 backups
- Access logs: Daily rotation, keep 30 days
- Performance logs: Hourly rotation, keep 7 days
- Security logs: Rotate at 50MB, keep 10 backups
Rotated logs are named with numeric suffixes:
talk2me.log
(current)talk2me.log.1
(most recent backup)talk2me.log.2
(older backup)- etc.
Best Practices
1. Structured Logging
Always include relevant context:
logger.info("User action completed", extra={
'extra_fields': {
'user_id': user.id,
'action': 'upload_audio',
'file_size': file.size,
'duration_ms': processing_time
}
})
2. Error Handling
Log errors at appropriate levels:
try:
result = risky_operation()
except ValidationError as e:
logger.warning(f"Validation failed: {e}") # Expected errors
except Exception as e:
logger.error(f"Unexpected error: {e}", exc_info=True) # Unexpected errors
3. Performance Tracking
Track key operations:
start = time.time()
result = expensive_operation()
duration = (time.time() - start) * 1000
app.error_logger.log_performance(
'expensive_operation',
value=duration,
input_size=len(data),
output_size=len(result)
)
4. Security Awareness
Log security-relevant events:
if failed_attempts > 3:
app.error_logger.log_security(
'multiple_failed_attempts',
severity='warning',
ip=request.remote_addr,
attempts=failed_attempts
)
Monitoring Integration
Prometheus Metrics
Export log metrics for Prometheus:
@app.route('/metrics')
def prometheus_metrics():
error_summary = app.error_logger.get_error_summary()
# Format as Prometheus metrics
return format_prometheus_metrics(error_summary)
ELK Stack
Ship logs to Elasticsearch:
filebeat.inputs:
- type: log
paths:
- /app/logs/*.log
json.keys_under_root: true
json.add_error_key: true
CloudWatch
For AWS deployments:
# Install boto3 and watchtower
import watchtower
cloudwatch_handler = watchtower.CloudWatchLogHandler()
logger.addHandler(cloudwatch_handler)
Troubleshooting
Common Issues
1. Logs Not Being Written
Check permissions:
ls -la logs/
# Should show write permissions for app user
Create logs directory:
mkdir -p logs
chmod 755 logs
2. Disk Space Issues
Monitor log sizes:
du -sh logs/*
Force rotation:
# Manually rotate logs
mv logs/talk2me.log logs/talk2me.log.backup
# App will create new log file
3. Performance Impact
If logging impacts performance:
- Increase LOG_LEVEL to WARNING or ERROR
- Reduce backup count
- Use asynchronous logging (future enhancement)
Security Considerations
- Log Sanitization: Sensitive data is automatically masked
- Access Control: Admin endpoints require authentication
- Log Retention: Old logs are automatically deleted
- Encryption: Consider encrypting logs at rest in production
- Audit Trail: All log access is itself logged
Future Enhancements
- Centralized Logging: Ship logs to centralized service
- Real-time Alerts: Trigger alerts on error patterns
- Log Analytics: Built-in log analysis dashboard
- Correlation IDs: Track requests across microservices
- Async Logging: Reduce performance impact