This comprehensive error logging system provides structured logging, automatic rotation, and detailed tracking for production debugging. Key features: - Structured JSON logging for easy parsing and analysis - Multiple log streams: app, errors, access, security, performance - Automatic log rotation to prevent disk space exhaustion - Request tracing with unique IDs for debugging - Performance metrics collection with slow request tracking - Security event logging for suspicious activities - Error deduplication and frequency tracking - Full exception details with stack traces Implementation details: - StructuredFormatter outputs JSON-formatted logs - ErrorLogger manages multiple specialized loggers - Rotating file handlers prevent disk space issues - Request context automatically included in logs - Performance decorator tracks execution times - Security events logged for audit trails - Admin endpoints for log analysis Admin endpoints: - GET /admin/logs/errors - View recent errors and frequencies - GET /admin/logs/performance - View performance metrics - GET /admin/logs/security - View security events Log types: - talk2me.log - General application logs - errors.log - Dedicated error logging with stack traces - access.log - HTTP request/response logs - security.log - Security events and suspicious activities - performance.log - Performance metrics and timing This provides production-grade observability critical for debugging issues, monitoring performance, and maintaining security in production environments. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
460 lines
9.7 KiB
Markdown
460 lines
9.7 KiB
Markdown
# Error Logging Documentation
|
|
|
|
This document describes the comprehensive error logging system implemented in Talk2Me for debugging production issues.
|
|
|
|
## Overview
|
|
|
|
Talk2Me implements a structured logging system that provides:
|
|
- JSON-formatted structured logs for easy parsing
|
|
- Multiple log streams (app, errors, access, security, performance)
|
|
- Automatic log rotation to prevent disk space issues
|
|
- Request tracing with unique IDs
|
|
- Performance metrics collection
|
|
- Security event tracking
|
|
- Error deduplication and frequency tracking
|
|
|
|
## Log Types
|
|
|
|
### 1. Application Logs (`logs/talk2me.log`)
|
|
General application logs including info, warnings, and debug messages.
|
|
|
|
```json
|
|
{
|
|
"timestamp": "2024-01-15T10:30:45.123Z",
|
|
"level": "INFO",
|
|
"logger": "talk2me",
|
|
"message": "Whisper model loaded successfully",
|
|
"app": "talk2me",
|
|
"environment": "production",
|
|
"hostname": "server-1",
|
|
"thread": "MainThread",
|
|
"process": 12345
|
|
}
|
|
```
|
|
|
|
### 2. Error Logs (`logs/errors.log`)
|
|
Dedicated error logging with full exception details and stack traces.
|
|
|
|
```json
|
|
{
|
|
"timestamp": "2024-01-15T10:31:00.456Z",
|
|
"level": "ERROR",
|
|
"logger": "talk2me.errors",
|
|
"message": "Error in transcribe: File too large",
|
|
"exception": {
|
|
"type": "ValueError",
|
|
"message": "Audio file exceeds maximum size",
|
|
"traceback": ["...full stack trace..."]
|
|
},
|
|
"request_id": "1234567890-abcdef",
|
|
"endpoint": "transcribe",
|
|
"method": "POST",
|
|
"path": "/transcribe",
|
|
"ip": "192.168.1.100"
|
|
}
|
|
```
|
|
|
|
### 3. Access Logs (`logs/access.log`)
|
|
HTTP request/response logging for traffic analysis.
|
|
|
|
```json
|
|
{
|
|
"timestamp": "2024-01-15T10:32:00.789Z",
|
|
"level": "INFO",
|
|
"message": "request_complete",
|
|
"request_id": "1234567890-abcdef",
|
|
"method": "POST",
|
|
"path": "/transcribe",
|
|
"status": 200,
|
|
"duration_ms": 1250,
|
|
"content_length": 4096,
|
|
"ip": "192.168.1.100",
|
|
"user_agent": "Mozilla/5.0..."
|
|
}
|
|
```
|
|
|
|
### 4. Security Logs (`logs/security.log`)
|
|
Security-related events and suspicious activities.
|
|
|
|
```json
|
|
{
|
|
"timestamp": "2024-01-15T10:33:00.123Z",
|
|
"level": "WARNING",
|
|
"message": "Security event: rate_limit_exceeded",
|
|
"event": "rate_limit_exceeded",
|
|
"severity": "warning",
|
|
"ip": "192.168.1.100",
|
|
"endpoint": "/transcribe",
|
|
"attempts": 15,
|
|
"blocked": true
|
|
}
|
|
```
|
|
|
|
### 5. Performance Logs (`logs/performance.log`)
|
|
Performance metrics and slow request tracking.
|
|
|
|
```json
|
|
{
|
|
"timestamp": "2024-01-15T10:34:00.456Z",
|
|
"level": "INFO",
|
|
"message": "Performance metric: transcribe_audio",
|
|
"metric": "transcribe_audio",
|
|
"duration_ms": 2500,
|
|
"function": "transcribe",
|
|
"module": "app",
|
|
"request_id": "1234567890-abcdef"
|
|
}
|
|
```
|
|
|
|
## Configuration
|
|
|
|
### Environment Variables
|
|
|
|
```bash
|
|
# Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
|
|
export LOG_LEVEL=INFO
|
|
|
|
# Log file paths
|
|
export LOG_FILE=logs/talk2me.log
|
|
export ERROR_LOG_FILE=logs/errors.log
|
|
|
|
# Log rotation settings
|
|
export LOG_MAX_BYTES=52428800 # 50MB
|
|
export LOG_BACKUP_COUNT=10 # Keep 10 backup files
|
|
|
|
# Environment
|
|
export FLASK_ENV=production
|
|
```
|
|
|
|
### Flask Configuration
|
|
|
|
```python
|
|
app.config.update({
|
|
'LOG_LEVEL': 'INFO',
|
|
'LOG_FILE': 'logs/talk2me.log',
|
|
'ERROR_LOG_FILE': 'logs/errors.log',
|
|
'LOG_MAX_BYTES': 50 * 1024 * 1024,
|
|
'LOG_BACKUP_COUNT': 10
|
|
})
|
|
```
|
|
|
|
## Admin API Endpoints
|
|
|
|
### GET /admin/logs/errors
|
|
View recent error logs and error frequency statistics.
|
|
|
|
```bash
|
|
curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/errors
|
|
```
|
|
|
|
Response:
|
|
```json
|
|
{
|
|
"error_summary": {
|
|
"abc123def456": {
|
|
"count_last_hour": 5,
|
|
"last_seen": 1705320000
|
|
}
|
|
},
|
|
"recent_errors": [...],
|
|
"total_errors_logged": 150
|
|
}
|
|
```
|
|
|
|
### GET /admin/logs/performance
|
|
View performance metrics and slow requests.
|
|
|
|
```bash
|
|
curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/performance
|
|
```
|
|
|
|
Response:
|
|
```json
|
|
{
|
|
"performance_metrics": {
|
|
"transcribe_audio": {
|
|
"avg_ms": 850.5,
|
|
"max_ms": 3200,
|
|
"min_ms": 125,
|
|
"count": 1024
|
|
}
|
|
},
|
|
"slow_requests": [
|
|
{
|
|
"metric": "transcribe_audio",
|
|
"duration_ms": 3200,
|
|
"timestamp": "2024-01-15T10:35:00Z"
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
### GET /admin/logs/security
|
|
View security events and suspicious activities.
|
|
|
|
```bash
|
|
curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/security
|
|
```
|
|
|
|
Response:
|
|
```json
|
|
{
|
|
"security_events": [...],
|
|
"event_summary": {
|
|
"rate_limit_exceeded": 25,
|
|
"suspicious_error": 3,
|
|
"high_error_rate": 1
|
|
},
|
|
"total_events": 29
|
|
}
|
|
```
|
|
|
|
## Usage Patterns
|
|
|
|
### 1. Logging Errors with Context
|
|
|
|
```python
|
|
from error_logger import log_exception
|
|
|
|
try:
|
|
# Some operation
|
|
process_audio(file)
|
|
except Exception as e:
|
|
log_exception(
|
|
e,
|
|
message="Failed to process audio",
|
|
user_id=user.id,
|
|
file_size=file.size,
|
|
file_type=file.content_type
|
|
)
|
|
```
|
|
|
|
### 2. Performance Monitoring
|
|
|
|
```python
|
|
from error_logger import log_performance
|
|
|
|
@log_performance('expensive_operation')
|
|
def process_large_file(file):
|
|
# This will automatically log execution time
|
|
return processed_data
|
|
```
|
|
|
|
### 3. Security Event Logging
|
|
|
|
```python
|
|
app.error_logger.log_security(
|
|
'unauthorized_access',
|
|
severity='warning',
|
|
ip=request.remote_addr,
|
|
attempted_resource='/admin',
|
|
user_agent=request.headers.get('User-Agent')
|
|
)
|
|
```
|
|
|
|
### 4. Request Context
|
|
|
|
```python
|
|
from error_logger import log_context
|
|
|
|
with log_context(user_id=user.id, feature='translation'):
|
|
# All logs within this context will include user_id and feature
|
|
translate_text(text)
|
|
```
|
|
|
|
## Log Analysis
|
|
|
|
### Finding Specific Errors
|
|
|
|
```bash
|
|
# Find all authentication errors
|
|
grep '"error_type":"AuthenticationError"' logs/errors.log | jq .
|
|
|
|
# Find errors from specific IP
|
|
grep '"ip":"192.168.1.100"' logs/errors.log | jq .
|
|
|
|
# Find errors in last hour
|
|
grep "$(date -u -d '1 hour ago' +%Y-%m-%dT%H)" logs/errors.log | jq .
|
|
```
|
|
|
|
### Performance Analysis
|
|
|
|
```bash
|
|
# Find slow requests (>2000ms)
|
|
jq 'select(.extra_fields.duration_ms > 2000)' logs/performance.log
|
|
|
|
# Calculate average response time for endpoint
|
|
jq 'select(.extra_fields.metric == "transcribe_audio") | .extra_fields.duration_ms' logs/performance.log | awk '{sum+=$1; count++} END {print sum/count}'
|
|
```
|
|
|
|
### Security Monitoring
|
|
|
|
```bash
|
|
# Count security events by type
|
|
jq '.extra_fields.event' logs/security.log | sort | uniq -c
|
|
|
|
# Find all blocked IPs
|
|
jq 'select(.extra_fields.blocked == true) | .extra_fields.ip' logs/security.log | sort -u
|
|
```
|
|
|
|
## Log Rotation
|
|
|
|
Logs are automatically rotated based on size or time:
|
|
|
|
- **Application/Error logs**: Rotate at 50MB, keep 10 backups
|
|
- **Access logs**: Daily rotation, keep 30 days
|
|
- **Performance logs**: Hourly rotation, keep 7 days
|
|
- **Security logs**: Rotate at 50MB, keep 10 backups
|
|
|
|
Rotated logs are named with numeric suffixes:
|
|
- `talk2me.log` (current)
|
|
- `talk2me.log.1` (most recent backup)
|
|
- `talk2me.log.2` (older backup)
|
|
- etc.
|
|
|
|
## Best Practices
|
|
|
|
### 1. Structured Logging
|
|
|
|
Always include relevant context:
|
|
```python
|
|
logger.info("User action completed", extra={
|
|
'extra_fields': {
|
|
'user_id': user.id,
|
|
'action': 'upload_audio',
|
|
'file_size': file.size,
|
|
'duration_ms': processing_time
|
|
}
|
|
})
|
|
```
|
|
|
|
### 2. Error Handling
|
|
|
|
Log errors at appropriate levels:
|
|
```python
|
|
try:
|
|
result = risky_operation()
|
|
except ValidationError as e:
|
|
logger.warning(f"Validation failed: {e}") # Expected errors
|
|
except Exception as e:
|
|
logger.error(f"Unexpected error: {e}", exc_info=True) # Unexpected errors
|
|
```
|
|
|
|
### 3. Performance Tracking
|
|
|
|
Track key operations:
|
|
```python
|
|
start = time.time()
|
|
result = expensive_operation()
|
|
duration = (time.time() - start) * 1000
|
|
|
|
app.error_logger.log_performance(
|
|
'expensive_operation',
|
|
value=duration,
|
|
input_size=len(data),
|
|
output_size=len(result)
|
|
)
|
|
```
|
|
|
|
### 4. Security Awareness
|
|
|
|
Log security-relevant events:
|
|
```python
|
|
if failed_attempts > 3:
|
|
app.error_logger.log_security(
|
|
'multiple_failed_attempts',
|
|
severity='warning',
|
|
ip=request.remote_addr,
|
|
attempts=failed_attempts
|
|
)
|
|
```
|
|
|
|
## Monitoring Integration
|
|
|
|
### Prometheus Metrics
|
|
|
|
Export log metrics for Prometheus:
|
|
```python
|
|
@app.route('/metrics')
|
|
def prometheus_metrics():
|
|
error_summary = app.error_logger.get_error_summary()
|
|
# Format as Prometheus metrics
|
|
return format_prometheus_metrics(error_summary)
|
|
```
|
|
|
|
### ELK Stack
|
|
|
|
Ship logs to Elasticsearch:
|
|
```yaml
|
|
filebeat.inputs:
|
|
- type: log
|
|
paths:
|
|
- /app/logs/*.log
|
|
json.keys_under_root: true
|
|
json.add_error_key: true
|
|
```
|
|
|
|
### CloudWatch
|
|
|
|
For AWS deployments:
|
|
```python
|
|
# Install boto3 and watchtower
|
|
import watchtower
|
|
cloudwatch_handler = watchtower.CloudWatchLogHandler()
|
|
logger.addHandler(cloudwatch_handler)
|
|
```
|
|
|
|
## Troubleshooting
|
|
|
|
### Common Issues
|
|
|
|
#### 1. Logs Not Being Written
|
|
|
|
Check permissions:
|
|
```bash
|
|
ls -la logs/
|
|
# Should show write permissions for app user
|
|
```
|
|
|
|
Create logs directory:
|
|
```bash
|
|
mkdir -p logs
|
|
chmod 755 logs
|
|
```
|
|
|
|
#### 2. Disk Space Issues
|
|
|
|
Monitor log sizes:
|
|
```bash
|
|
du -sh logs/*
|
|
```
|
|
|
|
Force rotation:
|
|
```bash
|
|
# Manually rotate logs
|
|
mv logs/talk2me.log logs/talk2me.log.backup
|
|
# App will create new log file
|
|
```
|
|
|
|
#### 3. Performance Impact
|
|
|
|
If logging impacts performance:
|
|
- Increase LOG_LEVEL to WARNING or ERROR
|
|
- Reduce backup count
|
|
- Use asynchronous logging (future enhancement)
|
|
|
|
## Security Considerations
|
|
|
|
1. **Log Sanitization**: Sensitive data is automatically masked
|
|
2. **Access Control**: Admin endpoints require authentication
|
|
3. **Log Retention**: Old logs are automatically deleted
|
|
4. **Encryption**: Consider encrypting logs at rest in production
|
|
5. **Audit Trail**: All log access is itself logged
|
|
|
|
## Future Enhancements
|
|
|
|
1. **Centralized Logging**: Ship logs to centralized service
|
|
2. **Real-time Alerts**: Trigger alerts on error patterns
|
|
3. **Log Analytics**: Built-in log analysis dashboard
|
|
4. **Correlation IDs**: Track requests across microservices
|
|
5. **Async Logging**: Reduce performance impact |