talk2me/ERROR_LOGGING.md
Adolfo Delorenzo 92b7c41f61 Implement proper error logging - Critical for debugging production issues
This comprehensive error logging system provides structured logging, automatic rotation, and detailed tracking for production debugging.

Key features:
- Structured JSON logging for easy parsing and analysis
- Multiple log streams: app, errors, access, security, performance
- Automatic log rotation to prevent disk space exhaustion
- Request tracing with unique IDs for debugging
- Performance metrics collection with slow request tracking
- Security event logging for suspicious activities
- Error deduplication and frequency tracking
- Full exception details with stack traces

Implementation details:
- StructuredFormatter outputs JSON-formatted logs
- ErrorLogger manages multiple specialized loggers
- Rotating file handlers prevent disk space issues
- Request context automatically included in logs
- Performance decorator tracks execution times
- Security events logged for audit trails
- Admin endpoints for log analysis

Admin endpoints:
- GET /admin/logs/errors - View recent errors and frequencies
- GET /admin/logs/performance - View performance metrics
- GET /admin/logs/security - View security events

Log types:
- talk2me.log - General application logs
- errors.log - Dedicated error logging with stack traces
- access.log - HTTP request/response logs
- security.log - Security events and suspicious activities
- performance.log - Performance metrics and timing

This provides production-grade observability critical for debugging issues, monitoring performance, and maintaining security in production environments.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-06-03 08:11:26 -06:00


# Error Logging Documentation
This document describes the comprehensive error logging system implemented in Talk2Me for debugging production issues.
## Overview
Talk2Me implements a structured logging system that provides:
- JSON-formatted structured logs for easy parsing
- Multiple log streams (app, errors, access, security, performance)
- Automatic log rotation to prevent disk space issues
- Request tracing with unique IDs
- Performance metrics collection
- Security event tracking
- Error deduplication and frequency tracking
## Log Types
### 1. Application Logs (`logs/talk2me.log`)
General application logs including info, warnings, and debug messages.
```json
{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "INFO",
  "logger": "talk2me",
  "message": "Whisper model loaded successfully",
  "app": "talk2me",
  "environment": "production",
  "hostname": "server-1",
  "thread": "MainThread",
  "process": 12345
}
```
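A formatter that produces records of this shape could look roughly like the sketch below. It is an illustration built on the standard `logging` module, not the project's actual `StructuredFormatter`; the constructor arguments and the convention of merging an `extra_fields` dict into the top level of the record are assumptions.

```python
import json
import logging
import socket
from datetime import datetime, timezone


class StructuredFormatter(logging.Formatter):
    """Render each log record as a single JSON object (illustrative sketch)."""

    def __init__(self, app_name="talk2me", environment="production"):
        super().__init__()
        self.app_name = app_name
        self.environment = environment
        self.hostname = socket.gethostname()

    def format(self, record):
        # UTC timestamp with millisecond precision and a trailing "Z".
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        stamp = ts.strftime("%Y-%m-%dT%H:%M:%S.") + f"{ts.microsecond // 1000:03d}Z"
        payload = {
            "timestamp": stamp,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "app": self.app_name,
            "environment": self.environment,
            "hostname": self.hostname,
            "thread": record.threadName,
            "process": record.process,
        }
        # Assumption: extra context arrives as a dict under `extra_fields`
        # and is merged into the top level of the JSON object.
        payload.update(getattr(record, "extra_fields", {}))
        return json.dumps(payload, default=str)


def demo_line():
    """Format one hand-built record and return the resulting JSON string."""
    record = logging.LogRecord(
        "talk2me", logging.INFO, "demo.py", 0,
        "Whisper model loaded successfully", None, None,
    )
    return StructuredFormatter().format(record)
```

Attaching the formatter with `handler.setFormatter(StructuredFormatter())` would make every handler write one JSON object per line, which is what the `jq` commands later in this document rely on.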
### 2. Error Logs (`logs/errors.log`)
Dedicated error logging with full exception details and stack traces.
```json
{
  "timestamp": "2024-01-15T10:31:00.456Z",
  "level": "ERROR",
  "logger": "talk2me.errors",
  "message": "Error in transcribe: File too large",
  "exception": {
    "type": "ValueError",
    "message": "Audio file exceeds maximum size",
    "traceback": ["...full stack trace..."]
  },
  "request_id": "1234567890-abcdef",
  "endpoint": "transcribe",
  "method": "POST",
  "path": "/transcribe",
  "ip": "192.168.1.100"
}
```
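The `exception` object in this record can be derived from a caught exception with the standard `traceback` module. The helper name `exception_to_dict` below is hypothetical and only illustrates one way to build the three fields.

```python
import traceback


def exception_to_dict(exc):
    """Serialize an exception into type / message / traceback fields."""
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        # format_exception returns a list of strings covering the full stack trace.
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
    }


def demo():
    """Raise and capture a sample error, returning its serialized form."""
    try:
        raise ValueError("Audio file exceeds maximum size")
    except ValueError as exc:
        return exception_to_dict(exc)
```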
### 3. Access Logs (`logs/access.log`)
HTTP request/response logging for traffic analysis.
```json
{
  "timestamp": "2024-01-15T10:32:00.789Z",
  "level": "INFO",
  "message": "request_complete",
  "request_id": "1234567890-abcdef",
  "method": "POST",
  "path": "/transcribe",
  "status": 200,
  "duration_ms": 1250,
  "content_length": 4096,
  "ip": "192.168.1.100",
  "user_agent": "Mozilla/5.0..."
}
```
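To show where fields such as `request_id`, `status`, and `duration_ms` can come from, here is a framework-neutral WSGI sketch. Talk2Me itself is a Flask app and most likely uses request hooks instead; the class name `AccessLogMiddleware`, the `emit` callback, and the request ID format are all assumptions made for illustration.

```python
import time
import uuid


def new_request_id():
    # Assumed format "<unix seconds>-<6 hex chars>", mirroring the sample IDs.
    return f"{int(time.time())}-{uuid.uuid4().hex[:6]}"


class AccessLogMiddleware:
    """WSGI wrapper that emits one access record per request (sketch)."""

    def __init__(self, app, emit):
        self.app = app
        self.emit = emit  # callable receiving the record dict

    def __call__(self, environ, start_response):
        request_id = new_request_id()
        started = time.perf_counter()
        seen = {"status": None}

        def recording_start_response(status, headers, exc_info=None):
            # The status line looks like "200 OK"; keep only the numeric code.
            seen["status"] = int(status.split(" ", 1)[0])
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, recording_start_response)
        finally:
            # Measures time until the app returns its response iterable;
            # time spent streaming the body afterwards is not included.
            self.emit({
                "message": "request_complete",
                "request_id": request_id,
                "method": environ.get("REQUEST_METHOD"),
                "path": environ.get("PATH_INFO"),
                "status": seen["status"],
                "duration_ms": round((time.perf_counter() - started) * 1000, 2),
                "ip": environ.get("REMOTE_ADDR"),
                "user_agent": environ.get("HTTP_USER_AGENT"),
            })
```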
### 4. Security Logs (`logs/security.log`)
Security-related events and suspicious activities.
```json
{
  "timestamp": "2024-01-15T10:33:00.123Z",
  "level": "WARNING",
  "message": "Security event: rate_limit_exceeded",
  "event": "rate_limit_exceeded",
  "severity": "warning",
  "ip": "192.168.1.100",
  "endpoint": "/transcribe",
  "attempts": 15,
  "blocked": true
}
```
### 5. Performance Logs (`logs/performance.log`)
Performance metrics and slow request tracking.
```json
{
  "timestamp": "2024-01-15T10:34:00.456Z",
  "level": "INFO",
  "message": "Performance metric: transcribe_audio",
  "metric": "transcribe_audio",
  "duration_ms": 2500,
  "function": "transcribe",
  "module": "app",
  "request_id": "1234567890-abcdef"
}
```
## Configuration
### Environment Variables
```bash
# Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
export LOG_LEVEL=INFO
# Log file paths
export LOG_FILE=logs/talk2me.log
export ERROR_LOG_FILE=logs/errors.log
# Log rotation settings
export LOG_MAX_BYTES=52428800 # 50MB
export LOG_BACKUP_COUNT=10 # Keep 10 backup files
# Environment
export FLASK_ENV=production
```
### Flask Configuration
```python
app.config.update({
    'LOG_LEVEL': 'INFO',
    'LOG_FILE': 'logs/talk2me.log',
    'ERROR_LOG_FILE': 'logs/errors.log',
    'LOG_MAX_BYTES': 50 * 1024 * 1024,
    'LOG_BACKUP_COUNT': 10
})
```
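Environment variables always arrive as strings, so numeric settings need casting before they reach the logging handlers. The sketch below shows one way to resolve the settings listed above with the documented defaults; the function name `load_logging_config` and the rule that environment values override defaults are assumptions, not the project's confirmed behavior.

```python
import os

# Defaults taken from the configuration examples in this section.
DEFAULTS = {
    "LOG_LEVEL": "INFO",
    "LOG_FILE": "logs/talk2me.log",
    "ERROR_LOG_FILE": "logs/errors.log",
    "LOG_MAX_BYTES": 50 * 1024 * 1024,
    "LOG_BACKUP_COUNT": 10,
}


def load_logging_config(env=None):
    """Return logging settings, letting environment variables override defaults."""
    env = os.environ if env is None else env
    config = dict(DEFAULTS)
    for key, default in DEFAULTS.items():
        raw = env.get(key)
        if raw is None:
            continue
        # Cast numeric settings to int; leave string settings untouched.
        config[key] = int(raw) if isinstance(default, int) else raw
    config["LOG_LEVEL"] = config["LOG_LEVEL"].upper()
    return config
```

The result can be passed straight to `app.config.update(...)`.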
## Admin API Endpoints
### GET /admin/logs/errors
View recent error logs and error frequency statistics.
```bash
curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/errors
```
Response:
```json
{
  "error_summary": {
    "abc123def456": {
      "count_last_hour": 5,
      "last_seen": 1705320000
    }
  },
  "recent_errors": [...],
  "total_errors_logged": 150
}
```
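The keys of `error_summary` look like short hashes that identify recurring errors. A minimal sketch of such deduplication and frequency tracking follows; the signature inputs (exception type, message, endpoint), the 12-character SHA-256 prefix, and the class name `ErrorFrequencyTracker` are assumptions chosen to match the sample response, not the project's confirmed scheme.

```python
import hashlib
import time
from collections import defaultdict, deque


class ErrorFrequencyTracker:
    """Group identical errors under one signature and count recent occurrences."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self._seen = defaultdict(deque)  # signature -> occurrence timestamps

    @staticmethod
    def signature(exc_type, message, endpoint):
        raw = f"{exc_type}:{message}:{endpoint}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()[:12]

    def record(self, exc_type, message, endpoint, now=None):
        now = time.time() if now is None else now
        key = self.signature(exc_type, message, endpoint)
        self._seen[key].append(now)
        return key

    def summary(self, now=None):
        now = time.time() if now is None else now
        cutoff = now - self.window_seconds
        result = {}
        for key, stamps in self._seen.items():
            # Drop occurrences that fell out of the time window.
            while stamps and stamps[0] < cutoff:
                stamps.popleft()
            if stamps:
                result[key] = {
                    "count_last_hour": len(stamps),
                    "last_seen": int(stamps[-1]),
                }
        return result
```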
### GET /admin/logs/performance
View performance metrics and slow requests.
```bash
curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/performance
```
Response:
```json
{
  "performance_metrics": {
    "transcribe_audio": {
      "avg_ms": 850.5,
      "max_ms": 3200,
      "min_ms": 125,
      "count": 1024
    }
  },
  "slow_requests": [
    {
      "metric": "transcribe_audio",
      "duration_ms": 3200,
      "timestamp": "2024-01-15T10:35:00Z"
    }
  ]
}
```
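A response of this shape can be reproduced offline from the JSON lines in `logs/performance.log`. The sketch below assumes each line carries top-level `metric` and `duration_ms` fields, as in the performance log example, and treats anything above 2000 ms as slow; both the function name `summarize_metrics` and the threshold are illustrative assumptions.

```python
import json


def summarize_metrics(lines, slow_threshold_ms=2000):
    """Aggregate avg/max/min/count per metric and collect slow requests."""
    stats = {}
    slow = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        name, duration = record.get("metric"), record.get("duration_ms")
        if name is None or duration is None:
            continue  # not a performance record
        entry = stats.setdefault(
            name, {"total": 0.0, "max_ms": duration, "min_ms": duration, "count": 0}
        )
        entry["total"] += duration
        entry["count"] += 1
        entry["max_ms"] = max(entry["max_ms"], duration)
        entry["min_ms"] = min(entry["min_ms"], duration)
        if duration > slow_threshold_ms:
            slow.append({
                "metric": name,
                "duration_ms": duration,
                "timestamp": record.get("timestamp"),
            })
    metrics = {
        name: {
            "avg_ms": e["total"] / e["count"],
            "max_ms": e["max_ms"],
            "min_ms": e["min_ms"],
            "count": e["count"],
        }
        for name, e in stats.items()
    }
    return {"performance_metrics": metrics, "slow_requests": slow}
```

Calling `summarize_metrics(open("logs/performance.log"))` would process the file line by line.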
### GET /admin/logs/security
View security events and suspicious activities.
```bash
curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/security
```
Response:
```json
{
  "security_events": [...],
  "event_summary": {
    "rate_limit_exceeded": 25,
    "suspicious_error": 3,
    "high_error_rate": 1
  },
  "total_events": 29
}
```
## Usage Patterns
### 1. Logging Errors with Context
```python
from error_logger import log_exception

try:
    # Some operation
    process_audio(file)
except Exception as e:
    log_exception(
        e,
        message="Failed to process audio",
        user_id=user.id,
        file_size=file.size,
        file_type=file.content_type
    )
```
### 2. Performance Monitoring
```python
from error_logger import log_performance

@log_performance('expensive_operation')
def process_large_file(file):
    # Execution time is logged automatically under 'expensive_operation'
    processed_data = ...  # placeholder for the actual processing step
    return processed_data
```
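For readers curious how such a decorator can work, here is a minimal sketch. It is not the implementation in `error_logger`; the logger name `talk2me.performance` and the `extra_fields` convention are assumptions based on the log examples in this document.

```python
import functools
import logging
import time

perf_logger = logging.getLogger("talk2me.performance")  # assumed logger name


def log_performance(metric_name):
    """Decorator factory that logs the wrapped function's run time in ms."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on exception, so failures are timed too.
                duration_ms = (time.perf_counter() - start) * 1000
                perf_logger.info(
                    "Performance metric: %s", metric_name,
                    extra={"extra_fields": {
                        "metric": metric_name,
                        "duration_ms": round(duration_ms, 2),
                        "function": func.__name__,
                        "module": func.__module__,
                    }},
                )
        return wrapper
    return decorator
```

Using `time.perf_counter()` rather than `time.time()` avoids skew from wall-clock adjustments, and `functools.wraps` keeps the decorated function's name intact for tracebacks.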
### 3. Security Event Logging
```python
app.error_logger.log_security(
    'unauthorized_access',
    severity='warning',
    ip=request.remote_addr,
    attempted_resource='/admin',
    user_agent=request.headers.get('User-Agent')
)
```
### 4. Request Context
```python
from error_logger import log_context

with log_context(user_id=user.id, feature='translation'):
    # All logs within this context will include user_id and feature
    translate_text(text)
```
## Log Analysis
### Finding Specific Errors
```bash
# Find all authentication errors
jq 'select(.exception.type == "AuthenticationError")' logs/errors.log
# Find errors from a specific IP
jq 'select(.ip == "192.168.1.100")' logs/errors.log
# Find errors logged during the previous clock hour (requires GNU date)
grep "$(date -u -d '1 hour ago' +%Y-%m-%dT%H)" logs/errors.log | jq .
```
### Performance Analysis
```bash
# Find slow requests (>2000ms); fields sit at the top level of each record
jq 'select(.duration_ms > 2000)' logs/performance.log
# Calculate the average duration for one metric
jq 'select(.metric == "transcribe_audio") | .duration_ms' logs/performance.log | awk '{sum+=$1; count++} END {if (count) print sum/count}'
```
### Security Monitoring
```bash
# Count security events by type
jq -r '.event' logs/security.log | sort | uniq -c
# Find all blocked IPs
jq -r 'select(.blocked == true) | .ip' logs/security.log | sort -u
```
## Log Rotation
Logs are automatically rotated based on size or time:
- **Application/Error logs**: Rotate at 50MB, keep 10 backups
- **Access logs**: Daily rotation, keep 30 days
- **Performance logs**: Hourly rotation, keep 7 days
- **Security logs**: Rotate at 50MB, keep 10 backups
Size-rotated logs are named with numeric suffixes:
- `talk2me.log` (current)
- `talk2me.log.1` (most recent backup)
- `talk2me.log.2` (older backup)
- etc.
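The rotation policy above maps naturally onto the standard library's `RotatingFileHandler` (size-based) and `TimedRotatingFileHandler` (time-based). The sketch below is one possible wiring, assuming those two classes are used; `build_handlers` and `describe` are hypothetical helper names.

```python
import os
import tempfile
from logging.handlers import RotatingFileHandler, TimedRotatingFileHandler


def build_handlers(log_dir):
    """Create one handler per stream, following the rotation policy above."""
    os.makedirs(log_dir, exist_ok=True)

    def path(name):
        return os.path.join(log_dir, name)

    size_policy = {"maxBytes": 50 * 1024 * 1024, "backupCount": 10}
    return {
        "app": RotatingFileHandler(path("talk2me.log"), **size_policy),
        "errors": RotatingFileHandler(path("errors.log"), **size_policy),
        "security": RotatingFileHandler(path("security.log"), **size_policy),
        # Daily rotation at midnight; 30 backups cover 30 days.
        "access": TimedRotatingFileHandler(
            path("access.log"), when="midnight", backupCount=30),
        # Hourly rotation; 24 * 7 backups cover 7 days.
        "performance": TimedRotatingFileHandler(
            path("performance.log"), when="H", backupCount=24 * 7),
    }


def describe(handlers):
    """Return {stream: (handler class name, backupCount)} as a sanity check."""
    return {name: (type(h).__name__, h.backupCount) for name, h in handlers.items()}


def demo():
    """Build the handlers in a temporary directory and summarize them."""
    with tempfile.TemporaryDirectory() as tmp:
        handlers = build_handlers(tmp)
        try:
            return describe(handlers)
        finally:
            for handler in handlers.values():
                handler.close()
```

With this wiring, time-rotated files get a date-based suffix (for example `access.log.2024-01-15`) instead of a numeric one.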
## Best Practices
### 1. Structured Logging
Always include relevant context:
```python
logger.info("User action completed", extra={
    'extra_fields': {
        'user_id': user.id,
        'action': 'upload_audio',
        'file_size': file.size,
        'duration_ms': processing_time
    }
})
```
### 2. Error Handling
Log errors at appropriate levels:
```python
try:
    result = risky_operation()
except ValidationError as e:
    logger.warning(f"Validation failed: {e}")  # Expected errors
except Exception as e:
    logger.error(f"Unexpected error: {e}", exc_info=True)  # Unexpected errors
```
### 3. Performance Tracking
Track key operations:
```python
import time

start = time.time()
result = expensive_operation()
duration = (time.time() - start) * 1000  # milliseconds
app.error_logger.log_performance(
    'expensive_operation',
    value=duration,
    input_size=len(data),
    output_size=len(result)
)
```
### 4. Security Awareness
Log security-relevant events:
```python
if failed_attempts > 3:
    app.error_logger.log_security(
        'multiple_failed_attempts',
        severity='warning',
        ip=request.remote_addr,
        attempts=failed_attempts
    )
```
## Monitoring Integration
### Prometheus Metrics
Export log metrics for Prometheus:
```python
@app.route('/metrics')
def prometheus_metrics():
    error_summary = app.error_logger.get_error_summary()
    # Format as Prometheus metrics
    return format_prometheus_metrics(error_summary)
```
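`format_prometheus_metrics` is not defined in this document. A minimal sketch of such a helper is shown below, assuming `get_error_summary()` returns the same `error_summary` mapping as the `/admin/logs/errors` endpoint; the metric name `talk2me_errors_last_hour` and the `signature` label are made up for illustration.

```python
def format_prometheus_metrics(error_summary):
    """Render an error summary in the Prometheus text exposition format."""
    lines = [
        "# HELP talk2me_errors_last_hour Errors seen in the last hour, by signature.",
        "# TYPE talk2me_errors_last_hour gauge",
    ]
    # Sort for stable output so successive scrapes are easy to compare.
    for signature, stats in sorted(error_summary.items()):
        count = stats["count_last_hour"]
        lines.append(f'talk2me_errors_last_hour{{signature="{signature}"}} {count}')
    return "\n".join(lines) + "\n"
```

A gauge fits here because the hourly count can go down as old errors leave the window.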
### ELK Stack
Ship logs to Elasticsearch:
```yaml
filebeat.inputs:
  - type: log
    paths:
      - /app/logs/*.log
    json.keys_under_root: true
    json.add_error_key: true
```
### CloudWatch
For AWS deployments:
```python
# Requires: pip install boto3 watchtower
import logging

import watchtower

logger = logging.getLogger('talk2me')
cloudwatch_handler = watchtower.CloudWatchLogHandler()
logger.addHandler(cloudwatch_handler)
```
## Troubleshooting
### Common Issues
#### 1. Logs Not Being Written
Check permissions:
```bash
ls -la logs/
# Should show write permissions for app user
```
Create logs directory:
```bash
mkdir -p logs
chmod 755 logs
```
#### 2. Disk Space Issues
Monitor log sizes:
```bash
du -sh logs/*
```
Force rotation:
```bash
# Manually rotate logs
mv logs/talk2me.log logs/talk2me.log.backup
# Restart the app so it reopens logs/talk2me.log; a running process may keep writing to the moved file
```
#### 3. Performance Impact
If logging impacts performance:
- Raise `LOG_LEVEL` to `WARNING` or `ERROR` so fewer records are written
- Reduce backup count
- Use asynchronous logging (future enhancement)
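Asynchronous logging is listed only as a future enhancement, but the standard library already supports it through `QueueHandler` and `QueueListener`. The sketch below shows the general pattern under that assumption; `make_async` is a hypothetical helper, not part of Talk2Me.

```python
import logging
import logging.handlers
import queue


def make_async(logger, target_handler):
    """Route a logger's records through a queue to a background thread.

    The request thread only enqueues the record; the listener thread
    performs the slow file or network write.
    """
    log_queue = queue.Queue(-1)  # unbounded queue
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(
        log_queue, target_handler, respect_handler_level=True
    )
    listener.start()
    return listener
```

Call `listener.stop()` at shutdown so queued records are flushed before the process exits.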
## Security Considerations
1. **Log Sanitization**: Sensitive data is automatically masked
2. **Access Control**: Admin endpoints require authentication
3. **Log Retention**: Old logs are automatically deleted
4. **Encryption**: Consider encrypting logs at rest in production
5. **Audit Trail**: All log access is itself logged
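This document does not describe how sensitive data is masked. As a rough illustration of the idea, the sketch below redacts values by key name and scrubs bearer tokens from strings; the key list, the `***` placeholder, and the function name `sanitize` are assumptions, not the project's actual rules.

```python
import re

# Assumed set of field names whose values should never reach the logs.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "secret"}
_BEARER = re.compile(r"(?i)(bearer\s+)[A-Za-z0-9._~+/=-]+")


def sanitize(value, key=None):
    """Return a copy of `value` with sensitive data replaced by '***'."""
    if key is not None and str(key).lower() in SENSITIVE_KEYS:
        return "***"
    if isinstance(value, dict):
        return {k: sanitize(v, k) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize(v) for v in value]
    if isinstance(value, str):
        # Mask bearer tokens that appear inside free-form strings.
        return _BEARER.sub(r"\1***", value)
    return value
```

Applying `sanitize` to the context fields just before they are serialized keeps the masking in one place.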
## Future Enhancements
1. **Centralized Logging**: Ship logs to centralized service
2. **Real-time Alerts**: Trigger alerts on error patterns
3. **Log Analytics**: Built-in log analysis dashboard
4. **Correlation IDs**: Track requests across microservices
5. **Async Logging**: Reduce performance impact