# Error Logging Documentation

This document describes Talk2Me's structured error logging system, which is designed for debugging production issues.

## Overview

Talk2Me implements a structured logging system that provides:
- JSON-formatted structured logs for easy parsing
- Multiple log streams (app, errors, access, security, performance)
- Automatic log rotation to prevent disk space issues
- Request tracing with unique IDs
- Performance metrics collection
- Security event tracking
- Error deduplication and frequency tracking
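
As a rough illustration of the JSON-formatted output listed above, a formatter along these lines would emit one JSON object per record. This is a minimal sketch using only the standard library. The class name `JsonFormatter` and the choice to merge `extra_fields` into the top level of the record are assumptions, not necessarily how Talk2Me implements it.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line (sketch)."""

    def format(self, record: logging.LogRecord) -> str:
        stamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": stamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "thread": record.threadName,
            "process": record.process,
        }
        # Assumption: context passed as extra={'extra_fields': {...}} is merged
        # into the top level of the JSON object.
        payload.update(getattr(record, "extra_fields", {}))
        # exc_info can be (None, None, None) when no exception is active.
        if record.exc_info and record.exc_info[0] is not None:
            payload["exception"] = {
                "type": record.exc_info[0].__name__,
                "message": str(record.exc_info[1]),
                "traceback": self.formatException(record.exc_info).splitlines(),
            }
        # default=str keeps non-serializable context values from raising.
        return json.dumps(payload, default=str)
```

Attaching it is a matter of calling `handler.setFormatter(JsonFormatter())` on whichever handler writes the log stream.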

## Log Types

### 1. Application Logs (`logs/talk2me.log`)
General application logs including info, warnings, and debug messages.

```json
{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "INFO",
  "logger": "talk2me",
  "message": "Whisper model loaded successfully",
  "app": "talk2me",
  "environment": "production",
  "hostname": "server-1",
  "thread": "MainThread",
  "process": 12345
}
```

### 2. Error Logs (`logs/errors.log`)
Dedicated error logging with full exception details and stack traces.

```json
{
  "timestamp": "2024-01-15T10:31:00.456Z",
  "level": "ERROR",
  "logger": "talk2me.errors",
  "message": "Error in transcribe: File too large",
  "exception": {
    "type": "ValueError",
    "message": "Audio file exceeds maximum size",
    "traceback": ["...full stack trace..."]
  },
  "request_id": "1234567890-abcdef",
  "endpoint": "transcribe",
  "method": "POST",
  "path": "/transcribe",
  "ip": "192.168.1.100"
}
```

### 3. Access Logs (`logs/access.log`)
HTTP request/response logging for traffic analysis.

```json
{
  "timestamp": "2024-01-15T10:32:00.789Z",
  "level": "INFO",
  "message": "request_complete",
  "request_id": "1234567890-abcdef",
  "method": "POST",
  "path": "/transcribe",
  "status": 200,
  "duration_ms": 1250,
  "content_length": 4096,
  "ip": "192.168.1.100",
  "user_agent": "Mozilla/5.0..."
}
```

### 4. Security Logs (`logs/security.log`)
Security-related events and suspicious activities.

```json
{
  "timestamp": "2024-01-15T10:33:00.123Z",
  "level": "WARNING",
  "message": "Security event: rate_limit_exceeded",
  "event": "rate_limit_exceeded",
  "severity": "warning",
  "ip": "192.168.1.100",
  "endpoint": "/transcribe",
  "attempts": 15,
  "blocked": true
}
```

### 5. Performance Logs (`logs/performance.log`)
Performance metrics and slow request tracking.

```json
{
  "timestamp": "2024-01-15T10:34:00.456Z",
  "level": "INFO",
  "message": "Performance metric: transcribe_audio",
  "metric": "transcribe_audio",
  "duration_ms": 2500,
  "function": "transcribe",
  "module": "app",
  "request_id": "1234567890-abcdef"
}
```

## Configuration

### Environment Variables

```bash
# Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
export LOG_LEVEL=INFO

# Log file paths
export LOG_FILE=logs/talk2me.log
export ERROR_LOG_FILE=logs/errors.log

# Log rotation settings
export LOG_MAX_BYTES=52428800      # 50MB
export LOG_BACKUP_COUNT=10         # Keep 10 backup files

# Environment
export FLASK_ENV=production
```
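
One way these variables could be turned into application settings is sketched below. The helper name `load_logging_config` and the fallback values (copied from the example above) are assumptions for illustration.

```python
import os

# Defaults mirror the example values shown above (assumed, not authoritative).
DEFAULTS = {
    "LOG_LEVEL": "INFO",
    "LOG_FILE": "logs/talk2me.log",
    "ERROR_LOG_FILE": "logs/errors.log",
    "LOG_MAX_BYTES": 50 * 1024 * 1024,
    "LOG_BACKUP_COUNT": 10,
}


def load_logging_config(env=os.environ):
    """Build a config dict from environment variables, falling back to defaults."""
    config = {}
    for key, default in DEFAULTS.items():
        raw = env.get(key)
        if raw is None:
            config[key] = default
        elif isinstance(default, int):
            config[key] = int(raw)  # raises ValueError on a non-numeric value
        else:
            config[key] = raw
    config["LOG_LEVEL"] = config["LOG_LEVEL"].upper()
    return config
```

The resulting dict can be passed straight to `app.config.update(...)`.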

### Flask Configuration

```python
app.config.update({
    'LOG_LEVEL': 'INFO',
    'LOG_FILE': 'logs/talk2me.log',
    'ERROR_LOG_FILE': 'logs/errors.log',
    'LOG_MAX_BYTES': 50 * 1024 * 1024,
    'LOG_BACKUP_COUNT': 10
})
```

## Admin API Endpoints

### GET /admin/logs/errors
View recent error logs and error frequency statistics.

```bash
curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/errors
```

Response:
```json
{
  "error_summary": {
    "abc123def456": {
      "count_last_hour": 5,
      "last_seen": 1705320000
    }
  },
  "recent_errors": [...],
  "total_errors_logged": 150
}
```
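
The `error_summary` keys look like short fingerprints that group identical errors. A minimal sketch of how such deduplication and hourly counting could work follows. The class `ErrorFrequencyTracker`, the fingerprint inputs (exception type, message, endpoint), and the 12-character hash length are all assumptions.

```python
import hashlib
import time
from collections import defaultdict, deque


class ErrorFrequencyTracker:
    """Group errors by a short fingerprint and count occurrences per window."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self._seen = defaultdict(deque)  # fingerprint -> timestamps, oldest first

    @staticmethod
    def fingerprint(exc, endpoint=""):
        # Assumed grouping key: exception type, message, and endpoint.
        key = f"{type(exc).__name__}:{exc}:{endpoint}"
        return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]

    def record(self, exc, endpoint="", now=None):
        now = time.time() if now is None else now
        fp = self.fingerprint(exc, endpoint)
        self._seen[fp].append(now)
        return fp

    def summary(self, now=None):
        now = time.time() if now is None else now
        cutoff = now - self.window_seconds
        result = {}
        for fp, stamps in self._seen.items():
            # Drop timestamps that have aged out of the window.
            while stamps and stamps[0] < cutoff:
                stamps.popleft()
            if stamps:
                result[fp] = {
                    "count_last_hour": len(stamps),
                    "last_seen": int(stamps[-1]),
                }
        return result
```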

### GET /admin/logs/performance
View performance metrics and slow requests.

```bash
curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/performance
```

Response:
```json
{
  "performance_metrics": {
    "transcribe_audio": {
      "avg_ms": 850.5,
      "max_ms": 3200,
      "min_ms": 125,
      "count": 1024
    }
  },
  "slow_requests": [
    {
      "metric": "transcribe_audio",
      "duration_ms": 3200,
      "timestamp": "2024-01-15T10:35:00Z"
    }
  ]
}
```
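
The aggregates in this response can be derived from raw duration samples. The helper below is a hypothetical sketch, not the endpoint's actual code. The name `summarize_metrics` is an assumption, and the 2000 ms slow-request threshold is borrowed from the analysis examples in this document.

```python
from statistics import mean


def summarize_metrics(samples, slow_threshold_ms=2000):
    """Aggregate (metric_name, duration_ms, iso_timestamp) tuples.

    Returns a dict shaped like the /admin/logs/performance response.
    """
    by_metric = {}
    slow = []
    for metric, duration_ms, timestamp in samples:
        by_metric.setdefault(metric, []).append(duration_ms)
        if duration_ms > slow_threshold_ms:
            slow.append(
                {"metric": metric, "duration_ms": duration_ms, "timestamp": timestamp}
            )
    metrics = {
        name: {
            "avg_ms": round(mean(values), 1),
            "max_ms": max(values),
            "min_ms": min(values),
            "count": len(values),
        }
        for name, values in by_metric.items()
    }
    return {"performance_metrics": metrics, "slow_requests": slow}
```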

### GET /admin/logs/security
View security events and suspicious activities.

```bash
curl -H "X-Admin-Token: your-token" http://localhost:5005/admin/logs/security
```

Response:
```json
{
  "security_events": [...],
  "event_summary": {
    "rate_limit_exceeded": 25,
    "suspicious_error": 3,
    "high_error_rate": 1
  },
  "total_events": 29
}
```

## Usage Patterns

### 1. Logging Errors with Context

```python
from error_logger import log_exception

try:
    # Some operation
    process_audio(file)
except Exception as e:
    log_exception(
        e,
        message="Failed to process audio",
        user_id=user.id,
        file_size=file.size,
        file_type=file.content_type
    )
```

### 2. Performance Monitoring

```python
from error_logger import log_performance

@log_performance('expensive_operation')
def process_large_file(file):
    # Execution time is logged automatically on each call
    processed_data = file.read()  # placeholder for the real processing
    return processed_data
```
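
For readers curious how a timing decorator of this kind can work, here is a minimal standard-library sketch. It is not the `error_logger` implementation. The logger name `talk2me.performance` and the field layout are assumptions based on the sample performance record.

```python
import functools
import logging
import time

# Assumed logger name for the performance stream.
perf_logger = logging.getLogger("talk2me.performance")


def log_performance(metric):
    """Decorator that logs how long the wrapped function takes, in milliseconds."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call returned or raised.
                duration_ms = (time.perf_counter() - start) * 1000
                perf_logger.info(
                    "Performance metric: %s",
                    metric,
                    extra={"extra_fields": {
                        "metric": metric,
                        "duration_ms": round(duration_ms, 2),
                        "function": func.__name__,
                        "module": func.__module__,
                    }},
                )

        return wrapper

    return decorator
```

Using `try`/`finally` means the duration is recorded even when the wrapped function raises, which is often when the timing is most interesting.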

### 3. Security Event Logging

```python
from flask import request  # `app` is the Flask application object

app.error_logger.log_security(
    'unauthorized_access',
    severity='warning',
    ip=request.remote_addr,
    attempted_resource='/admin',
    user_agent=request.headers.get('User-Agent')
)
```

### 4. Request Context

```python
from error_logger import log_context

with log_context(user_id=user.id, feature='translation'):
    # All logs within this context will include user_id and feature
    translate_text(text)
```
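
A context manager like this can be built on `contextvars` plus a logging filter. The sketch below shows the general technique and is an assumption about the mechanism, not the actual `error_logger` code. `ContextFilter` is a hypothetical name.

```python
import contextlib
import contextvars
import logging

# Holds the fields active for the current thread or async task.
_log_context = contextvars.ContextVar("log_context", default={})


@contextlib.contextmanager
def log_context(**fields):
    """Attach fields to every record logged inside the with-block."""
    token = _log_context.set({**_log_context.get(), **fields})
    try:
        yield
    finally:
        _log_context.reset(token)  # restore the outer context on exit


class ContextFilter(logging.Filter):
    """Copy the active context onto each record as `extra_fields`."""

    def filter(self, record):
        # Fields passed explicitly on the log call win over context fields.
        explicit = getattr(record, "extra_fields", {})
        record.extra_fields = {**_log_context.get(), **explicit}
        return True
```

The filter has to be added to the logger or handler (`logger.addFilter(ContextFilter())`) for the context to reach the formatter. Nested blocks inherit outer fields and can override them.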

## Log Analysis

### Finding Specific Errors

```bash
# Find all authentication errors (the exception class is stored under exception.type)
jq 'select(.exception.type == "AuthenticationError")' logs/errors.log

# Find errors from a specific IP
jq 'select(.ip == "192.168.1.100")' logs/errors.log

# Find errors from the previous clock hour (GNU date syntax; matches the UTC hour prefix)
grep "$(date -u -d '1 hour ago' +%Y-%m-%dT%H)" logs/errors.log | jq .
```
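
Matching on an hour prefix selects a clock hour rather than a rolling 60 minutes. Where a true rolling window matters, a small Python filter can parse the timestamps instead. This is a sketch that assumes one JSON object per line with an ISO 8601 `timestamp` field, as in the samples above. `recent_errors` is a hypothetical helper name.

```python
import json
from datetime import datetime, timedelta, timezone


def recent_errors(lines, now=None, window=timedelta(hours=1)):
    """Yield parsed records whose timestamp falls within `window` before `now`."""
    now = now or datetime.now(timezone.utc)
    for line in lines:
        try:
            record = json.loads(line)
            # fromisoformat() before Python 3.11 rejects a trailing "Z".
            stamp = datetime.fromisoformat(record["timestamp"].replace("Z", "+00:00"))
        except (ValueError, KeyError, TypeError, AttributeError):
            continue  # skip blank, truncated, or non-JSON lines
        if stamp.tzinfo is None:
            stamp = stamp.replace(tzinfo=timezone.utc)  # assume UTC when unmarked
        if now - window <= stamp <= now:
            yield record
```

It accepts any iterable of lines, so an open file handle for `logs/errors.log` can be passed directly.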

### Performance Analysis

```bash
# Find slow requests (>2000ms); field names follow the sample performance record
jq 'select(.duration_ms > 2000)' logs/performance.log

# Calculate average response time for a metric (prints nothing if there are no matches)
jq 'select(.metric == "transcribe_audio") | .duration_ms' logs/performance.log | awk '{sum+=$1; count++} END {if (count) print sum/count}'
```

### Security Monitoring

```bash
# Count security events by type; field names follow the sample security record
jq -r '.event' logs/security.log | sort | uniq -c

# Find all blocked IPs
jq -r 'select(.blocked == true) | .ip' logs/security.log | sort -u
```

## Log Rotation

Logs are automatically rotated based on size or time:

- **Application/Error logs**: Rotate at 50MB, keep 10 backups
- **Access logs**: Daily rotation, keep 30 days
- **Performance logs**: Hourly rotation, keep 7 days
- **Security logs**: Rotate at 50MB, keep 10 backups

Size-rotated logs are named with numeric suffixes (time-rotated logs typically carry a date suffix instead):
- `talk2me.log` (current)
- `talk2me.log.1` (most recent backup)
- `talk2me.log.2` (older backup)
- etc.
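
The rotation policy above maps naturally onto Python's standard `logging.handlers` classes. The following is a sketch under that assumption. The function name `build_handlers` and the stream keys are hypothetical, and the file names for the access and performance logs follow the paths given under Log Types.

```python
import logging
import os
from logging.handlers import RotatingFileHandler, TimedRotatingFileHandler


def build_handlers(log_dir="logs", max_bytes=50 * 1024 * 1024, backup_count=10):
    """Return one handler per log stream, using the rotation policy listed above."""
    os.makedirs(log_dir, exist_ok=True)

    def size_rotated(name):
        return RotatingFileHandler(
            os.path.join(log_dir, name), maxBytes=max_bytes, backupCount=backup_count
        )

    return {
        "app": size_rotated("talk2me.log"),
        "errors": size_rotated("errors.log"),
        "security": size_rotated("security.log"),
        # Daily rotation at midnight UTC, keep 30 files (30 days).
        "access": TimedRotatingFileHandler(
            os.path.join(log_dir, "access.log"),
            when="midnight", backupCount=30, utc=True,
        ),
        # Hourly rotation, keep 7 days' worth of files (7 * 24 = 168).
        "performance": TimedRotatingFileHandler(
            os.path.join(log_dir, "performance.log"),
            when="H", interval=1, backupCount=7 * 24, utc=True,
        ),
    }
```

Each handler still needs a formatter and has to be attached to the matching logger with `addHandler`.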

## Best Practices

### 1. Structured Logging

Always include relevant context:
```python
logger.info("User action completed", extra={
    'extra_fields': {
        'user_id': user.id,
        'action': 'upload_audio',
        'file_size': file.size,
        'duration_ms': processing_time
    }
})
```

### 2. Error Handling

Log errors at appropriate levels:
```python
try:
    result = risky_operation()
except ValidationError as e:
    logger.warning(f"Validation failed: {e}")  # Expected errors
except Exception as e:
    logger.error(f"Unexpected error: {e}", exc_info=True)  # Unexpected errors
```

### 3. Performance Tracking

Track key operations:
```python
import time

start = time.perf_counter()  # monotonic clock, unaffected by system time changes
result = expensive_operation(data)
duration = (time.perf_counter() - start) * 1000  # milliseconds

app.error_logger.log_performance(
    'expensive_operation',
    value=duration,
    input_size=len(data),
    output_size=len(result)
)
```

### 4. Security Awareness

Log security-relevant events:
```python
if failed_attempts > 3:
    app.error_logger.log_security(
        'multiple_failed_attempts',
        severity='warning',
        ip=request.remote_addr,
        attempts=failed_attempts
    )
```

## Monitoring Integration

### Prometheus Metrics

Export log metrics for Prometheus:
```python
@app.route('/metrics')
def prometheus_metrics():
    error_summary = app.error_logger.get_error_summary()
    # Format as Prometheus metrics
    return format_prometheus_metrics(error_summary)
```
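
`format_prometheus_metrics` is not defined in this document. One possible shape for it is sketched below, assuming the summary has the same structure as `error_summary` in the `/admin/logs/errors` response. The metric name `talk2me_errors_last_hour` and the `error` label are hypothetical.

```python
def format_prometheus_metrics(error_summary):
    """Render an error summary in the Prometheus text exposition format."""
    lines = [
        "# HELP talk2me_errors_last_hour Errors seen in the last hour, by fingerprint.",
        "# TYPE talk2me_errors_last_hour gauge",
    ]
    for fingerprint, stats in sorted(error_summary.items()):
        # Escape backslash, double quote, and newline in label values.
        label = (
            str(fingerprint)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        lines.append(
            f'talk2me_errors_last_hour{{error="{label}"}} {stats["count_last_hour"]}'
        )
    return "\n".join(lines) + "\n"
```

A gauge is used rather than a counter because the hourly count can go down as old errors age out of the window.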

### ELK Stack

Ship logs to Elasticsearch:
```yaml
filebeat.inputs:
- type: log
  paths:
    - /app/logs/*.log
  json.keys_under_root: true
  json.add_error_key: true
```

### CloudWatch

For AWS deployments:
```python
# Requires: pip install boto3 watchtower
import logging

import watchtower

logger = logging.getLogger('talk2me')
cloudwatch_handler = watchtower.CloudWatchLogHandler()
logger.addHandler(cloudwatch_handler)
```

## Troubleshooting

### Common Issues

#### 1. Logs Not Being Written

Check permissions:
```bash
ls -la logs/
# Should show write permissions for app user
```

Create logs directory:
```bash
mkdir -p logs
chmod 755 logs
```

#### 2. Disk Space Issues

Monitor log sizes:
```bash
du -sh logs/*
```

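Force rotation. A running process keeps writing to a renamed file until it reopens the log, so copy and truncate rather than move (this assumes the handlers open the file in append mode, the Python default):
```bash
# Manually rotate logs without restarting the app
cp logs/talk2me.log logs/talk2me.log.backup
: > logs/talk2me.log   # truncate in place; lines written between the two commands are lost
```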

#### 3. Performance Impact

If logging impacts performance:
- Raise `LOG_LEVEL` to `WARNING` or `ERROR` so fewer records are written
- Reduce backup count
- Use asynchronous logging (future enhancement)
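
Asynchronous logging is listed as a future enhancement, so the following is only a sketch of one standard-library approach: `QueueHandler` and `QueueListener` move file I/O off the request thread. The helper name `enable_async_logging` is hypothetical.

```python
import logging
import logging.handlers
import queue


def enable_async_logging(logger, *handlers):
    """Route records through a queue so the given handlers run on a background thread."""
    log_queue = queue.Queue(-1)  # unbounded queue
    for handler in handlers:
        logger.removeHandler(handler)  # avoid writing each record twice
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(
        log_queue, *handlers, respect_handler_level=True
    )
    listener.start()
    return listener  # call listener.stop() at shutdown to flush the queue
```

The trade-off is that records still in the queue are lost if the process is killed before `listener.stop()` runs.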

## Security Considerations

1. **Log Sanitization**: Sensitive data is automatically masked
2. **Access Control**: Admin endpoints require authentication
3. **Log Retention**: Old logs are automatically deleted
4. **Encryption**: Consider encrypting logs at rest in production
5. **Audit Trail**: All log access is itself logged
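
To make the log sanitization point concrete, masking can be done with a logging filter. This is a hedged sketch of the general technique, not Talk2Me's actual masking rules. The key list, the `***` placeholder, and the names `mask_fields` and `SanitizingFilter` are assumptions.

```python
import logging
import re

# Assumed list of sensitive field names; adjust to the application's data.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "secret"}
_PAIR_RE = re.compile(r"(?i)\b(password|token|api_key|secret)=([^\s&]+)")


def mask_fields(fields):
    """Return a copy of `fields` with sensitive values replaced by '***'."""
    return {
        key: ("***" if str(key).lower() in SENSITIVE_KEYS else value)
        for key, value in fields.items()
    }


class SanitizingFilter(logging.Filter):
    """Mask key=value secrets in the message and sensitive structured fields."""

    def filter(self, record):
        # Only the message template is scanned; values passed as %-style
        # arguments are not covered by this sketch.
        if isinstance(record.msg, str):
            record.msg = _PAIR_RE.sub(r"\1=***", record.msg)
        if hasattr(record, "extra_fields"):
            record.extra_fields = mask_fields(record.extra_fields)
        return True
```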

## Future Enhancements

1. **Centralized Logging**: Ship logs to centralized service
2. **Real-time Alerts**: Trigger alerts on error patterns
3. **Log Analytics**: Built-in log analysis dashboard
4. **Correlation IDs**: Track requests across microservices
5. **Async Logging**: Reduce performance impact