# Rate Limiting Documentation

This document describes how Talk2Me rate-limits requests to protect against denial-of-service (DoS) attacks and resource exhaustion.

## Overview

Talk2Me implements a comprehensive rate limiting system with:
- Token bucket algorithm with sliding window
- Per-endpoint configurable limits
- IP-based blocking (temporary and permanent)
- Global request limits
- Concurrent request throttling
- Request size validation

## Rate Limits by Endpoint

### Transcription (`/transcribe`)
- **Per Minute**: 10 requests
- **Per Hour**: 100 requests
- **Burst Size**: 3 requests
- **Max Request Size**: 10MB
- **Token Refresh**: 1 token per 6 seconds

### Translation (`/translate`)
- **Per Minute**: 20 requests
- **Per Hour**: 300 requests
- **Burst Size**: 5 requests
- **Max Request Size**: 100KB
- **Token Refresh**: 1 token per 3 seconds

### Streaming Translation (`/translate/stream`)
- **Per Minute**: 10 requests
- **Per Hour**: 150 requests
- **Burst Size**: 3 requests
- **Max Request Size**: 100KB
- **Token Refresh**: 1 token per 6 seconds

### Text-to-Speech (`/speak`)
- **Per Minute**: 15 requests
- **Per Hour**: 200 requests
- **Burst Size**: 3 requests
- **Max Request Size**: 50KB
- **Token Refresh**: 1 token per 4 seconds

### API Endpoints
- Push notifications, error logging: Various limits (see code)
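
The token refresh intervals listed for the four endpoints above follow from their per-minute limits: one token every `60 / requests_per_minute` seconds. A quick check of that arithmetic (the helper name `refresh_interval_seconds` is illustrative and not taken from the Talk2Me code):

```python
def refresh_interval_seconds(requests_per_minute: int) -> float:
    """Seconds between token refills that sustain the given per-minute limit."""
    return 60 / requests_per_minute


# Per-minute limits as documented in this section.
per_minute = {
    "/transcribe": 10,
    "/translate": 20,
    "/translate/stream": 10,
    "/speak": 15,
}

for endpoint, limit in per_minute.items():
    print(f"{endpoint}: 1 token per {refresh_interval_seconds(limit):g} s")
```

The same relationship gives the refresh rate in tokens per second as `requests_per_minute / 60`, for example 10 / 60 ≈ 0.167 for `/transcribe`.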

## Global Limits

- **Total Requests Per Minute**: 1,000 (across all endpoints)
- **Total Requests Per Hour**: 10,000
- **Concurrent Requests**: 50 maximum
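
One common way to enforce a cap on concurrent requests is a non-blocking semaphore. The following is a minimal sketch under that assumption, using the documented limit of 50; the class name `ConcurrencyGate` is hypothetical and this is not necessarily how Talk2Me implements the limit:

```python
import threading


class ConcurrencyGate:
    """Caps the number of requests being processed at the same time."""

    def __init__(self, max_concurrent: int = 50):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_enter(self) -> bool:
        # Non-blocking: returns False immediately when all slots are taken.
        return self._slots.acquire(blocking=False)

    def leave(self) -> None:
        # Call exactly once for every successful try_enter().
        self._slots.release()
```

A request handler would call `try_enter()` first, reject the request if it returns `False`, and call `leave()` in a `finally` block once processing ends.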

## Rate Limiting Headers

Successful responses include:
```
X-RateLimit-Limit: 20
X-RateLimit-Remaining: 15
X-RateLimit-Reset: 1234567890
```

Rate limited responses (429) include:
```
X-RateLimit-Limit: 20
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1234567890
Retry-After: 60
```
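
A client can use these headers to decide how long to pause before its next request. The sketch below assumes that `X-RateLimit-Reset` is a Unix timestamp in seconds and that `Retry-After` is a number of seconds; the function name `seconds_to_wait` is illustrative only:

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long a client should pause before its next request."""
    now = time.time() if now is None else now
    # Retry-After is sent with 429 responses and takes priority.
    if "Retry-After" in headers:
        return float(headers["Retry-After"])
    # Requests remain in the current window: no need to wait.
    if int(headers.get("X-RateLimit-Remaining", "1")) > 0:
        return 0.0
    # Out of requests: wait until the reset time (assumed Unix timestamp).
    reset = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset - now)
```

Header lookups on a plain `dict` are case-sensitive, so pass a case-insensitive mapping if your HTTP library provides one.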

## Client Identification

Clients are identified by:
- IP address (including X-Forwarded-For support)
- User-Agent string
- Combined hash for uniqueness
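
The identification scheme above can be sketched as follows. The hash function (SHA-256), the truncation to 16 hex characters, and the choice of the first `X-Forwarded-For` entry are all assumptions made for illustration; the actual implementation may differ:

```python
import hashlib


def client_id(remote_addr: str, user_agent: str, forwarded_for: str = "") -> str:
    """Derive a stable identifier from the client IP and User-Agent."""
    # X-Forwarded-For may hold a comma-separated chain; the first entry is
    # taken as the original client (only safe behind a trusted proxy).
    ip = forwarded_for.split(",")[0].strip() or remote_addr
    digest = hashlib.sha256(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()
    return digest[:16]
```

Because `X-Forwarded-For` is set by the sender, it should only be trusted when the service sits behind a proxy that overwrites it.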

## Automatic Blocking

IPs are temporarily blocked for 1 hour if:
- They exceed 100 requests per minute
- They repeatedly hit rate limits
- They exhibit suspicious patterns
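
As an illustration of the first rule only (more than 100 requests per minute leads to a 1-hour block), here is a minimal sketch. The class name `AutoBlocker` and its methods are hypothetical; the other two criteria are not specified in this document and are left out:

```python
import time
from typing import Dict, Optional

BLOCK_SECONDS = 3600    # temporary block duration (1 hour)
MAX_PER_MINUTE = 100    # per-IP threshold that triggers a block


class AutoBlocker:
    def __init__(self) -> None:
        self._blocked_until: Dict[str, float] = {}

    def record(self, ip: str, requests_last_minute: int,
               now: Optional[float] = None) -> None:
        """Block the IP for one hour if it exceeded the per-minute threshold."""
        now = time.time() if now is None else now
        if requests_last_minute > MAX_PER_MINUTE:
            self._blocked_until[ip] = now + BLOCK_SECONDS

    def is_blocked(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        expiry = self._blocked_until.get(ip)
        if expiry is None:
            return False
        if now >= expiry:
            # The block has expired: forget it so the table does not grow forever.
            del self._blocked_until[ip]
            return False
        return True
```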

## Configuration

### Environment Variables

```bash
# No direct environment variables for rate limiting
# Configured in code - can be extended to use env vars
```

### Programmatic Configuration

Rate limits can be adjusted in `rate_limiter.py`:

```python
self.endpoint_limits = {
    '/transcribe': {
        'requests_per_minute': 10,
        'requests_per_hour': 100,
        'burst_size': 3,                       # bucket capacity
        'token_refresh_rate': 0.167,           # tokens per second (1 token per 6 seconds)
        'max_request_size': 10 * 1024 * 1024   # 10MB
    }
}
```
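
Since limits are currently set in code, one possible extension is to let optional environment variables override them. This is a sketch only: the variable naming scheme (for example `RATE_LIMIT_TRANSCRIBE_REQUESTS_PER_MINUTE`) and the helper `apply_env_overrides` are hypothetical and do not exist in Talk2Me today.

```python
import os
from typing import Dict, Mapping


def apply_env_overrides(
    limits: Dict[str, Dict[str, float]],
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Dict[str, float]]:
    """Return a copy of limits with values replaced by matching env vars."""
    result = {}
    for endpoint, settings in limits.items():
        # '/translate/stream' -> 'RATE_LIMIT_TRANSLATE_STREAM'
        prefix = "RATE_LIMIT_" + endpoint.strip("/").replace("/", "_").upper()
        result[endpoint] = {
            # Convert the override to the same type as the default (int or float).
            key: type(default)(env.get(f"{prefix}_{key.upper()}", default))
            for key, default in settings.items()
        }
    return result
```

Returning a copy rather than mutating the defaults keeps the in-code values available as a fallback.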

## Admin Endpoints

### Get Rate Limit Configuration
```bash
curl -H "X-Admin-Token: your-admin-token" \
  http://localhost:5005/admin/rate-limits
```

### Get Rate Limit Statistics
```bash
# Global stats
curl -H "X-Admin-Token: your-admin-token" \
  http://localhost:5005/admin/rate-limits/stats

# Client-specific stats
curl -H "X-Admin-Token: your-admin-token" \
  "http://localhost:5005/admin/rate-limits/stats?client_id=abc123"
```

### Block IP Address
```bash
# Temporary block (1 hour)
curl -X POST -H "X-Admin-Token: your-admin-token" \
  -H "Content-Type: application/json" \
  -d '{"ip": "192.168.1.100", "duration": 3600}' \
  http://localhost:5005/admin/block-ip

# Permanent block
curl -X POST -H "X-Admin-Token: your-admin-token" \
  -H "Content-Type: application/json" \
  -d '{"ip": "192.168.1.100", "permanent": true}' \
  http://localhost:5005/admin/block-ip
```

## Algorithm Details

### Token Bucket
- Each client gets a bucket whose capacity is the configured burst size
- Tokens regenerate at a fixed rate, up to that capacity
- Each request consumes a token
- When the bucket is empty, the request is denied
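
The steps above can be sketched as a small class. This is a generic token bucket written for illustration, not the code from `rate_limiter.py`; the class and parameter names are assumptions, and it assumes one token per request:

```python
import time
from typing import Optional


class TokenBucket:
    def __init__(self, burst_size: int, refresh_rate: float,
                 now: Optional[float] = None):
        self.capacity = burst_size
        self.refresh_rate = refresh_rate    # tokens added per second
        self.tokens = float(burst_size)     # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Consume one token if available; return False when the bucket is empty."""
        now = time.monotonic() if now is None else now
        # Refill for the elapsed time, never exceeding the burst capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refresh_rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `burst_size=3` and `refresh_rate=1/6`, a client can send three requests back to back and then roughly one more every six seconds, which matches the `/transcribe` settings.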

### Sliding Window
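- Tracks each client's requests over the last minute and the last hour
- More accurate than fixed windows, because the window moves with the current time instead of resetting on a schedule
- Prevents clients from gaming the limit by bursting just before and just after a fixed-window boundary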
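
A sliding window can be kept as a queue of request timestamps. The sketch below is a generic illustration with assumed names (`SlidingWindow`, `allow`), not the Talk2Me implementation:

```python
import time
from collections import deque
from typing import Deque, Optional


class SlidingWindow:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self._hits: Deque[float] = deque()    # timestamps of accepted requests

    def allow(self, now: Optional[float] = None) -> bool:
        """Accept the request unless `limit` requests already fall in the window."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have slid out of the window.
        while self._hits and self._hits[0] <= now - self.window:
            self._hits.popleft()
        if len(self._hits) >= self.limit:
            return False
        self._hits.append(now)
        return True
```

Tracking both the per-minute and per-hour limits would take two instances per client, for example `SlidingWindow(10, 60)` and `SlidingWindow(100, 3600)` for `/transcribe`.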

## Best Practices

### For Users
1. Implement exponential backoff when receiving 429 errors
2. Check rate limit headers to avoid hitting limits
3. Cache responses when possible
4. Use bulk operations where available
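
For the first recommendation, a client might compute its retry delays as in the sketch below. The defaults (1-second base, 60-second cap, 5 retries) and the name `backoff_delays` are illustrative choices, not values prescribed by Talk2Me:

```python
import random
from typing import Iterator


def backoff_delays(base: float = 1.0, cap: float = 60.0, retries: int = 5,
                   jitter: bool = True) -> Iterator[float]:
    """Yield one delay (in seconds) per retry, doubling each time up to cap."""
    for attempt in range(retries):
        delay = min(cap, base * 2 ** attempt)
        # Full jitter spreads retries out so clients do not retry in lockstep.
        yield random.uniform(0, delay) if jitter else delay
```

When a 429 response carries a `Retry-After` header, waiting at least that long is preferable to the computed delay.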

### For Administrators
1. Monitor rate limit statistics regularly
2. Adjust limits based on usage patterns
3. Use IP blocking sparingly
4. Set up alerts for suspicious activity

## Error Responses

### Rate Limited (429)
```json
{
  "error": "Rate limit exceeded (per minute)",
  "retry_after": 60
}
```

### Request Too Large (413)
```json
{
  "error": "Request too large"
}
```
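
A size check that produces this response could look like the following sketch. It uses the per-endpoint maximums documented earlier and assumes 1KB = 1024 bytes, in line with the `10 * 1024 * 1024` value in the configuration example; the function name and the `(status, body)` return shape are hypothetical:

```python
from typing import Optional, Tuple

MAX_REQUEST_SIZE = {
    "/transcribe": 10 * 1024 * 1024,    # 10MB
    "/translate": 100 * 1024,           # 100KB
    "/translate/stream": 100 * 1024,    # 100KB
    "/speak": 50 * 1024,                # 50KB
}


def check_request_size(endpoint: str,
                       content_length: Optional[int]) -> Tuple[int, dict]:
    """Return (status, body); 413 when the declared size exceeds the limit."""
    limit = MAX_REQUEST_SIZE.get(endpoint)
    if limit is not None and content_length is not None and content_length > limit:
        return 413, {"error": "Request too large"}
    return 200, {}
```

This only inspects the declared content length; a request without one would need to be checked while its body is read.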

### IP Blocked (429)
```json
{
  "error": "IP temporarily blocked due to excessive requests"
}
```

## Monitoring

Key metrics to monitor:
- Rate limit hits by endpoint
- Blocked IPs
- Concurrent request peaks
- Request size violations
- Global limit approaches

## Performance Impact

- Minimal overhead (~1-2ms per request)
- Memory usage scales with active clients
- Automatic cleanup of old buckets
- Thread-safe implementation
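
To illustrate how cleanup and thread safety can fit together, here is a minimal sketch of a store that drops idle client entries under a lock. The class `BucketStore`, its methods, and the 1-hour idle threshold are assumptions for illustration and are not taken from the Talk2Me code:

```python
import threading
import time
from typing import Dict, Optional


class BucketStore:
    def __init__(self, max_idle_seconds: float = 3600.0):
        self._lock = threading.Lock()
        self._last_seen: Dict[str, float] = {}
        self.max_idle = max_idle_seconds

    def touch(self, client_id: str, now: Optional[float] = None) -> None:
        """Record activity for a client."""
        now = time.monotonic() if now is None else now
        with self._lock:
            self._last_seen[client_id] = now

    def cleanup(self, now: Optional[float] = None) -> int:
        """Remove clients idle for longer than max_idle; return how many."""
        now = time.monotonic() if now is None else now
        with self._lock:
            stale = [cid for cid, seen in self._last_seen.items()
                     if now - seen > self.max_idle]
            for cid in stale:
                del self._last_seen[cid]
        return len(stale)

    def __len__(self) -> int:
        with self._lock:
            return len(self._last_seen)
```

A background thread calling `cleanup()` periodically keeps memory proportional to the number of recently active clients.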

## Security Considerations

1. **DoS Protection**: Prevents resource exhaustion
2. **Burst Control**: Limits sudden traffic spikes
3. **Size Validation**: Prevents large payload attacks
4. **IP Blocking**: Stops persistent attackers
5. **Global Limits**: Protects overall system capacity

## Troubleshooting

### "Rate limit exceeded" errors
- Check client request patterns
- Verify time synchronization
- Look for retry loops
- Check IP blocking status

### Memory usage increasing
- Verify cleanup thread is running
- Check for client ID explosion
- Monitor bucket count

### Legitimate users blocked
- Review rate limit settings
- Check for shared IP issues
- Implement IP whitelisting if needed