🎚️Redundancy, Failover, and Backup

Service Health Checks

Every microservice is monitored with real-time health probes:

  • Liveness checks: Detect if a service is frozen or stuck

  • Readiness checks: Ensure a service is ready to accept traffic

  • Integrated into Kubernetes for automated recovery

  • Alerting is triggered if repeated failures are detected

The checks run every few seconds, and detailed logs and metrics are stored for diagnostics.
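
As an illustration of the difference between the two probe types, here is a minimal sketch of a service exposing both. The paths /healthz and /readyz, the STATE flags, and port 8080 are assumptions chosen for the example, not details of the platform described here.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical shared state; a real service would update these flags itself.
STATE = {"alive": True, "ready": False}


def probe_status(path, state):
    """Map a probe path to the HTTP status code the orchestrator will see."""
    if path == "/healthz":   # liveness: is the process still responsive?
        return 200 if state["alive"] else 500
    if path == "/readyz":    # readiness: can it accept traffic right now?
        return 200 if state["alive"] and state["ready"] else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status = probe_status(self.path, STATE)
        self.send_response(status)
        self.end_headers()
        self.wfile.write(b"ok" if status == 200 else b"unavailable")

    def log_message(self, format, *args):
        pass  # keep frequent probe requests out of stderr


def serve(port=8080):
    """Serve probe requests until the process is stopped (blocks)."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

A service that is still warming up would answer 200 on the liveness path and 503 on the readiness path, so it is neither restarted nor sent traffic. Calling serve() starts the listener.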


Auto-Restart Policies

Failed services are automatically restarted using orchestrator policies:

  • Crash loops are contained with exponential backoff and capped retries (Kubernetes reports a pod in this state as CrashLoopBackOff)

  • Pod auto-restart on failure (Kubernetes restartPolicy: Always)

  • Containers are stateless; state is stored in external volumes or databases

Critical services have separate failover containers spun up on different nodes, so the loss of a single node does not take the service down.
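
The backoff behaviour above can be sketched as a delay schedule. The defaults below (a 10-second base that doubles on each restart and is capped at 300 seconds) mirror the kubelet's documented behaviour for crashing containers; the function name and the restarts parameter are hypothetical and only serve the example.

```python
def restart_delays(restarts, base=10, cap=300):
    """Seconds to wait before each of the first `restarts` restart attempts.

    The delay doubles after every failure and never exceeds `cap`.
    """
    return [min(base * 2 ** attempt, cap) for attempt in range(restarts)]


print(restart_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

Capping the delay keeps a persistently failing service from hammering its dependencies, while still retrying often enough to recover once the underlying fault clears.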


Geo-Replication and Database Backups

Data is protected through multiple layers:

  • Geo-replication: Core databases (PostgreSQL, MongoDB) are replicated across two or more regions

  • Daily encrypted backups to S3-compatible storage

  • Point-in-time recovery (PITR) available for critical systems

  • Backup integrity is tested weekly with restore drills

PostgreSQL write-ahead logs (WAL) are also archived; replaying them on top of a base backup is what allows recovery to a specific point in time.
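
Point-in-time recovery starts from the latest base backup taken at or before the desired recovery target and then replays archived logs up to that moment. The selection step can be sketched as follows; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime


def pick_base_backup(backup_times, target):
    """Return the most recent base backup taken at or before `target`."""
    candidates = [t for t in backup_times if t <= target]
    if not candidates:
        raise ValueError("no base backup precedes the recovery target")
    return max(candidates)


# Hypothetical daily backups taken at 02:00.
backups = [datetime(2024, 5, day, 2, 0) for day in (1, 2, 3)]
target = datetime(2024, 5, 2, 14, 30)

# Restore this backup, then replay archived logs up to `target`.
print(pick_base_backup(backups, target))  # 2024-05-02 02:00:00
```

With daily backups, at most about a day of archived logs has to be replayed to reach any target.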


Rollback Strategies

If a new deployment causes issues:

  • Blue-green deployments allow quick switch back to the previous version

  • Canary deployments minimize risk by exposing changes to a small subset of users

  • Each release is versioned and stored in a container registry

  • Rollbacks are automated with kubectl rollout undo or helm rollback

Database migrations are written to be reversible, using Liquibase rollbacks or Alembic downgrades.
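
One common way to expose a canary release to a small, stable subset of users is to hash each user ID into a fixed bucket. The sketch below assumes string user IDs and a percentage-based rollout; it illustrates the idea and is not the routing logic this platform necessarily uses.

```python
import hashlib


def in_canary(user_id, percent):
    """Deterministically decide whether a user is routed to the canary.

    The same user always lands in the same bucket, so raising `percent`
    only adds users to the canary group and never reshuffles them.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < percent


# Hypothetical rollout: send roughly 5% of users to the new version.
version = "canary" if in_canary("user-1234", 5) else "stable"
```

Rolling back then amounts to setting the percentage to 0, which routes every user to the stable version again.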
