Redundancy, Failover, and Backup
Service Health Checks

Every microservice is monitored with real-time health probes:
Liveness checks: Detect if a service is frozen or stuck
Readiness checks: Ensure a service is ready to accept traffic
Integrated into Kubernetes for automated recovery
Alerting is triggered if repeated failures are detected
These checks run every few seconds, with detailed logs and metrics stored for diagnostics.
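As a minimal sketch of how such probes might be wired up (assuming an HTTP service; the example-service name, /healthz and /ready paths, port, and timing values are illustrative assumptions, not the actual production settings):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-service          # hypothetical service name
  labels:
    app: example-service
spec:
  containers:
    - name: example-service
      image: registry.example.com/example-service:1.0.0   # placeholder image
      ports:
        - containerPort: 8080
      livenessProbe:             # detects a frozen or stuck process
        httpGet:
          path: /healthz         # assumed health endpoint
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5         # "every few seconds"
        failureThreshold: 3      # repeated failures before the container is restarted
      readinessProbe:            # gates traffic until the service is ready
        httpGet:
          path: /ready           # assumed readiness endpoint
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
```

The failureThreshold is what turns repeated failures into action: liveness failures cause the kubelet to restart the container, while readiness failures simply remove the pod from the Service's endpoints until it recovers.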
Auto-Restart Policies
Failed services are automatically restarted using orchestrator policies:
CrashLoopBackOff prevention with exponential backoff and capped retries
Pod auto-restart on failure (Kubernetes restartPolicy: Always)
Containers are stateless; state is stored in external volumes or databases
Critical services have separate failover containers spun up on different nodes.
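One way this can look in a manifest, sketched here for a hypothetical critical-service Deployment, is an explicit restartPolicy: Always plus pod anti-affinity so the standby replica is scheduled on a different node; the names, labels, and image are assumptions, not the actual configuration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-service          # hypothetical service name
spec:
  replicas: 2                     # second replica acts as the failover instance
  selector:
    matchLabels:
      app: critical-service
  template:
    metadata:
      labels:
        app: critical-service
    spec:
      restartPolicy: Always       # failed containers are restarted automatically
      affinity:
        podAntiAffinity:          # keep the replicas on different nodes
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: critical-service
              topologyKey: kubernetes.io/hostname
      containers:
        - name: critical-service
          image: registry.example.com/critical-service:1.0.0   # placeholder image
          # No local state: data lives in external volumes or databases, per the policy above.
```

The exponential backoff between restarts (CrashLoopBackOff) is applied by the kubelet itself and is not something set in the manifest.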
Geo-Replication and Database Backups
Data is protected through multiple layers:
Geo-replication: Core databases (PostgreSQL, MongoDB) are mirrored across 2+ regions
Daily encrypted backups to S3-compatible storage
Point-in-time recovery (PITR) available for critical systems
Backup integrity is tested weekly with restore drills
Write-ahead logs (WAL) are also archived for fine-grained recovery.
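A hypothetical sketch of the daily encrypted backup for PostgreSQL, written as a Kubernetes CronJob; the image, Secret, bucket, and environment variable names are all assumptions rather than the actual tooling:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-daily-backup
spec:
  schedule: "0 2 * * *"            # once a day, off-peak
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              # Hypothetical image bundling pg_dump, gpg, and an S3-compatible client.
              image: registry.example.com/db-backup:latest
              envFrom:
                - secretRef:
                    # Hypothetical Secret providing DATABASE_URL, BACKUP_PASSPHRASE, S3_ENDPOINT.
                    name: backup-credentials
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # Dump the database, encrypt it client-side, and stream it to S3-compatible storage.
                  pg_dump "$DATABASE_URL" \
                    | gpg --batch --symmetric --passphrase "$BACKUP_PASSPHRASE" \
                    | aws s3 cp - "s3://backups/postgres/$(date +%F).sql.gpg" \
                        --endpoint-url "$S3_ENDPOINT"
```

WAL archiving and point-in-time recovery are configured on the database side (for PostgreSQL, via archive_command and periodic base backups) rather than in a job like this.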
Rollback Strategies
If a new deployment causes issues:
Blue-green deployments allow a quick switch back to the previous version (see the sketch after this list)
Canary deployments minimize risk by exposing changes to a small subset of users
Each release is versioned and stored in a container registry
Rollbacks are automated with kubectl rollout undo or Helm
Database migrations are reversible with Liquibase or Alembic.
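A minimal sketch of the blue-green switch, assuming two Deployments labeled track: blue and track: green sitting behind one Service; the names and labels are illustrative assumptions:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-service
spec:
  selector:
    app: example-service
    track: blue            # flip to "green" to cut over, back to "blue" to roll back
  ports:
    - port: 80
      targetPort: 8080
```

Flipping the track selector back to the previous color reverts traffic in a single change; for in-place Deployment rollbacks, kubectl rollout undo returns the Deployment to its previous revision, pulling the prior versioned image from the container registry.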