Recently one of my colleagues told me that, it’s very difficult (he actually said "impossible") to do fail back and switchback in PostgreSQL. Well, I thought of giving it a try. In my upcoming blog, I will be discussing the process and details of my experiment. But before that, it’s very important to understand certain concepts that people get generally confused with, thus ending up with a messed up recovery strategy. Let me discuss a few terms which are very important for designing your backup-recovery and disaster-recovery processes:
1) Archiving: Means archiving your transaction/xlog/wal segments/WAL. This is not a substitute for creation of DR site. It is required for taking a hot backup/online backup. And so is the case with at least MS SQL Server and DB2.
2) Backup: A process in which you take a backup of your database. It can be logical where you take a backup of tables and rows or it can be physical where you backup the datafiles and other configuration files needed by your database/instance/cluster. Physical backups can be cold/offline or can be online/hot.
3) Crash Recovery: You had a crash of your database service/server and your database is recovering from the same. Generally if your disk is intact this can be achieved with your WAL segments/WAL itself. Whereas, if your disk has crashed and you are restoring your database from a backup you will need to archive files and latest WAL files for applying the changes since the backup was taken.
4) Disaster Recovery: You had a crash after which your database server is not available and you have to either rebuild the whole system from scratch using a backup or have to start using a DR site you had built for just this scenario.
5) Fail over: A term which refers to using your DR site in case there is a major failure at your primary site.
6) Switchover: A term which refers to switching the roles of your Primary and DR sites either because Primary site has come up again after a recent major failure or due to some maintenance work for which you have to switch the roles.
7) Switchback: A term which means switching back/restoring back the roles of "new primary" and "new DR" sites. i.e. the original Primary site regains control. This is generally a planned activity and everyone desires to have minimal downtime for this.
8) Fail back: A term which is generally used interchangeably for switchback, but I would think it is more of those scenarios when your "new primary" or "switched primary" server goes down for some reason and you have to now promote the "new DR" (which originally was ’primary’ site in start of time) as Primary
9) Warm standby: A standby server which is an exactly (almost real time) replica of your primary server and is always available in recovery mode and cannot be accessed until a recovery is triggered
10) Hot Standby: A standby server which is an exactly (almost real time) replica of your primary server and is always available in recovery mode and cannot be accessed for write commands until a recovery is triggered. But this server can be utilized for read queries
11) WAL Archive Replication: A replication process where the WAL segments which have been archived are copied and replayed on the standby server
12) Streaming Replication: A replication process where the WAL segments are copied directly to the standby servers (standby server need not wait for the WAL files to be archived)
So folks, these are some the terms that would help give you a better insight into understanding the concepts. And if you already have a recovery strategy, then hopefully these terms may have helped you to understand the process in a better way.
I shall be back with a test which does fail-over, switchover (after a fail-over) and then switchback.