About Me

I am currently working as a DB Consultant with Ashnik. We are into Open Source and Cloud based enterprise class solutions. Prior to this I was working as a DBA @ Misys India (Banking Services), working on IBM DB2, Oracle and Microsoft SQL Server. I was selected as an IBM Champion for Data Management in 2011. I like a bit of photography, cycling and poetry, and I just love gadgets!!

Thursday, July 11, 2013

Recoverability of PostgreSQL-1: Recovery Concepts

Recently one of my colleagues told me that it's very difficult (he actually said "impossible") to do failback and switchback in PostgreSQL. Well, I thought of giving it a try. In my upcoming blog, I will be discussing the process and details of my experiment. But before that, it's very important to understand certain concepts that people generally get confused about, and thus end up with a messed-up recovery strategy. Let me discuss a few terms which are very important for designing your backup-recovery and disaster-recovery processes:

1) Archiving: Archiving your transaction logs/xlogs/WAL segments. This is not a substitute for creating a DR site; it is a prerequisite for taking a hot/online backup. And the same is the case with at least MS SQL Server and DB2.
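
For example, a minimal sketch of the archiving settings in postgresql.conf (the archive directory /backup/wal_archive here is just an assumed example path):

wal_level = archive                                # 'archive' or higher is needed for WAL archiving
archive_mode = on                                  # turn archiving on (needs a server restart)
archive_command = 'cp %p /backup/wal_archive/%f'   # %p = path of the WAL segment, %f = its file name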

2) Backup: A process in which you take a copy of your database. It can be logical, where you back up the definitions and contents of your tables, or it can be physical, where you back up the datafiles and other configuration files needed by your database/instance/cluster. Physical backups can be cold/offline or online/hot.
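
For example, a rough sketch of both flavours (the database name mydb and the /backup paths are assumed examples; the physical variant assumes WAL archiving is already on):

Logical backup:
$ pg_dump mydb > /backup/mydb.sql

Physical online/hot backup:
edb=# select pg_start_backup('nightly');
$ tar -czf /backup/base.tar.gz /opt/PostgresPlus/9.1AS/data
edb=# select pg_stop_backup();

Or, on 9.1+, in one step:
$ pg_basebackup -D /backup/base -Ft -z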

3) Crash Recovery: Your database service/server has crashed and your database is recovering from it. Generally, if your disk is intact, this can be achieved with the WAL segments alone. Whereas if your disk has crashed and you are restoring your database from a backup, you will need the archived WAL files plus the latest WAL files to apply the changes made since the backup was taken.
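
For instance, when restoring from a physical backup, a minimal recovery.conf sketch for replaying archived WAL (the archive path is the assumed example from above):

restore_command = 'cp /backup/wal_archive/%f %p'   # fetch archived WAL segments during recovery
#recovery_target_time = '2013-07-11 10:00:00'      # optional: stop the replay at a point in time

Plain crash recovery with an intact disk needs none of this; PostgreSQL replays the WAL in pg_xlog automatically at startup.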

4) Disaster Recovery: You had a crash after which your database server is not available, and you have to either rebuild the whole system from scratch using a backup or start using a DR site you had built for just this scenario.

5) Failover: A term which refers to switching over to your DR site when there is a major failure at your primary site.
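
In PostgreSQL terms this boils down to promoting the standby. A sketch, assuming a 9.1+ standby (the data directory path is an assumed example):

$ pg_ctl -D /var/lib/ppas/standby_data promote

On older releases, you would instead create the trigger file named by trigger_file in recovery.conf:

$ touch /tmp/postgresql.trigger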

6) Switchover: A term which refers to switching the roles of your primary and DR sites, either because the primary site has come up again after a recent major failure or because of some maintenance work for which you have to swap the roles.

7) Switchback: A term which means switching/restoring the roles of the "new primary" and "new DR" sites, i.e. the original primary site regains control. This is generally a planned activity, and everyone wants minimal downtime for it.

8) Failback: A term generally used interchangeably with switchback, but I would say it refers more to those scenarios where your "new primary" or "switched primary" server goes down for some reason and you now have to promote the "new DR" (which was originally the primary site) as primary again.

9) Warm Standby: A standby server which is an exact (almost real-time) replica of your primary server, always running in recovery mode; it cannot be accessed until a promotion (failover) is triggered.

10) Hot Standby: A standby server which is an exact (almost real-time) replica of your primary server, always running in recovery mode; it cannot be accessed for write commands until a promotion is triggered, but it can be used for read queries.
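
A minimal sketch of what separates the two standby modes in 9.x configuration terms:

# on the primary, postgresql.conf: WAL must carry the extra information hot standby needs
wal_level = hot_standby

# on the standby, postgresql.conf: allow read-only queries during recovery
hot_standby = on     # leave this off and you have a warm standby instead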

11) WAL Archive Replication: A replication process where the WAL segments which have been archived are copied over to and replayed on the standby server.
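
For example, a log-shipping standby's recovery.conf might look roughly like this (archive path assumed as before):

standby_mode = 'on'                                # keep replaying WAL instead of ending recovery
restore_command = 'cp /backup/wal_archive/%f %p'   # pull each archived segment as it appears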

12) Streaming Replication: A replication process where WAL records are streamed directly to the standby server (the standby need not wait for a whole WAL file to be completed and archived).
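
A sketch of the extra pieces streaming replication needs on top of the hot-standby settings above (the host, user, password and network are assumed examples):

# primary: postgresql.conf
max_wal_senders = 3    # allow walsender processes for standbys

# primary: pg_hba.conf
host replication repuser 192.168.1.0/24 md5

# standby: recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=192.168.1.10 port=5432 user=repuser password=secret'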

So folks, these are some of the terms that should give you a better insight into the concepts. And if you already have a recovery strategy, then hopefully these terms have helped you understand the process better.
I shall be back with a test which does failover, switchover (after a failover) and then switchback.

PS: Any comments or corrections about the terms and their definitions/explanations used here are more than welcome.

8 comments:

  1. Hi Sam,
    I am Aarya, new to PostgreSQL. I am trying out the point-in-time recovery concept, but I am not able to recover the data completely. Only the base data is getting recovered, and after that I get the messages below:
    2013-10-28 14:20:31 IST LOG: database system was shut down at 2013-10-28 14:17:23 IST
    2013-10-28 14:20:31 IST LOG: starting point-in-time recovery to 2013-10-28 14:05:33+05:30
    2013-10-28 14:20:32 IST LOG: restored log file "000000010000000000000011" from archive
    2013-10-28 14:20:32 IST LOG: invalid resource manager ID in primary checkpoint record
    2013-10-28 14:20:32 IST LOG: restored log file "000000010000000000000010" from archive
    2013-10-28 14:20:32 IST LOG: invalid xl_info in secondary checkpoint record
    2013-10-28 14:20:32 IST PANIC: could not locate a valid checkpoint record
    2013-10-28 14:20:32 IST LOG: startup process (PID 8479) was terminated by signal 6: Aborted
    2013-10-28 14:20:32 IST LOG: aborting startup due to startup process failure


    Please help me with this. I am not sure what the error messages "invalid xl_info in secondary checkpoint record" and "invalid resource manager ID in primary checkpoint record" mean.

    Replies
    1. Hi,

      Is this your production system? Can you give me the steps that you are using?

      If you need urgent professional support you can contact support@ashnik.com.

    2. Hi,
      No, it's not a production system. Can I send you the complete PITR video which I have recorded, showing the place where I am facing the error? The size of the video is 27 MB.

  2. Below are the steps I followed.
    Steps for PITR:
    1. As root: mkdir walbkp basebkp
    1.1 chown both walbkp and basebkp to enterprisedb
    2. su - enterprisedb; navigate to the data directory and mkdir wals.
    2.1 Change the settings in postgresql.conf as below:
    archive_mode=on
    archive_command='cp %p /opt/PostgresPlus/9.1AS/data/wals/%f'
    wal_level=archive
    3. As root, restart the cluster using:
    cd /etc/init.d
    ./ppas-9.1 restart
    4. su - enterprisedb; inside the db, check that all the settings are active, using
    show archive_mode; show archive_command; show wal_level;
    5. Now create a table t from pg_class and pg_description, to generate a large number of data blocks.
    6. While it is being created, check whether WAL segments are moving from pg_xlog to wals.
    7. Now start a hot backup (optional, but good to have in case the primary backup fails), using the commands inside the db:
    edb=# select pg_switch_xlog(); -- to switch to a new xlog file so the current WAL segment gets archived.
    edb=# select pg_start_backup('base'); -- to start the hot backup.

    8. As root, navigate to basebkp and create a tar file of the base cluster, i.e. data, using:
    basebkp# tar -czvf base.tar.gz /opt/PostgresPlus/9.1AS/data
    9. su - enterprisedb; inside the db, stop the hot backup using:
    edb=# select pg_stop_backup();
    10. Create tables like t2, t3, t4 etc. and note their time of creation using select current_timestamp.
    Note: make sure the tables are large in size, so use generate_series to create a huge number of entries.
    11. Once all the tables are created, crash the cluster using:
    kill -9 $(head -1 /opt/PostgresPlus/9.1AS/data/postmaster.pid)
    12. As root: mv data olddata -- move the crashed cluster aside.
    13. Now untar the base backup in basebkp using:
    basebkp# tar -xzvf base.tar.gz -- this extracts an opt directory.
    Navigate into it: cd opt/PostgresPlus/9.1AS
    and move this data directory to its home location, /opt/PostgresPlus/9.1AS.
    14. su - enterprisedb; now restart the cluster using:
    ./pg_ctl -D /opt/PostgresPlus/9.1AS/data/ restart -m f
    15. Connect to the db and check whether the initial table has been recovered.
    16. Then quit the db and stop the cluster using:
    ./pg_ctl -D /opt/PostgresPlus/9.1AS/data/ stop -m f
    17. As root, move the WAL files from the old cluster to walbkp using:
    cd /opt/PostgresPlus/9.1AS/olddata/wals
    mv 0* /opt/PostgresPlus/9.1AS/walbkp -- and make sure all entries have enterprisedb ownership.
    18. Copy the old cluster's pg_xlog to the new cluster using:
    cd /opt/PostgresPlus/9.1AS/olddata/pg_xlog
    cp 0* /opt/PostgresPlus/9.1AS/data/pg_xlog
    Note: we will be prompted about overwriting a few entries.
    19. Then create a recovery.conf in the new cluster's data directory:
    cd /opt/PostgresPlus/9.1AS/data/
    vi recovery.conf
    and inside the file write:
    restore_command='cp /opt/PostgresPlus/9.1AS/walbkp/%f %p'
    recovery_target_time='time of table to be recovered' -- give it down to the second, i.e. dd-mon-yy hh:mm:ss
    20. su - enterprisedb; now start the cluster using:
    ./pg_ctl -D /opt/PostgresPlus/9.1AS/data/ start
    21. Now the problem starts here, as I cannot log in to the database when trying ./psql edb.


    Replies
    1. Can you share the db server log file? I think there is an issue with the values for restore_command and archive_command that you used.



      archive_command should be set to copy the WAL files to a location different from the WAL directory of the original database cluster. restore_command must be set to restore the WAL files from this archive location to the WAL directory of the new database cluster/instance.

      * Create a directory called /pg_arch/ (owned by the Postgres OS user)
      * Set archive_command='cp %p /pg_arch/%f' in postgresql.conf
      * Set restore_command='cp /pg_arch/%f %p' in recovery.conf


      Repeat the whole process you have posted, then post the result.

  3. Hi, please find the details below. The issue is still not solved. I applied the above solution as follows:
    I created pg_arch in /opt/PostgresPlus/9.1AS/ with enterprisedb permissions, and set

    archive_command='cp %p /opt/PostgresPlus/9.1AS/pg_arch/%f' -- in postgresql.conf
    and in recovery.conf:
    restore_command='cp /opt/PostgresPlus/9.1AS/pg_arch/%f %p'

    Now the problem is, it is able to open the db server, but it is not restoring the tables. It displays the messages below:
    2013-11-21 12:50:59 IST LOG: database system was shut down at 2013-11-21 12:47:27 IST
    2013-11-21 12:50:59 IST LOG: starting point-in-time recovery to 2013-11-21 12:42:10+05:30
    2013-11-21 12:50:59 IST LOG:

    ** EnterpriseDB Dynamic Tuning Agent ********************************************
    * System Utilization: 66 % *
    * Database Version: 9.1.2.2 *
    * Operating System Version: *
    * Number of Processors: 0 *
    * Processor Type: *
    * Processor Architecture: *
    * Database Size: 0.5 GB *
    * RAM: 1.0 GB *
    * Shared Memory: 1011 MB *
    * Max DB Connections: 104 *
    * Autovacuum: on *
    * Autovacuum Naptime: 60 Seconds *
    * InfiniteCache: off *
    * InfiniteCache Servers: 0 *
    * InfiniteCache Size: 0.000 GB *
    *********************************************************************************

    2013-11-21 12:51:00 IST LOG: restored log file "000000010000000000000021" from archive
    2013-11-21 12:51:00 IST LOG: invalid resource manager ID in primary checkpoint record
    2013-11-21 12:51:00 IST LOG: restored log file "000000010000000000000020" from archive
    2013-11-21 12:51:00 IST LOG: using previous checkpoint record at 0/20007E9C
    2013-11-21 12:51:00 IST LOG: consistent recovery state reached at 0/20007EEC
    2013-11-21 12:51:00 IST LOG: redo starts at 0/20007EEC
    2013-11-21 12:51:01 IST LOG: restored log file "000000010000000000000021" from archive
    2013-11-21 12:51:01 IST LOG: record with incorrect prev-link 0/20007EC0 at 0/21000020
    2013-11-21 12:51:01 IST LOG: redo done at 0/20007EEC
    2013-11-21 12:51:01 IST LOG: restored log file "000000010000000000000020" from archive
    cp: cannot stat `/opt/PostgresPlus/9.1AS/pg_arch/00000002.history': No such file or directory
    2013-11-21 12:51:01 IST LOG: selected new timeline ID: 2
    cp: cannot stat `/opt/PostgresPlus/9.1AS/pg_arch/00000001.history': No such file or directory
    2013-11-21 12:51:01 IST LOG: archive recovery complete
    2013-11-21 12:51:01 IST LOG: autovacuum launcher started
    2013-11-21 12:51:01 IST LOG: database system is ready to accept connections


    Regards,
    Arvind

    Replies
    1. You should be able to connect to the database after this. Regarding the tables not getting recovered, can you confirm the time you mentioned for PITR in recovery.conf? Is it a timestamp from while the table was getting created, or from after the table was created and committed?

    2. This was the info inside the recovery.conf file:
      restore_command='cp /opt/PostgresPlus/9.1AS/pg_arch/%f %p'
      recovery_target_time='21-NOV-13 12:42:10'
