About Me

I am currently working as a DB Consultant with Ashnik. We are into Open Source and Cloud based enterprise class solutions. Prior to this I was working as a DBA @ Misys India (Banking Services), working on IBM DB2, Oracle and Microsoft SQL Server. I was selected as an IBM Champion for Data Management in 2011. I like a bit of photography, cycling and poetry, and I just love gadgets!!

Thursday, July 11, 2013

Recoverability of PostgreSQL-1: Recovery Concepts

Recently one of my colleagues told me that it's very difficult (he actually said "impossible") to do failback and switchback in PostgreSQL. Well, I thought of giving it a try. In my upcoming blog, I will be discussing the process and details of my experiment. But before that, it's very important to understand certain concepts that people generally get confused about, and thus end up with a messed-up recovery strategy. Let me discuss a few terms which are very important for designing your backup-recovery and disaster-recovery processes:

1) Archiving: Archiving your transaction logs/xlogs/WAL segments. This is not a substitute for creating a DR site; it is a prerequisite for taking a hot/online backup. And the same is the case with at least MS SQL Server and DB2.
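
For example, a minimal sketch of the archiving settings in postgresql.conf (the archive directory /backup/wal_archive here is just an assumed example path):

wal_level = archive                                # 'archive' or higher is needed for WAL archiving
archive_mode = on                                  # turn archiving on (needs a server restart)
archive_command = 'cp %p /backup/wal_archive/%f'   # %p = path of the WAL segment, %f = its file name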

2) Backup: A process in which you take a copy of your database. It can be logical, where you back up the definitions and contents of your tables, or it can be physical, where you back up the datafiles and other configuration files needed by your database/instance/cluster. Physical backups can be cold/offline or online/hot.
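
For example, a rough sketch of both flavours (the database name mydb and the /backup paths are assumed examples; the physical variant assumes WAL archiving is already on):

Logical backup:
$ pg_dump mydb > /backup/mydb.sql

Physical online/hot backup:
edb=# select pg_start_backup('nightly');
$ tar -czf /backup/base.tar.gz /opt/PostgresPlus/9.1AS/data
edb=# select pg_stop_backup();

Or, on 9.1+, in one step:
$ pg_basebackup -D /backup/base -Ft -z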

3) Crash Recovery: Your database service/server has crashed and your database is recovering from it. Generally, if your disk is intact, this can be achieved with the WAL segments alone. Whereas if your disk has crashed and you are restoring your database from a backup, you will need the archived WAL files plus the latest WAL files to apply the changes made since the backup was taken.
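
For instance, when restoring from a physical backup, a minimal recovery.conf sketch for replaying archived WAL (the archive path is the assumed example from above):

restore_command = 'cp /backup/wal_archive/%f %p'   # fetch archived WAL segments during recovery
#recovery_target_time = '2013-07-11 10:00:00'      # optional: stop the replay at a point in time

Plain crash recovery with an intact disk needs none of this; PostgreSQL replays the WAL in pg_xlog automatically at startup.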

4) Disaster Recovery: You had a crash after which your database server is not available, and you have to either rebuild the whole system from scratch using a backup or start using a DR site you had built for just this scenario.

5) Failover: A term which refers to switching over to your DR site when there is a major failure at your primary site.
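
In PostgreSQL terms this boils down to promoting the standby. A sketch, assuming a 9.1+ standby (the data directory path is an assumed example):

$ pg_ctl -D /var/lib/ppas/standby_data promote

On older releases, you would instead create the trigger file named by trigger_file in recovery.conf:

$ touch /tmp/postgresql.trigger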

6) Switchover: A term which refers to switching the roles of your primary and DR sites, either because the primary site has come up again after a recent major failure or because of some maintenance work for which you have to swap the roles.

7) Switchback: A term which means switching/restoring the roles of the "new primary" and "new DR" sites, i.e. the original primary site regains control. This is generally a planned activity, and everyone wants minimal downtime for it.

8) Failback: A term generally used interchangeably with switchback, but I would say it refers more to those scenarios where your "new primary" or "switched primary" server goes down for some reason and you now have to promote the "new DR" (which was originally the primary site) as primary again.

9) Warm Standby: A standby server which is an exact (almost real-time) replica of your primary server, always running in recovery mode; it cannot be accessed until a promotion (failover) is triggered.

10) Hot Standby: A standby server which is an exact (almost real-time) replica of your primary server, always running in recovery mode; it cannot be accessed for write commands until a promotion is triggered, but it can be used for read queries.
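
A minimal sketch of what separates the two standby modes in 9.x configuration terms:

# on the primary, postgresql.conf: WAL must carry the extra information hot standby needs
wal_level = hot_standby

# on the standby, postgresql.conf: allow read-only queries during recovery
hot_standby = on     # leave this off and you have a warm standby instead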

11) WAL Archive Replication: A replication process where the WAL segments which have been archived are copied over to and replayed on the standby server.
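
For example, a log-shipping standby's recovery.conf might look roughly like this (archive path assumed as before):

standby_mode = 'on'                                # keep replaying WAL instead of ending recovery
restore_command = 'cp /backup/wal_archive/%f %p'   # pull each archived segment as it appears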

12) Streaming Replication: A replication process where WAL records are streamed directly to the standby server (the standby need not wait for a whole WAL file to be completed and archived).
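
A sketch of the extra pieces streaming replication needs on top of the hot-standby settings above (the host, user, password and network are assumed examples):

# primary: postgresql.conf
max_wal_senders = 3    # allow walsender processes for standbys

# primary: pg_hba.conf
host replication repuser 192.168.1.0/24 md5

# standby: recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=192.168.1.10 port=5432 user=repuser password=secret'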

So folks, these are some of the terms that should give you a better insight into the concepts. And if you already have a recovery strategy, then hopefully these terms have helped you understand the process better.
I shall be back with a test which does failover, switchover (after a failover) and then switchback.

PS: Any comments or corrections about the terms and their definitions/explanations used here are more than welcome.

8 comments:

  1. Hi Sam,
    I am Aarya, new to PostgreSQL. I am trying out the point-in-time recovery concept, but I am not able to recover the data completely. Only the base data is getting recovered, and after that I get the messages below:
    2013-10-28 14:20:31 IST LOG: database system was shut down at 2013-10-28 14:17:23 IST
    2013-10-28 14:20:31 IST LOG: starting point-in-time recovery to 2013-10-28 14:05:33+05:30
    2013-10-28 14:20:32 IST LOG: restored log file "000000010000000000000011" from archive
    2013-10-28 14:20:32 IST LOG: invalid resource manager ID in primary checkpoint record
    2013-10-28 14:20:32 IST LOG: restored log file "000000010000000000000010" from archive
    2013-10-28 14:20:32 IST LOG: invalid xl_info in secondary checkpoint record
    2013-10-28 14:20:32 IST PANIC: could not locate a valid checkpoint record
    2013-10-28 14:20:32 IST LOG: startup process (PID 8479) was terminated by signal 6: Aborted
    2013-10-28 14:20:32 IST LOG: aborting startup due to startup process failure


    Please help me with this. I am not sure what the error messages "invalid xl_info in secondary checkpoint record" and "invalid resource manager ID in primary checkpoint record" mean.

    Replies
    1. Hi,

      Is this your production system? Can you give me the steps that you are using?

      If you need urgent professional support you can contact support@ashnik.com.

    2. Hi,
      No, it's not a production system. Can I send you the complete PITR video which I have recorded, showing the place where I am facing the error? The size of the video is 27 MB.

  2. Below are the steps I followed.
    Steps for PITR:
    1. As root: mkdir walbkp basebkp
    1.1 chown both walbkp and basebkp to enterprisedb
    2. su - enterprisedb; navigate to the data directory and mkdir wals.
    2.1 Change the settings in postgresql.conf as below:
    archive_mode=on
    archive_command='cp %p /opt/PostgresPlus/9.1AS/data/wals/%f'
    wal_level=archive
    3. As root, restart the cluster using:
    cd /etc/init.d
    ./ppas-9.1 restart
    4. su - enterprisedb; inside the db, check that all the settings are active, using
    show archive_mode; show archive_command; show wal_level;
    5. Now create a table t from pg_class and pg_description, to generate a large number of data blocks.
    6. While it is being created, check whether WAL segments are moving from pg_xlog to wals.
    7. Now start a hot backup (optional, but good to have in case the primary backup fails), using the commands inside the db:
    edb=# select pg_switch_xlog(); -- to switch to a new xlog file so the current WAL segment gets archived.
    edb=# select pg_start_backup('base'); -- to start the hot backup.

    8. As root, navigate to basebkp and create a tar file of the base cluster, i.e. data, using:
    basebkp# tar -czvf base.tar.gz /opt/PostgresPlus/9.1AS/data
    9. su - enterprisedb; inside the db, stop the hot backup using:
    edb=# select pg_stop_backup();
    10. Create tables like t2, t3, t4 etc. and note their time of creation using select current_timestamp.
    Note: make sure the tables are large in size, so use generate_series to create a huge number of entries.
    11. Once all the tables are created, crash the cluster using:
    kill -9 $(head -1 /opt/PostgresPlus/9.1AS/data/postmaster.pid)
    12. As root: mv data olddata -- move the crashed cluster aside.
    13. Now untar the base backup in basebkp using:
    basebkp# tar -xzvf base.tar.gz -- this extracts an opt directory.
    Navigate into it: cd opt/PostgresPlus/9.1AS
    and move this data directory to its home location, /opt/PostgresPlus/9.1AS.
    14. su - enterprisedb; now restart the cluster using:
    ./pg_ctl -D /opt/PostgresPlus/9.1AS/data/ restart -m f
    15. Connect to the db and check whether the initial table has been recovered.
    16. Then quit the db and stop the cluster using:
    ./pg_ctl -D /opt/PostgresPlus/9.1AS/data/ stop -m f
    17. As root, move the WAL files from the old cluster to walbkp using:
    cd /opt/PostgresPlus/9.1AS/olddata/wals
    mv 0* /opt/PostgresPlus/9.1AS/walbkp -- and make sure all entries have enterprisedb ownership.
    18. Copy the old cluster's pg_xlog to the new cluster using:
    cd /opt/PostgresPlus/9.1AS/olddata/pg_xlog
    cp 0* /opt/PostgresPlus/9.1AS/data/pg_xlog
    Note: we will be prompted about overwriting a few entries.
    19. Then create a recovery.conf in the new cluster's data directory:
    cd /opt/PostgresPlus/9.1AS/data/
    vi recovery.conf
    and inside the file write:
    restore_command='cp /opt/PostgresPlus/9.1AS/walbkp/%f %p'
    recovery_target_time='time of table to be recovered' -- give it down to the second, i.e. dd-mon-yy hh:mm:ss
    20. su - enterprisedb; now start the cluster using:
    ./pg_ctl -D /opt/PostgresPlus/9.1AS/data/ start
    21. Now the problem starts here, as I cannot log in to the database when trying ./psql edb.


    Replies
    1. Can you share the db server log file? I think there is an issue with the values for restore_command and archive_command that you used.



      archive_command should be set to copy the WAL files to a location different from the WAL directory of the original database cluster. restore_command must be set to restore the WAL files from this archive location to the WAL directory of the new database cluster/instance.

      * Create a directory called /pg_arch/ (owned by the Postgres OS user)
      * Set archive_command='cp %p /pg_arch/%f' in postgresql.conf
      * Set restore_command='cp /pg_arch/%f %p' in recovery.conf


      Repeat the whole process you have posted, then post the result.

  3. Hi, please find the details below. The issue is still not solved. I applied the above solution as follows:
    I created pg_arch in /opt/PostgresPlus/9.1AS/ with enterprisedb permissions, and set

    archive_command='cp %p /opt/PostgresPlus/9.1AS/pg_arch/%f' -- in postgresql.conf
    and in recovery.conf:
    restore_command='cp /opt/PostgresPlus/9.1AS/pg_arch/%f %p'

    Now the problem is, it is able to open the db server, but it is not restoring the tables. It displays the messages below:
    2013-11-21 12:50:59 IST LOG: database system was shut down at 2013-11-21 12:47:27 IST
    2013-11-21 12:50:59 IST LOG: starting point-in-time recovery to 2013-11-21 12:42:10+05:30
    2013-11-21 12:50:59 IST LOG:

    ** EnterpriseDB Dynamic Tuning Agent ********************************************
    * System Utilization: 66 % *
    * Database Version: 9.1.2.2 *
    * Operating System Version: *
    * Number of Processors: 0 *
    * Processor Type: *
    * Processor Architecture: *
    * Database Size: 0.5 GB *
    * RAM: 1.0 GB *
    * Shared Memory: 1011 MB *
    * Max DB Connections: 104 *
    * Autovacuum: on *
    * Autovacuum Naptime: 60 Seconds *
    * InfiniteCache: off *
    * InfiniteCache Servers: 0 *
    * InfiniteCache Size: 0.000 GB *
    *********************************************************************************

    2013-11-21 12:51:00 IST LOG: restored log file "000000010000000000000021" from archive
    2013-11-21 12:51:00 IST LOG: invalid resource manager ID in primary checkpoint record
    2013-11-21 12:51:00 IST LOG: restored log file "000000010000000000000020" from archive
    2013-11-21 12:51:00 IST LOG: using previous checkpoint record at 0/20007E9C
    2013-11-21 12:51:00 IST LOG: consistent recovery state reached at 0/20007EEC
    2013-11-21 12:51:00 IST LOG: redo starts at 0/20007EEC
    2013-11-21 12:51:01 IST LOG: restored log file "000000010000000000000021" from archive
    2013-11-21 12:51:01 IST LOG: record with incorrect prev-link 0/20007EC0 at 0/21000020
    2013-11-21 12:51:01 IST LOG: redo done at 0/20007EEC
    2013-11-21 12:51:01 IST LOG: restored log file "000000010000000000000020" from archive
    cp: cannot stat `/opt/PostgresPlus/9.1AS/pg_arch/00000002.history': No such file or directory
    2013-11-21 12:51:01 IST LOG: selected new timeline ID: 2
    cp: cannot stat `/opt/PostgresPlus/9.1AS/pg_arch/00000001.history': No such file or directory
    2013-11-21 12:51:01 IST LOG: archive recovery complete
    2013-11-21 12:51:01 IST LOG: autovacuum launcher started
    2013-11-21 12:51:01 IST LOG: database system is ready to accept connections


    Regards,
    Arvind

    Replies
    1. You should be able to connect to the database after this. Regarding the tables not getting recovered, can you confirm the time you mentioned for PITR in recovery.conf? Is it a timestamp from while the table was getting created, or from after the table was created and committed?

    2. This was the info inside the recovery.conf file:
      restore_command='cp /opt/PostgresPlus/9.1AS/pg_arch/%f %p'
      recovery_target_time='21-NOV-13 12:42:10'
