In my last post, I discussed some concepts of backup, recovery and high availability. As promised, in this post I will cover the specifics of implementing a hot standby server, along with failover and switchover scenarios.
Step 1: Installation
The first step is to install PPAS (the same version) on two different servers (running the same OS and OS version). For this blog, I tried these steps on two servers, each running Postgres Plus Advanced Server 9.2.1 on RHEL 6.1. For the sake of simplicity, instead of overwriting the default DB cluster/instance on the secondary DB server, I created a separate DB instance:
- Primary DB Server:
  - Name: dbserver1
  - IP Address: 192.168.160.147
  - Data Directory: /opt/PostgresPlus/9.2AS/data
  - Port: 5444
- Standby DB Server:
  - Name: dbserver2
  - IP Address: 192.168.160.150
  - Data Directory: /opt/PostgresPlus/9.2AS/data2
  - Port: 5222
Step 2: Parameter Setup
Change the below parameters in postgresql.conf:
- wal_level = hot_standby
- max_wal_senders = 3
- wal_keep_segments = 128 (optional, depending on load)
- replication_timeout = 5s (optional)
- hot_standby = on (required; effective only on the hot standby server)
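Put together, the relevant postgresql.conf fragment would look like the one below (values are from this setup; wal_keep_segments and replication_timeout are tuning choices, not requirements):

```
wal_level = hot_standby          # required for a hot standby
max_wal_senders = 3              # max concurrent replication connections
wal_keep_segments = 128          # optional; depends on load
replication_timeout = 5s         # optional
hot_standby = on                 # ignored on the primary, required on the standby
```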
To ease the pain of backup/restore and failover/failback, I created two postgresql.conf backups on the primary server (under the data directory):
1. postgresql.conf.dbserver1
2. postgresql.conf.dbserver2
The two files are identical, the only difference being the value of port (dbserver1=5444 and dbserver2=5222). The replication-related parameters and hot_standby are the same in both files: the replication parameters make no difference on the secondary server unless you use cascaded replication, and the hot_standby value is ignored on the primary server.
Add a new entry in pg_hba.conf:
host    replication    all    192.168.160.0/24    trust
Step 3: Create recovery.conf
Create two dummy recovery.conf files, namely recovery.conf.dbserver1_is_master and recovery.conf.dbserver2_is_master.
recovery.conf.dbserver1_is_master (to be used on dbserver2, the standby):
standby_mode = 'on'
primary_conninfo = 'host=192.168.160.147 port=5444 user=enterprisedb password=ashnik'
restore_command = 'scp enterprisedb@192.168.160.147:/mnt/arch/%f %p'  # optional; needs archiving to be enabled on the primary
recovery_target_timeline = 'latest'  # optional
trigger_file = '/opt/PostgresPlus/9.2AS/data2/recover.trigger'
recovery.conf.dbserver2_is_master (to be used on dbserver1, the standby):
standby_mode = 'on'
primary_conninfo = 'host=192.168.160.150 port=5222 user=enterprisedb password=ashnik'
restore_command = 'scp enterprisedb@192.168.160.150:/mnt/arch/%f %p'  # optional; needs archiving to be enabled on the primary
recovery_target_timeline = 'latest'  # optional
trigger_file = '/opt/PostgresPlus/9.2AS/data/recover.trigger'
For ease of management, you should keep the passwords the same on both servers. Here the password is used in plain text, but one can always use a password file (.pgpass) or md5 passwords.
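For example, instead of embedding the password in primary_conninfo, a .pgpass file in the enterprisedb user's home directory could hold it; the fields are host:port:database:username:password, and the file must be mode 0600. A sketch for this setup (entries are assumptions based on the addresses and ports above):

```
# ~/.pgpass on each server; chmod 600 ~/.pgpass
192.168.160.147:5444:replication:enterprisedb:ashnik
192.168.160.150:5222:replication:enterprisedb:ashnik
```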
Step 4: Backup
You need to take a cold backup of your primary server and restore it on the secondary server. Along with the rest of the files in the backup, you will also get the files below:
- postgresql.conf.dbserver1
- postgresql.conf.dbserver2
- recovery.conf.dbserver1_is_master
- recovery.conf.dbserver2_is_master
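A cold backup here is simply a file-level copy of the data directory taken while the cluster is stopped. A minimal sketch, assuming the paths from Step 1 and passwordless scp between the two servers (both assumptions):

```shell
# on dbserver1: stop the cluster so the files are consistent
pg_ctl stop -D /opt/PostgresPlus/9.2AS/data -m fast

# copy the data directory to the standby's (empty) data2 location
scp -r /opt/PostgresPlus/9.2AS/data/* \
    enterprisedb@192.168.160.150:/opt/PostgresPlus/9.2AS/data2/
```

Remember that the copied directory on dbserver2 must stay owned by the enterprisedb user with mode 0700, or the server will refuse to start.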
Step 5: Setup of Standby Server
Now copy postgresql.conf.dbserver2 as postgresql.conf on dbserver2. Then copy recovery.conf.dbserver1_is_master as recovery.conf.
Now start the db cluster on the secondary server first, to make sure it is not going to lose any transactions. You will see an error in the log complaining about the non-availability of the primary server, which can be ignored at this point. Confirm that the secondary database instance is up and running and that you can connect and fire read queries on the database.
Step 6: Start-up the Primary Server
Once sure, start the primary server. Check the primary and secondary servers: connect to each of them to confirm connectivity, role and read/write accessibility. You can use the "pg_ctl status" OS command and the pg_is_in_recovery() SQL function to confirm the status of each database.
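For example (a sketch; the psql invocations assume local connections as the enterprisedb user):

```shell
# on dbserver1
pg_ctl status -D /opt/PostgresPlus/9.2AS/data
psql -p 5444 -c "SELECT pg_is_in_recovery();"   # primary should return f

# on dbserver2
pg_ctl status -D /opt/PostgresPlus/9.2AS/data2
psql -p 5222 -c "SELECT pg_is_in_recovery();"   # standby should return t
```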
Now let’s try to do a failover.
Step 7: Create a Failure
So for that, first there has to be a failure. On the primary site, kill the postmaster process:
kill -9 <postmaster-pid>
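One way to find the postmaster's PID is to read the first line of postmaster.pid in the data directory (path from Step 1):

```shell
# on dbserver1: the postmaster PID is the first line of postmaster.pid
kill -9 "$(head -1 /opt/PostgresPlus/9.2AS/data/postmaster.pid)"
```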
Now connect to the secondary database and check whether you can do write operations, or simply check whether it is still in recovery state with the pg_is_in_recovery() function. So do you still get “t” as the output of that function? Are you still unable to do write operations on the secondary database?
Well, don’t be surprised: PostgreSQL does not do failover on its own. There have always been two schools of thought about any failover mechanism: one says “auto failover”, the other says “manual failover”. Given a choice, I always prefer a manual failover for a DR site. This ensures that my DR site does not mistake a network failure for a disaster. Moreover, the DR site is not just the database; one also needs to make sure the application and client connectivity are modified accordingly. Hence it’s best to keep it manual. Auto-failover is useful if you want to do hot streaming replication for HA (which I will discuss in a later post in this series).
Step 8: Switch the Secondary/Slave Database to Read/Write
On the secondary server, create the trigger file (as specified in recovery.conf):
touch /opt/PostgresPlus/9.2AS/data2/recover.trigger
Now connect again and check whether your current live sessions/new sessions to the secondary database can do write operations on the database. If yes, then great! You have just completed a successful failover.
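A quick sketch of such a check (the table name is purely illustrative):

```shell
psql -p 5222 -c "SELECT pg_is_in_recovery();"            # should now return f
psql -p 5222 -c "CREATE TABLE failover_check (id int);"  # should succeed on the promoted server
psql -p 5222 -c "DROP TABLE failover_check;"
```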
Step 9: Rebuilding the Master Database
Now we need to rebuild the master (assuming that the master database server is up).
First of all, clean up the database cluster on the old primary server:
rm -Rf /opt/PostgresPlus/9.2AS/data
Now take a backup from the new primary (dbserver2) to rebuild the master:
pg_basebackup -D /opt/PostgresPlus/9.2AS/data -h 192.168.160.150 -W -p 5222 -Fp --xlog-method=stream
Once the backup is complete, copy postgresql.conf.dbserver1 as postgresql.conf, then remove recovery.done and copy recovery.conf.dbserver2_is_master as recovery.conf.
Step 10: Start the Primary DB as the New Slave
Now start the db cluster on the master node. Once the start-up is successful, make sure everything is alright by connecting to the database and firing the below command:
SELECT pg_is_in_recovery(); -- expected output is “t”
Then fire the below command on dbserver1 and dbserver2; the output should be the same on both:
SELECT txid_current_snapshot();
Carefully inspect the log files on the secondary node (dbserver1) to confirm that recovery is in progress and there are no issues with the replication.
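For instance, one can tail the new standby's server log and look for the streaming-replication startup message (the pg_log directory name is the usual default, an assumption here):

```shell
# on dbserver1 (the new standby)
tail -f /opt/PostgresPlus/9.2AS/data/pg_log/*.log
# look for a line like:
#   LOG:  streaming replication successfully connected to primary
```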
I have dodged a few details in this post, like:
1. Details of the pg_basebackup command (to take a backup), which will be included in one of my upcoming posts in this series
2. Details of archive logging and its setup, which is going to be the next post