user2950, published on April 25, 2018, 12:28 pm

We have a master-slave replication setup that has been running stably for the last year.

Host systems: Debian Jessie

Database version: Postgres 9.4

Recently the master crashed or hung, and after restarting the Postgres services on the master, replication stopped working. The master itself seems to be running normally: except for the errors reported when it was restarted, no further errors are shown:

[10192-1] [unknown]@[unknown] LOG: incomplete startup packet
[10222-1] [unknown]@[unknown] LOG: incomplete startup packet
[10033-2] LOG: replication terminated by primary server
[10033-3] DETAIL: End of WAL reached on timeline 2 at 999/A5687790.
[1082-12] LOG: invalid record length at 999/A5687790
[10239-1] LOG: started streaming WAL from primary at 999/A5000000 on timeline 2
[1064-7] LOG: startup process (PID 1082) exited with exit code 1
[1064-8] LOG: terminating any other active server processes
[18749-1] readonly@pal WARNING: terminating connection because of crash of another server process
[25793-1] _readonly@pal WARNING: terminating connection because of crash of another server process

Since that crash of the Postgres master I have not been able to get the slave to start replicating.

Whenever I start the slave I get the following error messages:

[13247-2] HINT:  Future log output will go to log destination "syslog".
[13247-3] LOCATION:  PostmasterMain, postmaster.c:1228
[13248-1] LOG:  00000: database system was interrupted while in recovery at log time 2017-12-04 15:10:29 CET
[13248-2] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
[13248-3] LOCATION:  StartupXLOG, xlog.c:6134
[13248-4] LOG:  00000: entering standby mode
[13248-5] LOCATION:  StartupXLOG, xlog.c:6203
[13247-4] LOG:  00000: startup process (PID 13248) exited with exit code 1
[13247-5] LOCATION:  LogChildExit, postmaster.c:3452
[13247-6] LOG:  00000: aborting startup due to startup process failure

I've already tried a complete backup and resync procedure on the slave:

pg_basebackup -D /var/lib/postgresql/backups/fullbackup -R -h <IP> --checkpoint=fast --username=replic --xlog-method=stream

This completes without any error message. The recovery.conf also contains all the information it should, but the error message stays the same.
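For comparison, a full resync of a standby on Debian could be sketched roughly as follows. This is only a sketch under assumptions that are not stated in the post: the cluster is the Debian default "9.4 main", its data directory is /var/lib/postgresql/9.4/main, and the primary host name is a placeholder. The helper names run and resync_standby are made up for illustration. By default the script only prints the commands (DRY_RUN=1). In this sketch the base backup goes straight into the standby's data directory; if it is taken into a separate directory, as in the command above, its contents would still have to be moved into the data directory before the standby is started.

```shell
# Sketch of a full standby resync on Debian with PostgreSQL 9.4.
# Assumptions (not from the post): cluster "9.4 main", the data directory
# path, and the placeholder primary host. Run as the postgres user on the slave.
PGDATA="${PGDATA:-/var/lib/postgresql/9.4/main}"
PRIMARY_HOST="${PRIMARY_HOST:-primary.example.invalid}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    # Print the command; execute it only when DRY_RUN is not 1.
    echo "+ $*"
    [ "$DRY_RUN" = "1" ] || "$@"
}

resync_standby() {
    # Stop the standby, keep the old data directory aside, take a fresh
    # base backup (-R writes recovery.conf), then start the standby again.
    run pg_ctlcluster 9.4 main stop &&
    run mv "$PGDATA" "$PGDATA.old" &&
    run pg_basebackup -D "$PGDATA" -R -h "$PRIMARY_HOST" \
        --checkpoint=fast --username=replic --xlog-method=stream &&
    run pg_ctlcluster 9.4 main start
}
```

Calling resync_standby with the defaults prints the four commands without touching anything; setting DRY_RUN=0 would actually run them, so the paths should be checked first.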

cat recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=<user>  password=<passwd> host=IP port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
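A quick way to confirm that a recovery.conf carries the two settings shown above might look like this. It is a minimal sketch: check_recovery_conf is a hypothetical helper name, not a PostgreSQL tool, and it only checks that the keys are present, not that their values are valid.

```shell
# Sketch: check that a recovery.conf defines the two settings shown above.
# check_recovery_conf is a hypothetical helper, not part of PostgreSQL.
check_recovery_conf() {
    conf="$1"
    for key in standby_mode primary_conninfo; do
        # Look for "key =" at the start of a line, allowing leading whitespace.
        if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$conf"; then
            echo "missing: ${key}"
            return 1
        fi
    done
    echo "ok"
}
```

It would be called with the path of the file to inspect, for example check_recovery_conf recovery.conf from inside the directory that holds it.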

Does this mean that the corruption is on the master and that it needs to be restored to a point before it crashed? I'm not sure what I can do to get the replication working again. Any ideas?

