Hi,
I am doing some testing with pg_rewind and I encountered a very strange
situation. So, let me describe the scenario:
1. Create a master (server1) node (max_wal_size = 2GB,
checkpoint_timeout = 10min)
2. Create a test database on it (server1)
3. Create a test table (server1)
4. Insert a couple of million rows ( >> 5GB) but cancel the query at
some point leaving a few gigs of useless WAL files (server1)
5. Insert a couple of thousand rows ( < 1GB) (server1)
6. Build a slave (server2) and connect to the master (server1)
7. Promote the slave (server2 now master)
8. Insert one row on the old master (server1)
9. Shutdown the old master (server1)
10. Issue a pg_rewind on the old master (server1)
11. Configure recovery.conf and start the new slave (server1)
All of this under 10 minutes :)
The timeline switch is at:
1 1/B2A64F80 no recovery target specified
The pg_controldata on the slave (server1) gives:
Latest checkpoint location: 1/B3000108
Prior checkpoint location: 1/B3000060
Latest checkpoint's REDO location: 1/B30000D0
Latest checkpoint's REDO WAL file: 0000000200000001000000B3
But when starting the slave i get the following errors in the log file:
2016-11-28 15:38:44 UTC: WAL 00000002.history successfully restored
2016-11-28 15:38:43 UTC [6]: [14-1] user=,db= LOG: restored log file
"00000002.history" from archive
2016-11-28 15:38:44 UTC: Error copying WAL 00000002000000010000009D!
2016-11-28 15:38:44 UTC: Error copying WAL 00000001000000010000009D!
2016-11-28 15:38:44 UTC [494]: [1-1] user=,db= LOG: started streaming
WAL from primary at 1/9D000000 on timeline 2
2016-11-28 15:38:44 UTC [494]: [2-1] user=,db= FATAL: could not receive
data from WAL stream: ERROR: requested WAL segment
00000002000000010000009D has already been removed
So the problem is that the slave (server1) went back in time, to a
location that is not on the new master (first WAL available on the new
master is 0000000100000001000000AE). The old archive on the slave
(server1) is gone (deleted) so there is no way to fix this. And this is
not a problem per se, because you can use pg_basebackup or rsync instead
of pg_rewind. The real problem is that I can't diagnose this problem
until I start the server process and read the logs. Shouldn't
pg_controldata report the first WAL needed to start the instance?
Regards,
Mladen Marinović
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general