Alvaro Herrera wrote:

> Anyway here's a quick script to almost-reproduce the problem.

Meh. Really attached now.

I also wanted to post the error messages we got:

2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "00000001000073DD000000AD" from archive
2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559
2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at offset 147456: Success.
2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1
2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure

We pg_xlogdumped the offending segment and see that there is a
checkpoint record with oldestMulti=4624559. Curiously, there are no
other records with rmgr=MultiXact in that segment.

Note that the file exists (otherwise the error would say "could not
open" rather than "could not read"); this is also why errno says
"Success" rather than some actual error; this is the code:

    errno = 0;
    if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
    {
        slru_errcause = SLRU_READ_FAILED;
        slru_errno = errno;
        CloseTransientFile(fd);
        return false;
    }

My guess is that the file existed, and perhaps had one or more pages,
but the wanted page doesn't exist, so we tried to read but got 0 bytes
back. read() returns 0 in this case but doesn't set errno.

I didn't find a way to set things so that the file exists but is of
shorter contents than oldestMulti by the time the checkpoint record is
replayed.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
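The guess about read() above can be checked outside PostgreSQL with a small standalone sketch. This is not the slru.c code: the helper name read_page_at and the constant DEMO_BLCKSZ are made up for illustration, and 8192 is only assumed to match the default BLCKSZ. The function writes a one-page temporary file, seeks to a given offset, clears errno, and reports what read() returns.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for BLCKSZ; 8192 is assumed, not taken from the report. */
#define DEMO_BLCKSZ 8192

/*
 * Create a temporary file holding exactly one page, seek to "offset",
 * and attempt to read one page from there.  Returns whatever read()
 * returned and stores the errno value seen right after the call in
 * *saved_errno.  Returns -2 if the setup steps failed.
 */
static ssize_t
read_page_at(off_t offset, int *saved_errno)
{
    char    path[] = "/tmp/short-page-file-XXXXXX";
    char    page[DEMO_BLCKSZ] = {0};
    ssize_t nread;
    int     fd;

    assert(saved_errno != NULL);

    fd = mkstemp(path);
    if (fd < 0)
        return -2;
    unlink(path);               /* file goes away once fd is closed */

    if (write(fd, page, DEMO_BLCKSZ) != DEMO_BLCKSZ ||
        lseek(fd, offset, SEEK_SET) < 0)
    {
        close(fd);
        return -2;
    }

    errno = 0;                  /* same pattern as the quoted snippet */
    nread = read(fd, page, DEMO_BLCKSZ);
    *saved_errno = errno;

    close(fd);
    return nread;
}
```

Calling read_page_at(147456, &e) on such a one-page file should give a return value of 0 with e left at 0, since reading at or past end-of-file is not an error. On glibc, strerror(0) is "Success", which would match the wording in the DETAIL line.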
Attachment:
repro-chkpt-replay-failure.sh
Description: Bourne shell script