On 27 December 2010 18:24, hubert depesz lubaczewski <depesz@xxxxxxxxxx> wrote:
> hi,
> this mail will be relatively long, because I need to explain in detail
> what/how I do.
>
> Simple summary:
>
> when restoring from a hot backup, without xlogs in pg_xlog/, but instead
> using recovery.conf to get xlogs from the wal archive, I get "WAL ends
> before consistent recovery point" in a case where it shouldn't happen.
>
> Details:
>
> I have the following setup:
>
> master pg is sending wal segments to 2 places:
> - /shared/walarchive - long-term storage, compressed content
> - /shared/4backups - short-term storage, just to do hot backups on the slave, not compressed
>
> for sending I use omnipitr-archive, with 2 "-dr" options
>
> slave pg is configured to read wal segments from /shared/4backups (using `cp -i
> ...`) *and* then to switch to streaming replication.
>
> additionally, on the slave there is an archive cleanup program, which cleans
> /shared/4backups, but only if the /flags/backup file is not present (it is
> only present while a backup is happening).
>
> to make backups I use omnipitr-backup-slave, which does:
>
> 1. creates the /flags/backup file
> 2. gets a snapshot of pg_controldata
> 3. makes backup_label in $PGDATA, where "START WAL LOCATION" is taken from
> "Latest checkpoint's REDO location" from step #2, and "CHECKPOINT
> LOCATION" comes from "Latest checkpoint location" in #2
> 4. compresses the whole $PGDATA
> 5. gets a snapshot of pg_controldata
> 6. waits until a new snapshot of pg_controldata (it takes another one every 5
> seconds) has a different value in "Latest checkpoint location" than the
> snapshot from step #5
> 7. makes a .backup file in /shared/4backups that contains the same
> information as #3, but with "STOP WAL LOCATION" taken from "Latest
> checkpoint location" from #6
> 8. compresses /shared/4backups to a secondary backup file
>
> afterwards I get 2 files - data.tar.gz and xlog.tar.gz.
>
> Now. I got those files, moved them to a 3rd server, and uncompressed them.
>
> Afterwards I had a PGDATA directory, with xlogs in pg_xlog, and no
> recovery.conf.
>
> When starting from this backup - pg starts, reads all necessary segments
> from pg_xlog, finishes recovery, and starts up as a standalone without
> any problem.
>
> but when I, after extraction and before starting, remove all xlog files,
> move the .backup file from pg_xlog to /shared/walarchive, and set up
> recovery.conf using normal omnipitr-restore, to get wal segments from
> /shared/walarchive - pg starts to behave weirdly.
>
> It starts. It reads all segments, from the first one mentioned in backup_label
> to the latest available in /shared/walarchive, and then it fails with
> "WAL ends before consistent recovery point" - but this is *well* past
> the point at which it should know that the state is consistent!
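For context, the failing restore described above corresponds to a recovery.conf
along these lines. This is a minimal sketch, not the actual file: the
gunzip-based restore_command is an assumption standing in for the real
omnipitr-restore invocation, based only on /shared/walarchive holding
compressed segments:

    # hypothetical recovery.conf on the 3rd server; assumed path and
    # decompression (the report actually uses omnipitr-restore here)
    restore_command = 'gunzip -c /shared/walarchive/%f.gz > %p'

With that in place, PostgreSQL fetches each segment it needs through
restore_command instead of reading it from pg_xlog.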
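Steps 5-7 above amount to a wait-for-new-checkpoint loop around
pg_controldata. A bash sketch under stated assumptions (the get_ckpt helper
is invented, paths are taken from the report, and the real tool is
omnipitr's own code, not this):

    # print the "Latest checkpoint location" field from pg_controldata
    get_ckpt() {
        /usr/pgsql-9.0/bin/pg_controldata "$PGDATA" |
            awk -F': *' '/^Latest checkpoint location/ { print $2 }'
    }

    baseline=$(get_ckpt)            # snapshot taken in step #5
    while [ "$(get_ckpt)" = "$baseline" ]; do
        sleep 5                     # poll every 5 seconds, as in step #6
    done
    get_ckpt                        # new value; becomes "STOP WAL LOCATION" in step #7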
>
> Files content from my situation:
>
> =$ cat backup_label.old
> START WAL LOCATION: 33/2213B200 (file 000000010000003300000022)
> CHECKPOINT LOCATION: 33/23101410
> START TIME: 2010-12-27 11:53:44 GMT
> LABEL: OmniPITR_Slave_Hot_Backup
>
> (the file already got renamed to .old by the pg recovery process)
>
> =$ zcat /mnt/db/prod/walarchive/000000010000003300000022.0013B200.backup.gz
> START WAL LOCATION: 33/2213B200 (file 000000010000003300000022)
> STOP WAL LOCATION: 33/2C24C800 (file 00000001000000330000002C)
> CHECKPOINT LOCATION: 33/23101410
> START TIME: 2010-12-27 11:53:44 GMT
> START TIME: 2010-12-27 12:06:17 GMT
> LABEL: OmniPITR_Slave_Hot_Backup
>
> last 20 lines from the log:
>
> =$ tail -n 20 pg_log/postgresql-2010-12-27_172759.log
> 2010-12-27 17:44:16 UTC [15768]: [1934-1] LOG: restored log file "000000010000003A000000B1" from archive
> 2010-12-27 17:44:16 UTC [15768]: [1935-1] LOG: restored log file "000000010000003A000000B2" from archive
> 2010-12-27 17:44:16 UTC [15768]: [1936-1] LOG: restored log file "000000010000003A000000B3" from archive
> 2010-12-27 17:44:17 UTC [15768]: [1937-1] LOG: restored log file "000000010000003A000000B4" from archive
> 2010-12-27 17:44:17 UTC [15768]: [1938-1] LOG: restored log file "000000010000003A000000B5" from archive
> 2010-12-27 17:44:18 UTC [15768]: [1939-1] LOG: restored log file "000000010000003A000000B6" from archive
> 2010-12-27 17:44:18 UTC [15768]: [1940-1] LOG: restored log file "000000010000003A000000B7" from archive
> 2010-12-27 17:44:19 UTC [15768]: [1941-1] LOG: restored log file "000000010000003A000000B8" from archive
> 2010-12-27 17:44:19 UTC [15768]: [1942-1] LOG: restored log file "000000010000003A000000B9" from archive
> 2010-12-27 17:44:20 UTC [15768]: [1943-1] LOG: restored log file "000000010000003A000000BA" from archive
> 2010-12-27 17:44:20 UTC [15768]: [1944-1] LOG: restored log file "000000010000003A000000BB" from archive
> 2010-12-27 17:44:21 UTC [15768]: [1945-1] LOG: restored log file "000000010000003A000000BC" from archive
> 2010-12-27 17:44:21 UTC [15768]: [1946-1] LOG: restored log file "000000010000003A000000BD" from archive
> 2010-12-27 17:44:23 UTC [15768]: [1947-1] LOG: could not open file "pg_xlog/000000010000003A000000BE" (log file 58, segment 190): No such file or directory
> 2010-12-27 17:44:23 UTC [15768]: [1948-1] LOG: redo done at 3A/BDFFF6E8
> 2010-12-27 17:44:23 UTC [15768]: [1949-1] LOG: last completed transaction was at log time 2010-12-27 17:43:23.558028+00
> 2010-12-27 17:44:23 UTC [15768]: [1950-1] LOG: restored log file "000000010000003A000000BD" from archive
> 2010-12-27 17:44:23 UTC [15768]: [1951-1] FATAL: WAL ends before consistent recovery point
> 2010-12-27 17:44:23 UTC [15765]: [1-1] LOG: startup process (PID 15768) exited with exit code 1
> 2010-12-27 17:44:23 UTC [15765]: [2-1] LOG: terminating any other active server processes
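A side note on the history-file name shown above: the <walfile>.<offset>.backup
naming is PostgreSQL's standard convention. With 16 MB segments, a location
HI/LO falls in segment LO >> 24 of log file HI, at byte offset LO & 0xFFFFFF,
so the name can be checked against the reported start location with a couple
of lines of bash:

    lo=$(( 16#2213B200 ))   # the LO half of START WAL LOCATION 33/2213B200
    printf '0000000100000033%08X.%08X.backup\n' $(( lo >> 24 )) $(( lo & 16#FFFFFF ))
    # prints 000000010000003300000022.0013B200.backup, matching the file above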
>
> First 50 lines from the log:
>
> =$ head -n 50 pg_log/postgresql-2010-12-27_172759.log
> 2010-12-27 17:27:59 UTC [15768]: [1-1] LOG: database system was interrupted while in recovery at log time 2010-12-27 11:54:55 UTC
> 2010-12-27 17:27:59 UTC [15768]: [2-1] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
> 2010-12-27 17:27:59 UTC [15768]: [3-1] LOG: starting archive recovery
> 2010-12-27 17:28:00 UTC [15768]: [4-1] LOG: restored log file "000000010000003300000023" from archive
> 2010-12-27 17:28:00 UTC [15768]: [5-1] LOG: restored log file "000000010000003300000022" from archive
> 2010-12-27 17:28:00 UTC [15768]: [6-1] LOG: redo starts at 33/2213B200
> 2010-12-27 17:28:01 UTC [15768]: [7-1] LOG: restored log file "000000010000003300000023" from archive
> 2010-12-27 17:28:01 UTC [15768]: [8-1] LOG: restored log file "000000010000003300000024" from archive
> 2010-12-27 17:28:01 UTC [15768]: [9-1] LOG: restored log file "000000010000003300000025" from archive
> 2010-12-27 17:28:02 UTC [15768]: [10-1] LOG: restored log file "000000010000003300000026" from archive
> 2010-12-27 17:28:02 UTC [15768]: [11-1] LOG: restored log file "000000010000003300000027" from archive
> 2010-12-27 17:28:03 UTC [15768]: [12-1] LOG: restored log file "000000010000003300000028" from archive
> 2010-12-27 17:28:03 UTC [15768]: [13-1] LOG: restored log file "000000010000003300000029" from archive
> 2010-12-27 17:28:03 UTC [15768]: [14-1] LOG: restored log file "00000001000000330000002A" from archive
> 2010-12-27 17:28:04 UTC [15768]: [15-1] LOG: restored log file "00000001000000330000002B" from archive
> 2010-12-27 17:28:04 UTC [15768]: [16-1] LOG: restored log file "00000001000000330000002C" from archive
> 2010-12-27 17:28:04 UTC [15768]: [17-1] LOG: restored log file "00000001000000330000002D" from archive
> 2010-12-27 17:28:05 UTC [15768]: [18-1] LOG: restored log file "00000001000000330000002E" from archive
> 2010-12-27 17:28:05 UTC [15768]: [19-1] LOG: restored log file "00000001000000330000002F" from archive
> 2010-12-27 17:28:06 UTC [15768]: [20-1] LOG: restored log file "000000010000003300000030" from archive
> 2010-12-27 17:28:06 UTC [15768]: [21-1] LOG: restored log file "000000010000003300000031" from archive
> 2010-12-27 17:28:07 UTC [15768]: [22-1] LOG: restored log file "000000010000003300000032" from archive
> 2010-12-27 17:28:07 UTC [15768]: [23-1] LOG: restored log file "000000010000003300000033" from archive
> 2010-12-27 17:28:08 UTC [15768]: [24-1] LOG: restored log file "000000010000003300000034" from archive
> 2010-12-27 17:28:08 UTC [15768]: [25-1] LOG: restored log file "000000010000003300000035" from archive
> 2010-12-27 17:28:08 UTC [15768]: [26-1] LOG: restored log file "000000010000003300000036" from archive
> 2010-12-27 17:28:09 UTC [15768]: [27-1] LOG: restored log file "000000010000003300000037" from archive
> 2010-12-27 17:28:09 UTC [15768]: [28-1] LOG: restored log file "000000010000003300000038" from archive
> 2010-12-27 17:28:10 UTC [15768]: [29-1] LOG: restored log file "000000010000003300000039" from archive
> 2010-12-27 17:28:10 UTC [15768]: [30-1] LOG: restored log file "00000001000000330000003A" from archive
> 2010-12-27 17:28:11 UTC [15768]: [31-1] LOG: restored log file "00000001000000330000003B" from archive
> 2010-12-27 17:28:11 UTC [15768]: [32-1] LOG: restored log file "00000001000000330000003C" from archive
> 2010-12-27 17:28:12 UTC [15768]: [33-1] LOG: restored log file "00000001000000330000003D" from archive
> 2010-12-27 17:28:12 UTC [15768]: [34-1] LOG: restored log file "00000001000000330000003E" from archive
> 2010-12-27 17:28:12 UTC [15768]: [35-1] LOG: restored log file "00000001000000330000003F" from archive
> 2010-12-27 17:28:13 UTC [15768]: [36-1] LOG: restored log file "000000010000003300000040" from archive
> 2010-12-27 17:28:13 UTC [15768]: [37-1] LOG: restored log file "000000010000003300000041" from archive
> 2010-12-27 17:28:14 UTC [15768]: [38-1] LOG: restored log file "000000010000003300000042" from archive
> 2010-12-27 17:28:14 UTC [15768]: [39-1] LOG: restored log file "000000010000003300000043" from archive
> 2010-12-27 17:28:14 UTC [15768]: [40-1] LOG: restored log file "000000010000003300000044" from archive
> 2010-12-27 17:28:15 UTC [15768]: [41-1] LOG: restored log file "000000010000003300000045" from archive
> 2010-12-27 17:28:15 UTC [15768]: [42-1] LOG: restored log file "000000010000003300000046" from archive
> 2010-12-27 17:28:15 UTC [15768]: [43-1] LOG: restored log file "000000010000003300000047" from archive
> 2010-12-27 17:28:16 UTC [15768]: [44-1] LOG: restored log file "000000010000003300000048" from archive
> 2010-12-27 17:28:16 UTC [15768]: [45-1] LOG: restored log file "000000010000003300000049" from archive
> 2010-12-27 17:28:16 UTC [15768]: [46-1] LOG: restored log file "00000001000000330000004A" from archive
> 2010-12-27 17:28:17 UTC [15768]: [47-1] LOG: restored log file "00000001000000330000004B" from archive
> 2010-12-27 17:28:17 UTC [15768]: [48-1] LOG: restored log file "00000001000000330000004C" from archive
> 2010-12-27 17:28:18 UTC [15768]: [49-1] LOG: restored log file "00000001000000330000004D" from archive
> 2010-12-27 17:28:18 UTC [15768]: [50-1] LOG: restored log file "00000001000000330000004E" from archive
>
> pg_controldata:
>
> =$ /usr/pgsql-9.0/bin/pg_controldata .
> pg_control version number: 903
> Catalog version number: 201008051
> Database system identifier: 5554952960675921391
> Database cluster state: in archive recovery
> pg_control last modified: Mon 27 Dec 2010 05:43:36 PM GMT
> Latest checkpoint location: 3A/6377BB8
> Prior checkpoint location: 37/7D93A540
> Latest checkpoint's REDO location: 39/EB0350B8
> Latest checkpoint's TimeLineID: 1
> Latest checkpoint's NextXID: 0/1359777
> Latest checkpoint's NextOID: 294922
> Latest checkpoint's NextMultiXactId: 10752
> Latest checkpoint's NextMultiOffset: 30580
> Latest checkpoint's oldestXID: 654
> Latest checkpoint's oldestXID's DB: 1
> Latest checkpoint's oldestActiveXID: 1306958
> Time of latest checkpoint: Mon 27 Dec 2010 04:50:53 PM GMT
> Minimum recovery ending location: 3A/B7895958
> Backup start location: 33/2213B200
> Current wal_level setting: hot_standby
> Current max_connections setting: 1024
> Current max_prepared_xacts setting: 10
> Current max_locks_per_xact setting: 64
> Maximum data alignment: 8
> Database block size: 8192
> Blocks per segment of large relation: 131072
> WAL block size: 8192
> Bytes per WAL segment: 16777216
> Maximum length of identifiers: 64
> Maximum columns in an index: 32
> Maximum size of a TOAST chunk: 1996
> Date/time type storage: 64-bit integers
> Float4 argument passing: by value
> Float8 argument passing: by value
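Worth noting in the pg_controldata output: the "Minimum recovery ending
location" (3A/B7895958) is numerically below where redo ended (3A/BDFFF6E8,
per the log tail earlier), which is the contradiction depesz is pointing at.
The comparison itself is plain 64-bit arithmetic; a small bash sketch (the
lsn helper name is invented):

    # fold an HI/LO WAL location into one 64-bit number
    lsn() { echo $(( ( 16#${1%%/*} << 32 ) | 16#${1##*/} )); }

    if [ "$(lsn 3A/BDFFF6E8)" -ge "$(lsn 3A/B7895958)" ]; then
        echo "redo ended at or past the minimum recovery ending location"
    fi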
>
> Lines from the log that are *not* about restored log files:
>
> =$ grep -nvE 'LOG: restored log file' pg_log/postgresql-2010-12-27_172759.log
> 1:2010-12-27 17:27:59 UTC [15768]: [1-1] LOG: database system was interrupted while in recovery at log time 2010-12-27 11:54:55 UTC
> 2:2010-12-27 17:27:59 UTC [15768]: [2-1] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
> 3:2010-12-27 17:27:59 UTC [15768]: [3-1] LOG: starting archive recovery
> 6:2010-12-27 17:28:00 UTC [15768]: [6-1] LOG: redo starts at 33/2213B200
> 581:2010-12-27 17:33:00 UTC [15781]: [1-1] LOG: restartpoint starting: time
> 666:2010-12-27 17:33:47 UTC [15781]: [2-1] LOG: restartpoint complete: wrote 9775 buffers (14.9%); write=38.593 s, sync=5.409 s, total=46.922 s
> 667:2010-12-27 17:33:47 UTC [15781]: [3-1] LOG: recovery restart point at 35/13010BE8
> 668:2010-12-27 17:33:47 UTC [15781]: [4-1] DETAIL: last completed transaction was at log time 2010-12-27 16:00:12.119251+00
> 1172:2010-12-27 17:38:00 UTC [15781]: [5-1] LOG: restartpoint starting: time
> 1276:2010-12-27 17:38:47 UTC [15781]: [6-1] LOG: restartpoint complete: wrote 10779 buffers (16.4%); write=44.044 s, sync=2.710 s, total=47.082 s
> 1277:2010-12-27 17:38:47 UTC [15781]: [7-1] LOG: recovery restart point at 37/5F01F358
> 1278:2010-12-27 17:38:47 UTC [15781]: [8-1] DETAIL: last completed transaction was at log time 2010-12-27 16:32:19.637622+00
> 1787:2010-12-27 17:43:00 UTC [15781]: [9-1] LOG: restartpoint starting: time
> 1859:2010-12-27 17:43:36 UTC [15781]: [10-1] LOG: restartpoint complete: wrote 6314 buffers (9.6%); write=24.869 s, sync=9.906 s, total=36.024 s
> 1860:2010-12-27 17:43:36 UTC [15781]: [11-1] LOG: recovery restart point at 39/EB0350B8
> 1861:2010-12-27 17:43:36 UTC [15781]: [12-1] DETAIL: last completed transaction was at log time 2010-12-27 17:07:35.913562+00
> 1959:2010-12-27 17:44:23 UTC [15768]: [1947-1] LOG: could not open file "pg_xlog/000000010000003A000000BE" (log file 58, segment 190): No such file or directory
> 1960:2010-12-27 17:44:23 UTC [15768]: [1948-1] LOG: redo done at 3A/BDFFF6E8
> 1961:2010-12-27 17:44:23 UTC [15768]: [1949-1] LOG: last completed transaction was at log time 2010-12-27 17:43:23.558028+00
> 1963:2010-12-27 17:44:23 UTC [15768]: [1951-1] FATAL: WAL ends before consistent recovery point
> 1964:2010-12-27 17:44:23 UTC [15765]: [1-1] LOG: startup process (PID 15768) exited with exit code 1
> 1965:2010-12-27 17:44:23 UTC [15765]: [2-1] LOG: terminating any other active server processes
>
> Since pg on its own starts from this backup, and we use
> omnipitr-backup-slave on many hosts, and there has never been a problem
> with it, I'm reluctant to assume the bug is in there.
>
> so, that leaves either pebkac or some bug in pg?
>
> Best regards,
>
> depesz

Depesz, did you ever resolve this?

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935