warm standby resume and take online problems

Michal Bicz <michal.bicz@xxxxxxxxxxxxxxx> · Wed, 4 Nov 2009 07:32:33 -0800

Hi,

I have a chain of warm standby servers.
One, let's say db-01, pushes updates to db-02, and from there they are fetched to db-03.
I decided to bring db-04 online, so I stopped the warm standby on db-03 with
pg_ctl stop -m fast -D $PG_DATA
and copied the data over from db-03 to db-04.

So now I have a backup ("data + binaries") that was taken from the warm standby while it was shut down.

I created recovery.conf with a restore_command, created recovery.sh (the script that the restore command runs), and adjusted postgresql.conf with the appropriate port and IP.

recovery.sh is just a blind 'while' loop that waits for the trigger file and then exits.
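
For comparison, here is a minimal sketch of what a restore helper along these lines might look like if it also copied the requested segment when one is available. It is not the recovery.sh described above. The function name `restore_segment`, its arguments, and the polling interval are hypothetical, and it assumes the server passes the WAL file name (%f) and destination path (%p) through restore_command.

```python
import os
import shutil
import time


def restore_segment(archive_dir, wal_name, dest_path, trigger_path,
                    poll_seconds=5.0):
    """Wait for a WAL file to appear in archive_dir, or for a trigger file.

    Returns 0 after copying the file to dest_path, or 1 if the trigger
    file is found first. The caller is expected to use the return value
    as the process exit status.
    """
    source = os.path.join(archive_dir, wal_name)
    while True:
        # Prefer restoring an available segment over ending recovery.
        if os.path.exists(source):
            shutil.copy(source, dest_path)
            return 0
        # Trigger file present and nothing left to restore: stop waiting.
        if os.path.exists(trigger_path):
            return 1
        time.sleep(poll_seconds)
```

A wrapper script would call `sys.exit(restore_segment(...))` with the four values taken from its command line. A non-zero exit status is how a restore command tells the server that the requested file is not available.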

So I started.

First I removed everything from pg_xlog on the backup that is going to go live.

pg_controldata output:
pg_control version number:            822
Catalog version number:               200611241
Database system identifier:           5309237009736268543
Database cluster state:               in archive recovery
pg_control last modified:             Thu Oct 29 11:30:04 2009
Current log file ID:                  389
Next log file segment:                225
Latest checkpoint location:           2FA/BBA6B710
Prior checkpoint location:            2FA/AE916D60
Latest checkpoint's REDO location:    2FA/BBA38478
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          3/824035978
Latest checkpoint's NextOID:          59442871
Latest checkpoint's NextMultiXactId:  510637
Latest checkpoint's NextMultiOffset:  2076981
Time of latest checkpoint:            Thu Oct 29 09:02:31 2009
Minimum recovery ending location:     186/80DCC48
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           en_US.UTF-8
LC_CTYPE:                             en_US.UTF-8

First start (no WAL files in the wal_recovery directory):
2009-11-01 16:09:10 PST : LOG:  could not open file "pg_xlog/00000001000002FA000000BB" (log file 762, segment 187): No such file or directory
2009-11-01 16:09:10 PST : LOG:  invalid primary checkpoint record
2009-11-01 16:09:10 PST : LOG:  could not open file "pg_xlog/00000001000002FA000000AE" (log file 762, segment 174): No such file or directory
2009-11-01 16:09:10 PST : LOG:  invalid secondary checkpoint record
2009-11-01 16:09:10 PST : PANIC:  could not locate a valid checkpoint record
2009-11-01 16:09:10 PST : LOG:  startup process (PID 1651) was terminated by signal 6
2009-11-01 16:09:10 PST : LOG:  aborting startup due to startup process failure
2009-11-01 16:09:10 PST : LOG:  logger shutting down
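
The file names in these messages can be derived from the locations in the pg_controldata output. The sketch below shows that arithmetic, assuming the usual naming scheme of timeline, log file ID, and segment number as three 8-digit hex fields, and the 16 MB segment size reported above. The function name `wal_segment_name` is made up for illustration.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # "Bytes per WAL segment: 16777216"


def wal_segment_name(timeline, lsn, segment_size=WAL_SEGMENT_SIZE):
    """Map a location such as '2FA/BBA6B710' to its WAL segment file name."""
    log_hex, offset_hex = lsn.split("/")
    log_id = int(log_hex, 16)
    # The segment number is the byte offset within the log file,
    # divided by the segment size.
    segment = int(offset_hex, 16) // segment_size
    return "%08X%08X%08X" % (timeline, log_id, segment)


for label, lsn in [
    ("Latest checkpoint location", "2FA/BBA6B710"),
    ("Prior checkpoint location", "2FA/AE916D60"),
    ("Minimum recovery ending location", "186/80DCC48"),
]:
    print("%-34s %-14s %s" % (label, lsn, wal_segment_name(1, lsn)))
```

The first two locations map to 00000001000002FA000000BB and 00000001000002FA000000AE, the two files named in the log above. The third maps to 000000010000018600000008.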

I then copied every segment from AE through BB into wal_recovery.
It started in recovery mode, asking for more WAL files.
I started applying WAL files and everything was OK, with recovery in progress.
When I had fed it files up to ..2FB.08 (around the time the original data directory was copied from the warm standby server) and triggered it, it came up online.
I can connect and select from some tables, but when I ran a select on logging.agentpagehit (35GB+) it crashed.
It printed this on the console:

saturn=# select count(*) from logging.agentpagehit;
ERROR:  xlog flush request 2FB/45E1B8D0 is not satisfied --- flushed only to 2FB/8FFEA60
CONTEXT:  writing block 874822 of relation 1663/20863/21548

Now it logs this constantly:

2009-11-04 04:57:39 PST : ERROR:  XX000: xlog flush request 2FB/28CE63A8 is not satisfied --- flushed only to 2FB/8FFEA60
2009-11-04 04:57:39 PST : CONTEXT:  writing block 874937 of relation 1663/20863/21548
2009-11-04 04:57:39 PST : LOCATION:  XLogFlush, xlog.c:1865
2009-11-04 04:57:39 PST : WARNING:  58030: could not write block 874937 of 1663/20863/21548
2009-11-04 04:57:39 PST : DETAIL:  Multiple failures --- write error may be permanent.
2009-11-04 04:57:39 PST : LOCATION:  AbortBufferIO, bufmgr.c:2129
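
One way to read these messages is to compare the two positions they mention. The sketch below only does the arithmetic on the values from the first error, assuming the 16 MB segment size shown in the pg_controldata output. The helper names `parse_lsn` and `describe` are hypothetical, and the result does not by itself explain why the data page carries the later position.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # from "Bytes per WAL segment" above


def parse_lsn(lsn):
    """Split 'LOGID/OFFSET' (both hex) into a (log_id, offset) tuple."""
    log_hex, offset_hex = lsn.split("/")
    return int(log_hex, 16), int(offset_hex, 16)


def describe(lsn):
    """Render a location in the 'log file N, segment M' form the server uses."""
    log_id, offset = parse_lsn(lsn)
    return "log file %d, segment %d" % (log_id, offset // WAL_SEGMENT_SIZE)


requested = "2FB/45E1B8D0"  # position named in "xlog flush request"
flushed = "2FB/8FFEA60"     # position named in "flushed only to"

print("requested:", describe(requested))
print("flushed:  ", describe(flushed))
# Tuples compare log_id first, then offset, which matches WAL ordering.
print("requested is ahead:", parse_lsn(requested) > parse_lsn(flushed))
```

With these values the requested position falls in segment 69 (hex 45) of log file 763 (hex 2FB), while the flushed position is in segment 8 of the same log file, which matches the last file applied before the trigger.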

What am I missing?
- Should I supply more WAL files from the past/future (and if from the future, up to what point)?
- Did the first start without WAL files break it?
- Did starting without the pg_xlog files break it?
- According to some post on the Web, "Minimum recovery ending location:     186/80DCC48" means I should supply WAL files starting from 188..80. Is this correct?

I haven't yet checked which file it asks for first (%f) when started without any WAL files in wal_recovery. I will know in a few hours, as I am copying the data over once again.

Any thoughts?

Michal

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)