Caveat #2 applies here: http://developer.postgresql.org/pgdocs/postgres/hot-standby.html#HOT-STANDBY-CAVEATS
The consistent state is delayed until your long-running transactions
end, which is workload-dependent but transient.
I'm not quite sure how this correlates with what I'm seeing in 9.1.1. When I try to start a hot standby while the primary is under load, the standby seems to get stuck in the 'starting up' state even when there are no open transactions. Once it is in that state, removing all load from the primary doesn't help: the standby still never finishes starting up. Selecting from pg_stat_activity on the primary shows a couple of connections, but they all have a NULL xact_start and an <IDLE> current_query. Even if I then kill every connection to the primary (via pg_terminate_backend), the hot standby still won't finish starting up, and I would assume there can't be any transactions in progress when there are no connections to the primary at all. Restarting the hot standby while it is in this state produces the same result.

If a transaction is in progress for the duration of the base backup (or something along those lines), can that cause the standby to get stuck like this?
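For reference, the check-and-kill sequence I'm running on the primary is roughly the following (just a sketch using the 9.1 pg_stat_activity column names, procpid / xact_start / current_query; the exact sessions obviously vary):

  -- look for anything on the primary that still holds an open transaction
  SELECT procpid, usename, xact_start, current_query
    FROM pg_stat_activity
   WHERE xact_start IS NOT NULL;

  -- then terminate everything except my own session (superuser required)
  SELECT pg_terminate_backend(procpid)
    FROM pg_stat_activity
   WHERE procpid <> pg_backend_pid();

Either way the standby never gets past "the database system is starting up".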
Here is a debug3 log of this startup from 9.1.1:
2011-10-26 11:25:32.931 MDT [22640]: [1-1] LOG: database system was shut down in recovery at 2011-10-26 11:25:02 MDT
2011-10-26 11:25:32.931 MDT [22640]: [2-1] DEBUG: standby_mode = 'on'
2011-10-26 11:25:32.931 MDT [22640]: [3-1] DEBUG: primary_conninfo = 'host=othernode port=5432 user=postgres application_name=sync_rep_test'
2011-10-26 11:25:32.931 MDT [22640]: [4-1] DEBUG: trigger_file = '/db/data/trigger_file'
2011-10-26 11:25:32.931 MDT [22640]: [5-1] LOG: entering standby mode
2011-10-26 11:25:32.931 MDT [22640]: [6-1] DEBUG: checkpoint record is at 7/96667628
2011-10-26 11:25:32.931 MDT [22640]: [7-1] DEBUG: redo record is at 7/960000A8; shutdown FALSE
2011-10-26 11:25:32.931 MDT [22640]: [8-1] DEBUG: next transaction ID: 0/11855637; next OID: 3219844
2011-10-26 11:25:32.931 MDT [22640]: [9-1] DEBUG: next MultiXactId: 1895; next MultiXactOffset: 3810
2011-10-26 11:25:32.931 MDT [22640]: [10-1] DEBUG: oldest unfrozen transaction ID: 1669, in database 1
2011-10-26 11:25:32.931 MDT [22640]: [11-1] DEBUG: transaction ID wrap limit is 2147485316, limited by database with OID 1
2011-10-26 11:25:32.932 MDT [22640]: [12-1] DEBUG: resetting unlogged relations: cleanup 1 init 0
2011-10-26 11:25:32.939 MDT [22640]: [13-1] DEBUG: initializing for hot standby
2011-10-26 11:25:32.939 MDT [22640]: [14-1] LOG: redo starts at 7/960000A8
2011-10-26 11:25:33.027 MDT [22640]: [15-1] DEBUG: running transaction data initialized
2011-10-26 11:25:33.027 MDT [22640]: [16-1] CONTEXT: xlog redo running xacts: nextXid 11855637 latestCompletedXid 11855636 oldestRunningXid 11855250; 195 xacts: 11855359 11855360 ... 11855444 11855445; subxid ovf
2011-10-26 11:25:33.028 MDT [22640]: [17-1] DEBUG: 0 KnownAssignedXids (num=0 tail=0 head=0)
2011-10-26 11:25:33.028 MDT [22640]: [18-1] CONTEXT: xlog redo running xacts: nextXid 11855637 latestCompletedXid 11855636 oldestRunningXid 11855250; 195 xacts: 11855359 11855360 ... 11855444 11855445; subxid ovf
2011-10-26 11:25:33.028 MDT [22640]: [19-1] LOG: consistent state delayed because recovery snapshot incomplete
2011-10-26 11:25:33.028 MDT [22640]: [20-1] CONTEXT: xlog redo running xacts: nextXid 11855637 latestCompletedXid 11855636 oldestRunningXid 11855250; 195 xacts: 11855359 11855360 ... 11855444 11855445; subxid ovf
2011-10-26 11:25:33.514 MDT [22640]: [21-1] LOG: consistent recovery state reached at 7/98480E28
2011-10-26 11:25:33.515 MDT [22640]: [22-1] LOG: record with zero length at 7/98480E28
2011-10-26 11:25:33.515 MDT [22640]: [23-1] DEBUG: could not open file "pg_xlog/000000010000000700000098" (log file 7, segment 152): No such file or directory
2011-10-26 11:25:33.515 MDT [22642]: [1-1] DEBUG: find_in_dynamic_libpath: trying "/usr/local/postgresql-9.1.1/lib/libpqwalreceiver"
2011-10-26 11:25:33.515 MDT [22642]: [2-1] DEBUG: find_in_dynamic_libpath: trying "/usr/local/postgresql-9.1.1/lib/libpqwalreceiver.so"
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: START_REPLICATION 7/98000000
2011-10-26 11:25:33.519 MDT [22642]: [3-1] LOG: streaming replication successfully connected to primary
2011-10-26 11:25:33.519 MDT [22642]: [4-1] DEBUG: sending write 0/0 flush 0/0 apply 7/98480E28
2011-10-26 11:25:33.522 MDT [22642]: [5-1] DEBUG: sending write 7/98020000 flush 0/0 apply 7/98480E28
2011-10-26 11:25:33.568 MDT [22642]: [6-1] DEBUG: sending write 7/98020000 flush 7/98020000 apply 7/98480E28
2011-10-26 11:25:33.569 MDT [22642]: [7-1] DEBUG: sending write 7/98040000 flush 7/98020000 apply 7/98480E28
...
2011-10-26 11:25:33.924 MDT [22642]: [49-1] DEBUG: sending write 7/98460000 flush 7/98400000 apply 7/98480E28
2011-10-26 11:25:33.943 MDT [22642]: [50-1] DEBUG: sending write 7/98460000 flush 7/98460000 apply 7/98480E28
DEBUG: standby "sync_rep_test" has now caught up with primary
2011-10-26 11:25:33.943 MDT [22642]: [51-1] DEBUG: sending write 7/98480E28 flush 7/98460000 apply 7/98480E28
2011-10-26 11:25:33.959 MDT [22642]: [52-1] DEBUG: sending write 7/98480E28 flush 7/98480E28 apply 7/98480E28
2011-10-26 11:25:34.192 MDT [22634]: [33-1] DEBUG: forked new backend, pid=22644 socket=7
2011-10-26 11:25:34.192 MDT [22644]: [1-1] FATAL: the database system is starting up
2011-10-26 11:25:34.192 MDT [22644]: [2-1] DEBUG: shmem_exit(1): 0 callbacks to make
2011-10-26 11:25:34.192 MDT [22644]: [3-1] DEBUG: proc_exit(1): 1 callbacks to make
2011-10-26 11:25:34.192 MDT [22644]: [4-1] DEBUG: exit(1)
2011-10-26 11:25:34.192 MDT [22644]: [5-1] DEBUG: shmem_exit(-1): 0 callbacks to make
2011-10-26 11:25:34.192 MDT [22644]: [6-1] DEBUG: proc_exit(-1): 0 callbacks to make
2011-10-26 11:25:34.192 MDT [22634]: [34-1] DEBUG: server process (PID 22644) exited with exit code 1
2011-10-26 11:25:35.171 MDT [22634]: [35-1] DEBUG: forked new backend, pid=22655 socket=7
2011-10-26 11:25:35.171 MDT [22655]: [1-1] FATAL: the database system is starting up
2011-10-26 11:25:35.171 MDT [22655]: [2-1] DEBUG: shmem_exit(1): 0 callbacks to make
2011-10-26 11:25:35.171 MDT [22655]: [3-1] DEBUG: proc_exit(1): 1 callbacks to make
2011-10-26 11:25:35.171 MDT [22655]: [4-1] DEBUG: exit(1)
2011-10-26 11:25:35.171 MDT [22655]: [5-1] DEBUG: shmem_exit(-1): 0 callbacks to make
2011-10-26 11:25:35.171 MDT [22655]: [6-1] DEBUG: proc_exit(-1): 0 callbacks to make
2011-10-26 11:25:35.172 MDT [22634]: [36-1] DEBUG: server process (PID 22655) exited with exit code 1
and so on...