What am I doing wrong?

François Beausoleil <francois@xxxxxxxxxxx> · Mon, 24 Sep 2012 20:51:11 -0400

I'm in the single-slave scenario, with hot standby capabilities, meaning I want to run queries on the slave. I'm running some tests to evaluate pgbarman, on Ubuntu 11.10. I used only packaged PostgreSQL, and I'm running version "PostgreSQL 9.1.5 on x86_64-pc-linux-gnu, compiled by gcc-4.6.real (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1, 64-bit". Both the master and the slave are running on the same host.

master/postgresql.conf

port = 5432
archive_mode = on
wal_level = hot_standby
max_wal_senders = 3
wal_keep_segments = 256
archive_command = '/bin/cp --verbose %p /var/pgexchange/%f'

master/pg_hba.conf (as I said, testing config only):

host    replication     postgres        127.0.0.1/32            trust

slave/postgresql.conf:
port = 5433
hot_standby = on
hot_standby_feedback = on
max_standby_archive_delay = -1
max_standby_streaming_delay = -1

slave/pg_hba.conf -- all at default

/var/lib/postgresql/9.1/slave0/recovery.conf:

standby_mode = on
restore_command = '/bin/cp --verbose /var/pgexchange/%f %p' 
primary_conninfo = 'host=localhost port=5432 user=postgres password=supersecretpassword'

The slave's log says it's connected to the master, but I can't connect to the slave:

# psql -h localhost -p 5433 -U postgres -d mydb
psql: FATAL:  the database system is starting up
FATAL:  the database system is starting up

The slave's log, after a fresh pg_basebackup + restore for the slave, contains:

==> /var/log/postgresql/postgresql-9.1-slave0.log <==
2012-09-25 00:46:22 UTC LOG:  database system was interrupted; last known up at 2012-09-25 00:44:20 UTC
2012-09-25 00:46:22 UTC LOG:  creating missing WAL directory "pg_xlog/archive_status"
2012-09-25 00:46:22 UTC LOG:  entering standby mode
`/var/pgexchange/000000010000000000000016' -> `pg_xlog/RECOVERYXLOG'
2012-09-25 00:46:22 UTC LOG:  restored log file "000000010000000000000016" from archive
2012-09-25 00:46:23 UTC LOG:  redo starts at 0/16000020
2012-09-25 00:46:23 UTC LOG:  consistent recovery state reached at 0/17000000
/bin/cp: cannot stat `/var/pgexchange/000000010000000000000017': No such file or directory
2012-09-25 00:46:23 UTC LOG:  incomplete startup packet
2012-09-25 00:46:23 UTC LOG:  streaming replication successfully connected to primary
2012-09-25 00:46:23 UTC FATAL:  the database system is starting up
2012-09-25 00:46:24 UTC FATAL:  the database system is starting up
2012-09-25 00:46:24 UTC FATAL:  the database system is starting up

The "the database system is starting up" messages come from the pg_ctlcluster script, which tries to connect to the database to check whether the server is up and available. According to the log above, a consistent recovery state is reached, and the slave connects to the primary. While the slave reconnects, the master logs no messages.
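
For reading the log above, it can help to map the archived segment file names onto WAL locations. This is only a minimal Python sketch, assuming the default 16 MB segment size and the usual 24-hex-digit file name (timeline, log id, segment number). `segment_start` is a hypothetical helper written for this illustration, not something shipped with PostgreSQL.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB WAL segments


def segment_start(filename):
    """Return (timeline, start location) for a 24-hex-digit WAL segment name.

    Hypothetical helper for illustration only.
    """
    if len(filename) != 24:
        raise ValueError("expected 24 hex digits, got %r" % filename)
    timeline = int(filename[0:8], 16)    # first 8 digits: timeline id
    log_id = int(filename[8:16], 16)     # next 8 digits: high part of the location
    segment = int(filename[16:24], 16)   # last 8 digits: segment number within the log
    return timeline, "%X/%X" % (log_id, segment * SEGMENT_SIZE)


# The two segment names that appear in the slave's log.
print(segment_start("000000010000000000000016"))
print(segment_start("000000010000000000000017"))
```

Under those assumptions, segment ...16 starts at 0/16000000 and segment ...17 starts at 0/17000000, which is the location where the log reports the consistent state. That would fit with the failed cp being for a segment the master has not finished and archived yet.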

On the master, pg_stat_replication looks fine:

# select * from pg_stat_replication ;
 procpid | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
---------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
   27920 |       10 | postgres | walreceiver      | 127.0.0.1   |                 |       52193 | 2012-09-25 00:46:23.100631+00 | streaming | 0/17000000    | 0/17000000     | 0/17000000     | 0/17000000      |             0 | async

state == streaming; sent == write == flush == replay, so the slave seems to be consistent.
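
The same comparison can be written down as a small check. This is a sketch only: the row is copied by hand from the pg_stat_replication output above, and `parse_location` and `standby_caught_up` are hypothetical names, not PostgreSQL functions. Locations are compared as (high, low) pairs of hex numbers, so nothing is assumed about how they map to byte offsets.

```python
def parse_location(text):
    """Split an 'X/Y' WAL location into a tuple of two integers (both hex)."""
    high, low = text.strip().split("/")
    return int(high, 16), int(low, 16)


def standby_caught_up(row):
    """True when the sent, write, flush and replay locations are identical."""
    columns = ("sent_location", "write_location",
               "flush_location", "replay_location")
    return len({parse_location(row[c]) for c in columns}) == 1


# Values copied from the pg_stat_replication row shown above.
row = {
    "state": "streaming",
    "sent_location": "0/17000000",
    "write_location": "0/17000000",
    "flush_location": "0/17000000",
    "replay_location": "0/17000000",
}

print(row["state"] == "streaming" and standby_caught_up(row))
```

Because the parsed tuples compare in order, the same helper could also be used to tell which of two locations is further ahead.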

What am I missing here?

Thanks!
François

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)