Re: Streaming Replication Randomly Locking Up

Andrew Berman <rexxe98@xxxxxxxxx> · Thu, 15 Aug 2013 12:22:49 -0700

The only thing I see that is a possibility for the issue is in the slave log:
LOG:  unexpected EOF on client connection
LOG:  could not receive data from client: Connection reset by peer

I don't know if that's related or not as it could just be somebody running a query.  The log file does seem to be riddled with these but the replication failures don't happen constantly.

As far as I know I'm not swallowing any errors.  The logging is all set as the default:

log_destination = 'stderr'
logging_collector = on

#client_min_messages = notice
#log_min_messages = warning
#log_min_error_statement = error
#log_min_duration_statement = -1
#log_checkpoints = off
#log_connections = off
#log_disconnections = off
#log_error_verbosity = default

I'm going to have a look at the NICs to make sure there's no issue there.
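
In case it helps anyone compare the two positions in the ps output from my first message (quoted below), here is a rough Python sketch. It assumes the default 16 MB WAL segment size, and that the file name shown by the startup process is the segment it is currently replaying. The function names are just mine for illustration, not anything from Postgres itself.

```python
import re

# Assumption: the server was built with the default 16 MB WAL segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_wal_filename(name):
    """Split a 24-hex-digit WAL file name into (timeline, log_id, segment)."""
    if not re.fullmatch(r"[0-9A-Fa-f]{24}", name):
        raise ValueError(f"not a WAL file name: {name!r}")
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def parse_location(location):
    """Turn an xlog location such as '549/216B3730' into (log_id, segment)."""
    log_hex, offset_hex = location.split("/")
    return int(log_hex, 16), int(offset_hex, 16) // WAL_SEGMENT_SIZE


def replay_is_behind(replay_wal_file, receive_location):
    """True if the segment being replayed is older than the one being received."""
    _, log_id, segment = parse_wal_filename(replay_wal_file)
    return (log_id, segment) < parse_location(receive_location)


if __name__ == "__main__":
    # Values copied from the ps output on the slave.
    print(parse_wal_filename("000000010000053D0000003F"))
    print(parse_location("549/216B3730"))
    print(replay_is_behind("000000010000053D0000003F", "549/216B3730"))
```

With those values the startup process is on log 0x53D, segment 0x3F, while the wal receiver is at log 0x549, segment 0x21, so if my reading of the process titles is right, the slave is still receiving WAL and it is the replay side that has fallen behind.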

Thanks again for your help!

On Thu, Aug 15, 2013 at 11:51 AM, Lonni J Friedman <netllama@xxxxxxxxx> wrote:

Are you certain that there are no relevant errors in the database logs
(on both master & slave)?  Also, are you sure that you didn't
misconfigure logging such that errors wouldn't appear?

On Thu, Aug 15, 2013 at 11:45 AM, Andrew Berman <rexxe98@xxxxxxxxx> wrote:

> Hi Lonni,
>
> Yes, I am using PG 9.1.9.
> Yes, 1 slave syncing from the master
> CentOS 6.4
> I don't see any network or hardware issues (e.g. NIC) but will look more
> into this.  They are communicating on a private network and switch.
>
> I forgot to mention that after I restart the slave, everything syncs right
> back up and all is working again, so if it is a network issue, the
> replication is just stopping after some hiccup instead of retrying and
> resuming when things are back up.
>
> Thanks!
>
> On Thu, Aug 15, 2013 at 11:32 AM, Lonni J Friedman <netllama@xxxxxxxxx>
> wrote:
>>
>> I've never seen this happen.  Looks like you might be using 9.1?  Are
>> you up to date on all the 9.1.x releases?
>>
>> Do you have just 1 slave syncing from the master?
>> Which OS are you using?
>> Did you verify that there aren't any network problems between the
>> slave & master?
>> Or hardware problems (like the NIC dying, or dropping packets)?
>>
>> On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman <rexxe98@xxxxxxxxx> wrote:
>> > Hello,
>> >
>> > I'm having an issue where streaming replication just randomly stops
>> > working.  I haven't been able to find anything in the logs which points
>> > to an issue, but the Postgres process shows a "waiting" status on the
>> > slave:
>> >
>> > postgres  5639  0.1 24.3 3428264 2970236 ?     Ss   Aug14   1:54 postgres: startup process   recovering 000000010000053D0000003F waiting
>> > postgres  5642  0.0 21.4 3428356 2613252 ?     Ss   Aug14   0:30 postgres: writer process
>> > postgres  5659  0.0  0.0 177524   788 ?        Ss   Aug14   0:03 postgres: stats collector process
>> > postgres  7159  1.2  0.1 3451360 18352 ?       Ss   Aug14  17:31 postgres: wal receiver process   streaming 549/216B3730
>> >
>> > The replication works great for days, but randomly seems to lock up and
>> > replication halts.  I verified that the two databases were out of sync
>> > with a query on both of them.  Has anyone experienced this issue before?
>> >
>> > Here are some relevant config settings:
>> >
>> > Master:
>> >
>> > wal_level = hot_standby
>> > checkpoint_segments = 32
>> > checkpoint_completion_target = 0.9
>> > archive_mode = on
>> > archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'
>> > max_wal_senders = 2
>> > wal_keep_segments = 32
>> >
>> > Slave:
>> >
>> > wal_level = hot_standby
>> > checkpoint_segments = 32
>> > #checkpoint_completion_target = 0.5
>> > hot_standby = on
>> > max_standby_archive_delay = -1
>> > max_standby_streaming_delay = -1
>> > #wal_receiver_status_interval = 10s
>> > #hot_standby_feedback = off
>> >
>> > Thank you for any help you can provide!
>> >
>> > Andrew
>> >