Re: WAL receive process dies

Patrick Krecker <patrick@xxxxxxxxxxxx> · Fri, 29 Aug 2014 13:04:43 -0700

Hi Craig -- Sorry for the late response; I've been tied up with other things for the last day. For context, this is a machine that sits in our office and replicates from another read slave in production via a tunnel set up with spiped. The spiped tunnel is working, but postgres is still stuck (it has been stuck since 8-25).

The last moment that replication was working was 2014-08-25 22:06:05.03972. We have a table called replication_time with one column and one row, holding a timestamp that is updated every second, so it's easy to tell the last time this machine was in sync with production.
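For anyone who wants to alert on this kind of heartbeat, here is a minimal sketch of the lag calculation. It assumes the timestamp has already been read from replication_time as text and that both times are in the same timezone; the function names and the 30-second threshold are made up for illustration and are not part of our setup.

```python
from datetime import datetime, timedelta


def parse_heartbeat(text):
    """Parse a timestamp as printed by psql, e.g. '2014-08-25 22:06:05.03972'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def replication_lag(last_heartbeat, now):
    """Return how far the heartbeat is behind `now` (never negative)."""
    return max(now - last_heartbeat, timedelta(0))


def is_stale(last_heartbeat, now, threshold=timedelta(seconds=30)):
    """True when the heartbeat is older than the (hypothetical) threshold."""
    return replication_lag(last_heartbeat, now) > threshold
```

Since the heartbeat is written every second on the master, a lag of more than a few seconds on the replica already suggests that replay has stalled.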

recovery.conf: http://pastie.org/private/dfmystgf0wxgtmahiita
logs: http://pastie.org/private/qt1ixycayvdsxafrzj0l0q

Currently the WAL receive process is still not running. Interestingly, another pg instance running on the same machine is replicating just fine.

A note about that: with two instances on the same machine there is a definite race condition in restore_wal_s3.py, which writes the file to /tmp before copying it to the destination requested by postgres (I just discovered this today; this is not generally how we run our servers). So, if both are restoring at the same time, they will step on each other's WAL archives being unzipped in /tmp and bad things will happen. But, interestingly, I checked the logs for the other machine and there is no activity on that day. It does not appear that the WAL replay was invoked or that the WAL receive timed out.
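One way to avoid that kind of collision is to give every invocation its own scratch directory. The sketch below is not the actual restore_wal_s3.py; `restore_segment` and the `fetch` callable (standing in for whatever downloads and unzips the segment) are hypothetical names used only to show the idea.

```python
import os
import shutil
import tempfile


def restore_segment(fetch, wal_name, dest_path):
    """Stage one WAL segment in a private directory, then copy it to dest_path.

    `fetch(wal_name, staged_path)` is a hypothetical callable that writes the
    decompressed segment to staged_path.
    """
    # mkdtemp creates a uniquely named directory, so two restores running at
    # the same time never share a staging path under /tmp.
    scratch = tempfile.mkdtemp(prefix="restore_wal_")
    try:
        staged = os.path.join(scratch, wal_name)
        fetch(wal_name, staged)
        shutil.copyfile(staged, dest_path)
    finally:
        # Remove the staging directory whether or not the restore succeeded.
        shutil.rmtree(scratch, ignore_errors=True)
```

If `fetch` raises, the exception propagates, so a wrapper script can still exit non-zero and let postgres treat the restore as failed.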

As for enabling core dumps, it seems that needs to be done when Postgres starts, and I thought I would leave it running in its "stuck" state for now. However, if you know how to enable it on a running process, let me know. We are running Ubuntu 13.10.

On Wed, Aug 27, 2014 at 11:30 PM, Craig Ringer <craig@xxxxxxxxxxxxxxx> wrote:

On 08/28/2014 09:39 AM, Patrick Krecker wrote:

> We have a periodic network connectivity issue (unrelated to Postgres)
> that is causing the replication to fail.
>
> We are running Postgres 9.3 using streaming replication. We also have
> WAL archives available to be replayed with restore_command. Typically
> when I bring up a slave it copies over WAL archives for a while before
> connecting via streaming replication.
>
> When I notice the machine is behind in replication, I also notice that
> the WAL receiver process has died. There didn't seem to be any
> information in the logs about it.

What did you search for?

Do you have core dumps enabled? That'd be a good first step. (Exactly
how to do this depends on the OS/distro/version, but you basically want
to set "ulimit -c unlimited" on some ancestor of the postmaster).
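As a rough illustration of the "ancestor of the postmaster" idea, a launcher could raise the core-file limit and then start the server, which inherits the limit. This is only a sketch: `raise_core_limit` and `start_with_core_dumps` are made-up names, and the command passed in is whatever your distro uses to start Postgres. The soft limit can only be raised as far as the hard limit, and raising the hard limit itself needs privileges.

```python
import resource
import subprocess


def raise_core_limit():
    """Raise the soft core-file limit to the hard limit for this process."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def start_with_core_dumps(argv):
    """Start argv with the raised limit; the child's descendants inherit it."""
    # preexec_fn runs in the child after fork() and before exec(), so the
    # launcher's own limits are left unchanged.
    return subprocess.Popen(argv, preexec_fn=raise_core_limit)
```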

> 1. It seems that Postgres does not fall back to copying WAL archives
> with its restore_command. I just want to confirm that this is what
> Postgres is supposed to do when its connection via streaming replication
> times out.

It should fall back.

> 2. Is it possible to restart replication after the WAL receiver process
> has died without restarting Postgres?

PostgreSQL should do so itself.

Please show your recovery.conf (appropriately redacted) and
postgresql.conf for the replica, and complete logs for the time period
of interest. You'll want to upload the logs somewhere and then link to
them; do not attach them to an email to the list.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services