On Sat, Aug 30, 2014 at 7:04 AM, Chris Hundt <chris@xxxxxxxxxxxx> wrote:
(One of Patrick's coworkers here; thanks a lot for your assistance.)

Just in case you wanted this as well, I ran:

psql 'replication=1 dbname=XXX host=127.0.0.1 port=5432 user=XXX password=XXX' -c 'IDENTIFY_SYSTEM;'
(5432 is the stuck replica and 5445 is the pipe to the working replica)

       systemid       | timeline |   xlogpos
---------------------+----------+--------------
 5964163898407843711 |        1 | 174/B76D16A8

On Fri, Aug 29, 2014 at 7:26 PM, Patrick Krecker <patrick@xxxxxxxxxxxx> wrote:
On Fri, Aug 29, 2014 at 3:46 PM, Andres Freund <andres@xxxxxxxxxxxxxxx> wrote:
[FWIW: proper quoting makes answering easier and thus more likely]
On 2014-08-29 15:37:51 -0700, Patrick Krecker wrote:
> I ran the following on the local endpoint of spiped:
>
> while [ true ]; do psql -h localhost -p 5445 judicata -U marbury -c "select
> modtime, pg_last_xlog_receive_location(), pg_last_xlog_replay_location()
> from replication_time"; done;
>
> And the same command on production, and I was able to verify that the xlogs
> for a given point in time were the same (modtime is updated every second by
> an upstart job):
>
> spiped from office -> production:
>           modtime           | pg_last_xlog_receive_location | pg_last_xlog_replay_location
> ----------------------------+-------------------------------+------------------------------
>  2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
>
> Ran directly on production replica:
>           modtime           | pg_last_xlog_receive_location | pg_last_xlog_replay_location
> ----------------------------+-------------------------------+------------------------------
>  2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
>
> To me, this is sufficient proof that spiped is indeed talking to the
> machine I think it's talking to (lsof also reports the correct hostname).
>
> I created another basebackup from the currently stuck postgres instance on
> another machine, and I also get this error:
>
> 2014-08-29 15:27:30 PDT FATAL: could not receive data from WAL stream:
> ERROR: requested starting point 177/2D000000 is ahead of the WAL flush
> position of this server 174/B76D16A8

Uh, this indicates that the machine you're talking to is *not* one of
the above, as it has a flush position of '174/B76D16A8' - not something
that's really possible when the node actually is at '177/2E80C9F8'.
Could you run, on the standby that's having problems, the following
command:
psql 'host=127.0.0.1 port=5445 user=XXX password=XXX' -c 'IDENTIFY_SYSTEM;'
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
RE: quoting, I wonder if Gmail is messing it up somehow? Or am I doing something else wrong? Sorry :(

First, I apologize for the misleading information: when I made another basebackup and tried to use it, I configured the machine to cascade from the stuck replica, *not* from the spiped endpoint. When I properly connected it to the spiped endpoint, it synced up fine, giving this log line:

2014-08-29 16:16:21 PDT LOG: started streaming WAL from primary at 177/4F000000 on timeline 1

The command as you gave it reported a syntax error as-is, but I googled a bit and ran this one:

psql 'replication=1 dbname=XXX host=127.0.0.1 port=5445 user=XXX password=XXX' -c 'IDENTIFY_SYSTEM;'
And it gave me this output:

       systemid       | timeline |   xlogpos
---------------------+----------+--------------
 5964163898407843711 |        1 | 177/53091990
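(Aside, for anyone following along who is puzzling over the "is ahead of the WAL flush position" error: an xlogpos like 177/53091990 is a 64-bit position printed as two hex halves, so the comparison Postgres is making can be sketched in a few lines of Python. The helper name below is made up for illustration, not part of any API.)

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a text LSN like '177/2E80C9F8' to a 64-bit integer.

    The text form is two hexadecimal numbers: the high and low
    32 bits of the position, separated by a slash.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# The standby's requested starting point vs. the flush position
# reported in the error message from this thread:
requested = lsn_to_int("177/2D000000")
flush = lsn_to_int("174/B76D16A8")
assert requested > flush  # hence "ahead of the WAL flush position"
```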
I restarted the server and it promptly started fetching WAL archives from our backups [1]. I am still at a loss as to why the binary replication couldn't reconnect, and why Postgres didn't fall back to fetching WAL archives with its restore_command.
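(For context, the archive fallback mentioned above is wired up in the standby's recovery.conf; a sketch of the relevant settings as they might look in a setup like this one. The archive path is a placeholder, not taken from the thread.)

```
standby_mode = 'on'
# Stream from the spiped endpoint to the working replica.
primary_conninfo = 'host=127.0.0.1 port=5445 user=XXX password=XXX'
# When streaming fails, recovery should fall back to restore_command
# to fetch archived WAL segments.
restore_command = 'cp /path/to/wal_archive/%f %p'
recovery_target_timeline = 'latest'
```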