On Fri, Aug 29, 2014 at 2:11 PM, Andres Freund <andres@xxxxxxxxxxxxxxx> wrote:
We didn't have a failover. We just have one master and replica in different AZs on Amazon EC2.
On 2014-08-29 13:04:43 -0700, Patrick Krecker wrote:The problem is this log entry:
> Hi Craig -- Sorry for the late response, I've been tied up with some other
> things for the last day. Just to give some context, this is a machine that
> sits in our office and replicates from another read slave in production via
> a tunnel set up with spiped. The spiped tunnel is working and postgres is
> still stuck (it has been stuck since 8-25).
>
> The last moment that replication was working was 2014-08-25
> 22:06:05.03972. We have a table called replication_time with one column and
> one row that has a timestamp that is updated every second, so it's easy to
> tell the last time this machine was in sync with production.
>
> recovery.conf: http://pastie.org/private/dfmystgf0wxgtmahiita
> logs: http://pastie.org/private/qt1ixycayvdsxafrzj0l0q
2014-08-27 18:44:27 PDT ERROR: requested starting point 175/28000000 is
ahead of the WAL flush position of this server 174/B76D16A8
That's the walreceiver connecting to the upstream node and askign for
wAL. Somehow the standby has gotten *considerably* ahead of the node
it's trying to receive WAL from.
Are you sure primary_conninfo / the spiped tunnel connects to the right
postgres instance? Did you possibly have a failover or something like that?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
I ran the following on the local endpoint of spiped:
while [ true ]; do psql -h localhost -p 5445 judicata -U marbury -c "select modtime, pg_last_xlog_receive_location(), pg_last_xlog_replay_location() from replication_time"; done;
And the same command on production and I was able to verify that the xlogs for a given point in time were the same (modtime is updated every second by an upstart job):
spiped from office -> production:
modtime | pg_last_xlog_receive_location | pg_last_xlog_replay_location
----------------------------+-------------------------------+------------------------------
2014-08-29 15:23:25.563766 | 177/2E80C9F8 | 177/2E80C9F8
Ran directly on production replica:
modtime | pg_last_xlog_receive_location | pg_last_xlog_replay_location
----------------------------+-------------------------------+------------------------------
2014-08-29 15:23:25.563766 | 177/2E80C9F8 | 177/2E80C9F8
To me, this is sufficient proof that spiped is indeed talking to the machine I think it's talking to (also lsof reports the correct hostname).
I created another basebackup from the currently stuck postgres intance on another machine and I also get this error:
2014-08-29 15:27:30 PDT FATAL: could not receive data from WAL stream: ERROR: requested starting point 177/2D000000 is ahead of the WAL flush position of this server 174/B76D16A8
However, this new instance is able to fetch logs from S3 and replay them without issue.
Is it possible that the data dir on the stuck instance is just corrupt? It is not impossible for this to have happened at some point in the past due to the race condition in fetching logs from S3 I mentioned above.