On Fri, Sep 6, 2013 at 1:26 PM, John Lumby <johnlumby@xxxxxxxxxxx> wrote:
> We use logshipping replication, and have recently noticed a nasty bug
> where, in certain very rare cases, the primary archive_command program
> will fail to send the WAL file to the standby but report good return code 0 to postgresql.
> In such cases, if the standby then triggers its termination of recovery mode,
> it will come up in normal accessible mode but missing the log records from that last WAL file.
>
> This is a bug in our code which we will fix, but I am wondering if it means there is a possibility
> of worse than missing some updates. I.e. could it result in this was-standby cluster now having
> a corrupt database (e.g. an index entry with no matching heap slot or something like that - or worse)?

As long as the standby reached consistency in the first place, it should not lose it due to this issue. Once consistency is reached, changes to the data files are driven only by replay of the WAL records, and those should only take the database from one consistent state to another.

Where you risk corruption is if the problem occurred while you were taking the base backup. In that case some of the base files that were copied might already contain data from the "future", but that future cannot be reached because recovery stops early due to the lost file. The database should detect this situation and refuse to start, forcing you to retake the base backup or use an earlier one. But there were known bugs in this general area, some of them fixed in 9.2.3.

Cheers,

Jeff
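
P.S. For illustration only, here is a minimal sketch of an archive step that reports success only after verifying the copy, which is the property the failing archive_command above was missing. It assumes the archive is a locally mounted directory; the names archive_wal and sha256_of are hypothetical and not part of PostgreSQL, and a real setup that ships files over the network would need its own verification.

```python
import hashlib
import os
import shutil
import sys


def sha256_of(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_wal(src, dest_dir):
    """Copy one WAL segment into dest_dir (hypothetical helper).

    Returns 0 only when the archived copy is verified to match the
    source, and 1 on any failure, so a caller can pass the value
    straight through as the process exit code.
    """
    dest = os.path.join(dest_dir, os.path.basename(src))
    tmp = dest + ".tmp"
    try:
        if os.path.exists(dest):
            # Never overwrite an archived segment. Report success only
            # if the existing copy is byte-for-byte identical.
            return 0 if sha256_of(dest) == sha256_of(src) else 1

        # Copy to a temporary name and force it to disk first, so a
        # partially written file never appears under the final name.
        with open(src, "rb") as fin, open(tmp, "wb") as fout:
            shutil.copyfileobj(fin, fout)
            fout.flush()
            os.fsync(fout.fileno())

        # Verify the copy before claiming success.
        if sha256_of(tmp) != sha256_of(src):
            os.remove(tmp)
            return 1

        os.replace(tmp, dest)
        return 0
    except OSError as exc:
        print(f"archive failed for {src}: {exc}", file=sys.stderr)
        # Best-effort cleanup of a leftover temporary file.
        try:
            if os.path.exists(tmp):
                os.remove(tmp)
        except OSError:
            pass
        return 1
```

A small command-line wrapper would need to end with sys.exit(archive_wal(path, dest_dir)) so that the non-zero result actually reaches PostgreSQL, which retries the segment as long as archive_command keeps failing.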