Re: Standby is not removing restored WAL segments

Guillaume Lelarge <guillaume@xxxxxxxxxxxx> · Mon, 15 Sep 2014 17:33:32 +0200

Hi,

2014-09-05 9:33 GMT+02:00 Alexey Klyukin <alexk@xxxxxxxxxxxx>:
Greetings,

We've got a 9.3.5 DB running in standby mode for a fairly large DB
(500GB) with busy WAL traffic (a couple of GB per hour), and it
occasionally 'forgets' to remove the segments it restored.

checkpoint_segments is set to 128, and we usually observe around
270 accumulated segments, but when the problem happens our check
triggers at around 2K segments. A manual CHECKPOINT command takes
ages to complete there, a fast shutdown is very slow (around 10
minutes, where it usually takes less than 1 minute), and the WAL
receiver process is also unable to run for some reason.

The only way to make this host delete WAL files is to restart it. The
particularly notable restartpoint right after the shutdown shows
quite a number of removed files and buffers written (shared_buffers
is set to 8GB on this system):

2014-09-04 14:39:33.376 CEST,,,22354,,537a4553.5752,88217,,2014-05-19 19:54:27 CEST,,0,LOG,00000,"restartpoint complete: wrote 332473 buffers (31.7%); 0 transaction log file(s) added, 1237 removed, 6 recycled; write=9.745 s, sync=680.314 s, total=694.447 s; sync files=499, longest=37.774 s, average=1.363 s",,,,,,,,,""
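
To compare this restartpoint with others in the same log, the numbers can be pulled out of the message with a small script. This is only a rough sketch: the regular expression is hypothetical and written against the 9.3 message text quoted above, so other versions or locales may need a different pattern.

```python
import re

# Message part of the log line quoted above (copied from the post).
LOG_MESSAGE = (
    "restartpoint complete: wrote 332473 buffers (31.7%); "
    "0 transaction log file(s) added, 1237 removed, 6 recycled; "
    "write=9.745 s, sync=680.314 s, total=694.447 s; "
    "sync files=499, longest=37.774 s, average=1.363 s"
)

# Hypothetical pattern, written for the message format shown above.
PATTERN = re.compile(
    r"restartpoint complete: wrote (?P<buffers>\d+) buffers \((?P<pct>[\d.]+)%\); "
    r"(?P<added>\d+) transaction log file\(s\) added, "
    r"(?P<removed>\d+) removed, (?P<recycled>\d+) recycled; "
    r"write=(?P<write>[\d.]+) s, sync=(?P<sync>[\d.]+) s, "
    r"total=(?P<total>[\d.]+) s; "
    r"sync files=(?P<sync_files>\d+), longest=(?P<longest>[\d.]+) s, "
    r"average=(?P<average>[\d.]+) s"
)

INT_FIELDS = {"buffers", "added", "removed", "recycled", "sync_files"}


def parse_restartpoint(message):
    """Return the numeric fields of a 'restartpoint complete' message, or None."""
    match = PATTERN.search(message)
    if match is None:
        return None
    return {
        key: (int(value) if key in INT_FIELDS else float(value))
        for key, value in match.groupdict().items()
    }


stats = parse_restartpoint(LOG_MESSAGE)
sync_share = stats["sync"] / stats["total"]
print(f"removed={stats['removed']} sync share of total={sync_share:.1%}")
```

For the line above, the sync phase accounts for about 98% of the total restartpoint time, while the write phase takes under 10 seconds.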

If we leave the host running, this restartpoint never happens.

The only difference I can come up with from the other databases that
do not show this behavior is that the host is running with
max_standby_streaming_delay and max_standby_archive_delay set to -1,
but at the time we observed the problem no queries were running on it
at all.

The problem occurs rarely, but steadily, around once every 3 months.
During this time PostgreSQL has been upgraded from 9.0 to 9.3,
which did not solve the issue.

Any clues on how we can debug and diagnose the problem further to come
up with a proper bug report, if it is a bug, or are we missing
something in the configuration that causes this?

I have no direct answer for you, but we seem to have the same issue at two of our customers; one of them is on 9.2.8. Do you know whether there are .ready files in the archive_status directory? Are they for old WAL files? Can you tell us their names?
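
If it helps, something like the following could collect that information. It is only a rough sketch: it assumes the 9.x layout, where archive_status lives under pg_xlog in the data directory, and both the DATA_DIR path and the list_ready_files helper are placeholders I made up for illustration.

```python
import time
from pathlib import Path

# Placeholder: replace with the standby's actual data directory.
DATA_DIR = "/path/to/standby/data"


def list_ready_files(data_dir):
    """Return (segment name, mtime) for each .ready file, oldest first."""
    # Assumes the 9.x layout: <data_dir>/pg_xlog/archive_status
    status_dir = Path(data_dir) / "pg_xlog" / "archive_status"
    if not status_dir.is_dir():
        return []
    entries = [
        (path.name[: -len(".ready")], path.stat().st_mtime)
        for path in status_dir.glob("*.ready")
    ]
    # Sort by modification time, then by name to keep the order stable.
    return sorted(entries, key=lambda entry: (entry[1], entry[0]))


for name, mtime in list_ready_files(DATA_DIR):
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
    print(f"{stamp}  {name}")
```

The oldest entries come first, which should make it easy to see whether the .ready files belong to WAL segments that were restored long ago.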

We're still investigating the issue. Not that it's a real issue, but it's still weird. And we'd like to understand what's happening.

-- 
Guillaume.
  http://blog.guillaume.lelarge.info
  http://www.dalibo.com