Hi,
--
2014-09-05 9:33 GMT+02:00 Alexey Klyukin <alexk@xxxxxxxxxxxx>:
Greetings,
We've got a 9.3.5 server running in standby mode for a fairly large DB
(500GB) with busy WAL traffic (a couple of GB per hour), and it
occasionally 'forgets' to remove the segments it has restored.
checkpoint_segments is set to 128, and we usually observe around
270 accumulated segments, but when the problem occurs our check
triggers at around 2K segments. A manual CHECKPOINT command takes
ages to complete there, a fast shutdown is very slow (around 10
minutes, versus the usual less than 1 minute), and the WAL receiver
process is also unable to run for some reason.
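For reference, the PostgreSQL 9.3 documentation gives a rough upper bound on the number of WAL segment files a server normally keeps. The sketch below only evaluates that formula. The function name is hypothetical, and checkpoint_completion_target and wal_keep_segments are not stated in this thread, so the documented defaults (0.5 and 0) are assumed. A standby can only perform restartpoints at checkpoint records, so the bound is only an approximate baseline there.

```python
def max_expected_wal_segments(checkpoint_segments,
                              checkpoint_completion_target=0.5,
                              wal_keep_segments=0):
    """Rough upper bound on WAL segment files, per the 9.3 docs formula.

    The defaults for the last two parameters are assumptions; the
    thread does not say what this host uses.
    """
    by_checkpoints = (2 + checkpoint_completion_target) * checkpoint_segments + 1
    by_keep = checkpoint_segments + wal_keep_segments + 1
    # The documentation gives both expressions; the larger one applies.
    return int(max(by_checkpoints, by_keep))


if __name__ == "__main__":
    # checkpoint_segments = 128 is the value quoted in the report.
    print(max_expected_wal_segments(128))
```

Under those assumed defaults the bound is 321 files, which is consistent with the roughly 270 segments seen in normal operation and far below the 2K seen during the incident.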
The only way to make this host delete WAL files is to restart it. The
restartpoint right after the shutdown is particularly notable: it shows
a large number of removed files and buffers written (shared_buffers
is set to 8GB on this system):
2014-09-04 14:39:33.376 CEST,,,22354,,537a4553.5752,88217,,2014-05-19
19:54:27 CEST,,0,LOG,00000,"restartpoint complete: wrote 332473
buffers (31.7%); 0 transaction log file(s) added, 1237 removed, 6
recycled; write=9.745 s, sync=680.314 s, total=694.447 s; sync
files=499, longest=37.774 s, average=1.363 s",,,,,,,,,""
If we leave the host running, this restartpoint never happens.
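To compare restartpoints over time, the counters in such a log message can be extracted mechanically. The following is a minimal sketch, assuming the message text has the same shape as the line quoted above; the function name and the regular expression are illustrative and not part of PostgreSQL.

```python
import re

# Matches the leading part of a "restartpoint complete" message as quoted
# in this thread. The pattern is an assumption derived from that one line.
RESTARTPOINT_RE = re.compile(
    r"restartpoint complete: wrote (?P<buffers>\d+) buffers "
    r"\((?P<pct>[\d.]+)%\); "
    r"(?P<added>\d+) transaction log file\(s\) added, "
    r"(?P<removed>\d+) removed, (?P<recycled>\d+) recycled; "
    r"write=(?P<write_s>[\d.]+) s, sync=(?P<sync_s>[\d.]+) s, "
    r"total=(?P<total_s>[\d.]+) s"
)


def parse_restartpoint(message):
    """Return the counters as a dict of numbers, or None if no match."""
    match = RESTARTPOINT_RE.search(message)
    if match is None:
        return None
    return {
        key: float(value) if "." in value else int(value)
        for key, value in match.groupdict().items()
    }


if __name__ == "__main__":
    line = ("restartpoint complete: wrote 332473 buffers (31.7%); "
            "0 transaction log file(s) added, 1237 removed, 6 recycled; "
            "write=9.745 s, sync=680.314 s, total=694.447 s; "
            "sync files=499, longest=37.774 s, average=1.363 s")
    stats = parse_restartpoint(line)
    print(stats["removed"], stats["sync_s"] / stats["total_s"])
```

For the line above, the sync phase accounts for about 680 of the 694 seconds, so almost all of the restartpoint time was spent in sync rather than in writing buffers.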
The only difference I can come up with from the other databases that
do not show this behavior is that the host is running with
max_standby_streaming_delay and max_standby_archive_delay set to -1,
but at the time we observed the problem no queries were running on it
at all.
The problem occurs rarely but regularly, around once every 3 months.
During this time PostgreSQL has been upgraded from 9.0 to 9.3,
which did not solve the issue.
Any clues on how we can debug and diagnose the problem further to come
up with a proper bug report, if it is a bug? Or are we missing
something in the configuration that causes this?
I have no direct answer for you, but we seem to have the same issue at two of our customers. One of them is running 9.2.8. Do you know whether you have .ready files in the archive_status directory? Do they belong to old WAL files? Can you tell us their names?
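One way to answer these questions is to list the .ready files together with their age. This is a minimal sketch using only the Python standard library. The function name is hypothetical, and the directory is passed on the command line because its location depends on the installation (on 9.x it is pg_xlog/archive_status under the data directory).

```python
import os
import sys
import time


def list_ready_files(archive_status_dir, now=None):
    """Return (segment_name, age_in_seconds) for each .ready file, oldest first."""
    now = time.time() if now is None else now
    entries = []
    for name in os.listdir(archive_status_dir):
        if name.endswith(".ready"):
            mtime = os.path.getmtime(os.path.join(archive_status_dir, name))
            # Strip the suffix so that only the WAL segment name remains.
            entries.append((name[: -len(".ready")], now - mtime))
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return entries


if __name__ == "__main__":
    # Usage: python list_ready.py /path/to/data/pg_xlog/archive_status
    for segment, age in list_ready_files(sys.argv[1]):
        print("%s  %.1f hours old" % (segment, age / 3600.0))
```

The file modification time is used here as an approximation of when the segment was marked ready.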
We're still investigating. It's not a real problem for us, but it's still weird, and we'd like to understand what's happening.
--