Re: Autovacuum issues with truncate and create index ...

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Baptiste LHOSTE wrote:

>> Just so we know how to interpret that, how many minutes, hours,
>> or days did you wait to see whether it would ever end?
> 
> I have waiting for 15 minutes in this state. I can not wait more
> time without losing some data for our client.

Thanks. I wasn't suggesting you increase the duration; I just
wanted perspective on whether it could be the result of unusually
long run times rather than blocking, and how severe that increase
was known ot be.

>> If it doesn't cause too much pain to let it get into this state
>> for a few minutes, it might help diagnose the issue if you could
>> start `vmstat 1` before you let it get into this state, and
>> capture `ps aux | postgres`, pg_stat_activity, and pg_locks at
>> intervals while it is in this state. Looking at all of the above
>> might suggest a cause. If we can find the cause, we can almost
>> certainly fix it.

> [information captured as requested]

Thank you very much, With that much information we should be much
better able to get a sense of the nature of the problem. It will
take a while to sift through it and properly analyze it. But even
on a first pass I think there is a lot of valuable information that
jumps out:

(1) There is no sign of unusual pressure on OS memory; OS cache
usage remains stable from before the incident to the end of the
monitored period.

(2) Disk wait time climbed early in the incident and remained high
to the end.

(3) Disk read volume *before* the incident shows a peak of about
the same as during the incident, with somewhat lower disk wait
time. (Do we know what was causing that? It ended before the other
results were captured.)

(4) Not a single incident of blocking was captured in any of the
lists of pg_stat_activity or pg_locks.

(5) The TRUNCATE statements completed and moved on to CREATE INDEX,
which continued to accrue CPU time during the episode.

(6) Of the three autovacuum worker processes, two were running just
an ANALYZE on every sample, and were moving on from table to table.

(7) One autovacuum process was running VACUUM ANALYZE against a
single table for the entire duration of the incident. It was slowly
accumulating CPU time during the incident.

On the face of it, it appears that with your normal production
settings your storage system is right at the edge of what it can
handle, and making autovacuum more aggressive to try to keep the
statistics on the second type of table more up-to-date is pushing
the load past its capacity. You might be able to change the
autovacuum thresholds and scale factors without changing
autovacuum_vacuum_cost_delay and autovacuum_vacuum_cost_limit (or
making smaller changes to them). You could probably throw hardware
at it to fix the problem. Even with settings which work fine when
everything is up-to-date you may experience some impact on
production when you frist turn it on and autovacuum is attempting
to "catch up".

I'm not actually clear, when I look back, at what the problem is
that you are trying to solve -- you say that a particular type of
query is running for 2 to 3 minutes, and note that statistics on a
particular type of table are only being re-sampled once every 5 to
6 days. It's not clear that more frequent statistical sampling of
the tables would change the plans. Perhaps you should post one such
query to the performance list, with supporting data, and see
whether someone can suggest a way to speed that query.

http://wiki.postgresql.org/wiki/SlowQueryQuestions

-Kevin


-- 
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin


[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux