Baptiste LHOSTE wrote:

>> Just so we know how to interpret that, how many minutes, hours,
>> or days did you wait to see whether it would ever end?
>
> I have waited for 15 minutes in this state. I cannot wait any
> longer without losing some data for our client. Thanks.

I wasn't suggesting you increase the duration; I just wanted perspective on whether it could be the result of unusually long run times rather than blocking, and how severe that increase was known to be.

>> If it doesn't cause too much pain to let it get into this state
>> for a few minutes, it might help diagnose the issue if you could
>> start `vmstat 1` before you let it get into this state, and
>> capture `ps aux | grep postgres`, pg_stat_activity, and pg_locks
>> at intervals while it is in this state. Looking at all of the
>> above might suggest a cause. If we can find the cause, we can
>> almost certainly fix it.

> [information captured as requested]

Thank you very much. With that much information we should be much better able to get a sense of the nature of the problem. It will take a while to sift through it and analyze it properly, but even on a first pass a lot of valuable information jumps out:

(1) There is no sign of unusual pressure on OS memory; OS cache usage remains stable from before the incident to the end of the monitored period.

(2) Disk wait time climbed early in the incident and remained high to the end.

(3) Disk read volume *before* the incident shows a peak of about the same size as during the incident, with somewhat lower disk wait time. (Do we know what was causing that? It ended before the other results were captured.)

(4) Not a single incident of blocking was captured in any of the lists of pg_stat_activity or pg_locks.

(5) The TRUNCATE statements completed and moved on to CREATE INDEX, which continued to accrue CPU time during the episode.

(6) Of the three autovacuum worker processes, two were running just an ANALYZE on every sample, and were moving on from table to table.

(7) One autovacuum process was running VACUUM ANALYZE against a single table for the entire duration of the incident, slowly accumulating CPU time as it went.

On the face of it, it appears that with your normal production settings your storage system is right at the edge of what it can handle, and making autovacuum more aggressive to try to keep the statistics on the second type of table more up-to-date is pushing the load past its capacity. You might be able to change the autovacuum thresholds and scale factors without changing autovacuum_vacuum_cost_delay and autovacuum_vacuum_cost_limit, or while making smaller changes to them (a sketch of per-table settings is at the end of this message). You could probably also throw hardware at it to fix the problem. Even with settings which work fine once everything is up-to-date, you may see some impact on production when you first turn them on and autovacuum is attempting to "catch up".

I'm not actually clear, when I look back, on what problem you are trying to solve -- you say that a particular type of query is running for 2 to 3 minutes, and note that statistics on a particular type of table are only being re-sampled once every 5 to 6 days. It's not clear that more frequent statistical sampling of those tables would change the plans. Perhaps you should post one such query to the performance list, with supporting data, and see whether someone can suggest a way to speed that query:

http://wiki.postgresql.org/wiki/SlowQueryQuestions
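To get a feel for how stale the statistics actually become between samples, a query against pg_stat_user_tables can be run from time to time. This is just a sketch -- it looks at every user table, so you would probably want to add a WHERE clause for the tables of the second type:

-- Rough sketch: last autovacuum/autoanalyze times plus current live/dead
-- tuple estimates for all user tables, least recently analyzed first.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       last_autoanalyze
  FROM pg_stat_user_tables
 ORDER BY last_autoanalyze NULLS FIRST;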
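Going back to the question of blocking: below is a minimal sketch of the kind of lock-wait check which could be captured at intervals during such an episode. It ignores lock modes, so it can over-report conflicts, and the pid and query column names assume PostgreSQL 9.2 or later -- older releases use procpid and current_query in pg_stat_activity:

-- Rough sketch: ungranted lock requests paired with sessions which hold a
-- lock on the same object.  Lock modes are not checked, so some of the
-- "holding" rows may not actually conflict.
SELECT waiting.pid        AS waiting_pid,
       waiting_act.query  AS waiting_query,
       holder.pid         AS holding_pid,
       holder_act.query   AS holding_query
  FROM pg_locks waiting
  JOIN pg_locks holder
    ON holder.granted
   AND holder.pid <> waiting.pid
   AND holder.locktype = waiting.locktype
   AND holder.database      IS NOT DISTINCT FROM waiting.database
   AND holder.relation      IS NOT DISTINCT FROM waiting.relation
   AND holder.transactionid IS NOT DISTINCT FROM waiting.transactionid
  JOIN pg_stat_activity waiting_act ON waiting_act.pid = waiting.pid
  JOIN pg_stat_activity holder_act  ON holder_act.pid  = holder.pid
 WHERE NOT waiting.granted;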
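And if the thresholds and scale factors do turn out to need adjusting, they can be set per table rather than server-wide, which leaves autovacuum_vacuum_cost_delay and autovacuum_vacuum_cost_limit alone for everything else. The table name and numbers below are placeholders to show the syntax, not recommendations:

-- Rough sketch: tighten autovacuum/autoanalyze triggering for one table
-- while leaving the cost-based throttling settings at their server-wide
-- values.  "second_type_table" and the numbers are placeholders only.
ALTER TABLE second_type_table SET (
    autovacuum_vacuum_scale_factor  = 0.05,
    autovacuum_vacuum_threshold     = 1000,
    autovacuum_analyze_scale_factor = 0.02,
    autovacuum_analyze_threshold    = 500
);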
-Kevin