On Tue, Jun 25, 2019 at 11:49:03AM -0400, Hugh Ranalli wrote:
> I'm hoping people can help me figure out where to look to solve an odd
> PostgreSQL performance problem.

What kernel? Version? OS?

If Linux, I wonder if transparent hugepages or KSM are enabled? It seems
possible that truncating the table is clearing enough RAM to mitigate the
issue, similar to restarting the DB.

tail /sys/kernel/mm/ksm/run /sys/kernel/mm/transparent_hugepage/khugepaged/defrag /sys/kernel/mm/transparent_hugepage/enabled /sys/kernel/mm/transparent_hugepage/defrag

https://www.postgresql.org/message-id/20170718180152.GE17566%40telsasoft.com

11.2 would have parallel query, enabled by default. Are there other
settings you've changed (or not changed)?
https://wiki.postgresql.org/wiki/Server_Configuration

It's possible that the "administrative" queries are using up lots of your
shared_buffers, which are (also/more) needed by the customer-facing
queries. I would install pg_buffercache to investigate. Or, just pause
the admin queries and see if the issue goes away during that interval?

SELECT 1.0*COUNT(1)/sum(count(1))OVER(),
       COUNT(1),
       COUNT(nullif(isdirty,'f')),
       datname,
       COALESCE(c.relname, b.relfilenode::text),
       d.relname AS toast,
       1.0*COUNT(nullif(isdirty,'f'))/count(1) AS dirtyfrac,
       avg(usagecount)
FROM pg_buffercache b
JOIN pg_database db ON b.reldatabase = db.oid
LEFT JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
LEFT JOIN pg_class d ON c.oid = d.reltoastrelid
GROUP BY 4,5,6
ORDER BY 1 DESC
LIMIT 9;

Could you send the query plan for the slow (customer-facing) queries?
https://wiki.postgresql.org/wiki/Slow_Query_Questions#EXPLAIN_.28ANALYZE.2C_BUFFERS.29.2C_not_just_EXPLAIN

> A bit of background: We have a client with a database of approximately
> 450 GB, which has a couple of tables storing large amounts of text,
> including full HTML pages from the Internet. Last fall, they began
> experiencing dramatic and exponentially decreasing performance.
> We track certain query times, so we know how much time is being spent
> in calls to the database for these functions. When this began, the
> times went from an average of approximately 200 ms to 400 ms, climbing
> rapidly each day before reaching 900 ms, figures we had never seen
> before, within 4 days, with no appreciable change in usage. It was at
> this point that we restarted the database server and times returned to
> the 400 ms range, but never back to their long-running original levels.
> From this point onward, we had to restart the database (originally the
> server, but eventually just the database process) every 3-4 days,
> otherwise the application became unusable.
>
> As they were still on PostgreSQL 8.2, we persuaded them to finally
> undertake our long-standing recommendation to upgrade, as there was no
> possibility of support on that platform. That upgrade to 11.2 was
> completed successfully in mid-May, and although times have not returned
> to their original levels (they now average approximately 250 ms), the
> application overall seems much more responsive and faster (application
> servers were not changed, other than minor changes --full text search,
> explicit casts, etc.-- to conform to PostgreSQL 11's requirements).
>
> What we continued to notice was a milder but still definite trend of
> increasing query times during the course of each week, from the mid to
> high 200 ms range to the high 300 ms to low 400 ms range. Some years
> ago, someone had noticed that as the number of "raw_page" columns in a
> particular table grew, performance would decline. They wrote a script
> that once a week locks the table, deletes the processed large columns
> (they are not needed after processing), copies the remaining data to a
> backup table, truncates the original table, then copies it back. When
> this script runs we see an immediate change in performance, from 380 ms
> in the hour before the drop, to 250 ms in the hour of the drop.
> As rows with these populated columns are added during the course of a
> week, the performance drops steadily until the next week's cleaning
> operation. Each week the performance increase is clear and significant.
>
> What is perplexing (and I have triple-checked) is that this table is
> *not* referenced in any way in the queries that we time (it is
> referenced by ongoing administrative and processing queries). The
> operation that cleans it frees up approximately 15-20 GB of space each
> week. Our system monitoring shows this change in free disk space, but
> this is 20 GB out of approximately 300 GB of free space (free space is
> just under 40% of the volume size), so disk space does not seem to be
> an issue. The table in question is about 21 GB in size, with about
> 20 GB in TOAST data, at its largest.
>
> Even odder, the queries we time *do* reference a much larger table,
> which contains very similar data, and multiple columns of it. It is
> 355 GB in size, with 318 GB in TOAST data. It grows continually, with
> no cleaning.
>
> If anyone has any suggestions as to what sort of statistics to look at,
> or why this would be happening, they would be greatly appreciated.
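PPS: re the query plans, what the wiki page is after is the output of
this form, with one of your actual timed, customer-facing queries in
place of the hypothetical placeholder:

```sql
-- Placeholder query; substitute a real customer-facing query.
-- BUFFERS distinguishes shared-buffer hits from disk reads, which is
-- the interesting part here.
EXPLAIN (ANALYZE, BUFFERS)
SELECT ... ;
```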
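PS: to answer the hugepage/KSM question above in one go, here's a small
untested sketch (plain POSIX shell); the fallback is just so it doesn't
error out on kernels or builds that don't expose these files:

```shell
# Print each KSM/THP setting named above.  The files are absent on
# non-Linux systems or kernels built without these features, hence the
# "not present" fallback.
thp_report=""
for f in /sys/kernel/mm/ksm/run \
         /sys/kernel/mm/transparent_hugepage/khugepaged/defrag \
         /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/transparent_hugepage/defrag; do
    val=$(cat "$f" 2>/dev/null) || val="not present"
    thp_report="$thp_report$f: $val
"
done
printf '%s' "$thp_report"
```

On the THP files, the bracketed word in the output (e.g. "[always]
madvise never") is the active setting.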