On Wed, Jun 26, 2019 at 03:00:43PM -0400, Hugh Ranalli wrote:
> Pausing the admin queries isn't an option in our environment, especially as
> the issue reveals itself over the course of days, not minutes or hours.

Perhaps you can pause it for a short while at EOW and see if there's a
dramatic improvement?

>         ?column?        |  count  | count |  datname  |        coalesce         |     toast      |         dirtyfrac          |        avg
> ------------------------+---------+-------+-----------+-------------------------+----------------+----------------------------+--------------------
>  0.24904101286779650995 | 1044545 |     0 | mydb      | position                |                | 0.000000000000000000000000 | 4.8035517857057379
>  0.16701241622795295199 |  700495 |     0 | mydb      | stat_position_click     |                | 0.000000000000000000000000 | 1.9870234619804567
>  0.09935032779251879171 |  416702 |  6964 | mydb      | pg_toast_19788          | harvested_job  |     0.01671218280689797505 | 1.9346079452462431
>  0.06979762146872315533 |  292750 |     0 | mydb      | url                     |                | 0.000000000000000000000000 | 4.9627873612297182
>  0.03795774662998486745 |  159205 |     0 | mydb      | stat_sponsored_position |                | 0.000000000000000000000000 | 1.8412361420809648
>  0.02923155381784048663 |  122605 |     0 | mydb      | pg_toast_20174          | page           | 0.000000000000000000000000 | 3.0259532645487541
>  0.02755283459406156353 |  115564 |     0 | mydb      | location                |                | 0.000000000000000000000000 | 4.9953532241874632
>  0.02015273698468076320 |   84526 |  1122 | mydb      | harvested_job           |                |     0.01327402219435439983 | 4.9922154130090150
>  0.01913348905375406298 |   80251 |     0 | mydb      | pg_toast_20257          | position_index | 0.000000000000000000000000 | 4.9880001495308470
>
> harvested_job is the rapidly growing "problematic" table I am talking
> about. page is the 355 GB table that gets referenced on the public
> searches. I'll google, but is there a place I should look to understand
> what I am seeing here?

I should label the columns:
| buffer_fraction | nbuffers | ndirty | datname | relname | toast | dirtyfrac | avgusage

It looks like harvested_job is possibly being index scanned, and its toast
table is using up many buffers.  By the EOW, maybe those buffers come at the
expense of more important data.  You could check pg_stat_user_tables/indexes
for stats on that.  Possibly you could make use of index-only scans using
covering indexes (pg11 supports INCLUDE).  Or maybe it's just too big (maybe
it should be partitioned, or maybe the index should be repacked?).

> Also, should pg_buffercache perhaps be run at the beginning and end of the
> week, to see if there is a significant difference?

Yes; buffercache can be pretty volatile, so I'd save it numerous times each
at the beginning and end of the week.

> > Could you send the query plan for the slow (customer-facing) queries?
> >
> > https://wiki.postgresql.org/wiki/Slow_Query_Questions#EXPLAIN_.28ANALYZE.2C_BUFFERS.29.2C_not_just_EXPLAIN
>
> I can, but can I ask why this would matter?

My very tentative guess is that harvested_job itself isn't the issue, but
that some other, third thing is, one which also grows (at least roughly)
with time, just as that table does.  It'd help to see the buffer cache hit
rate for that query (and its different query plan nodes) at the beginning
and end of the week.

Justin
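P.S. For reference, a query along these lines produces that kind of
buffercache summary (this is a sketch, not necessarily exactly what you ran;
it needs the pg_buffercache extension and should be run from within mydb,
since relations from other databases won't resolve to names):

SELECT count(*)::numeric /
         (SELECT setting::int FROM pg_settings
           WHERE name = 'shared_buffers') AS buffer_fraction,
       count(*) AS nbuffers,
       count(*) FILTER (WHERE b.isdirty) AS ndirty,
       d.datname,
       coalesce(c.relname, b.relfilenode::text) AS relname,
       t.relname AS toast,   -- parent table, if the relation is a toast table
       count(*) FILTER (WHERE b.isdirty)::numeric / count(*) AS dirtyfrac,
       avg(b.usagecount) AS avgusage
FROM pg_buffercache b
LEFT JOIN pg_database d ON d.oid = b.reldatabase
LEFT JOIN pg_class c ON c.relfilenode = b.relfilenode
LEFT JOIN pg_class t ON t.reltoastrelid = c.oid
GROUP BY 4, 5, 6
ORDER BY 1 DESC
LIMIT 9;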
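To see how harvested_job and its indexes are actually being accessed
(run in mydb; the counters are cumulative since the last stats reset):

SELECT seq_scan, seq_tup_read, idx_scan, idx_tup_fetch,
       n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'harvested_job';

SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'harvested_job';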
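For the index-only scan idea, a covering index on pg11 looks like this; the
column names here are made up, so substitute whatever your queries actually
filter on and return:

CREATE INDEX CONCURRENTLY harvested_job_covering_idx
    ON harvested_job (posted_date) INCLUDE (title, url);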
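An easy way to save buffercache numerous times is to dump it into a table
with a timestamp (the table name is just an example):

CREATE TABLE buffercache_snapshot AS
SELECT now() AS snap_time, * FROM pg_buffercache LIMIT 0;

-- then from cron/psql, every hour or so at the beginning and end of the week:
INSERT INTO buffercache_snapshot
SELECT now(), * FROM pg_buffercache;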
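And when you collect the customer-facing plans, the form from the wiki page
above is the one that shows shared hit/read counts per plan node:

EXPLAIN (ANALYZE, BUFFERS) <your slow customer-facing query here>;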