Re: Slow join over three tables

Tim van der Linden <tim@xxxxxxxxx> · Fri, 29 Apr 2016 09:19:11 +0900

On Thu, 28 Apr 2016 16:38:53 +0200
Alban Hertroys <haramrae@xxxxxxxxx> wrote:

Hi Alban

First off ... thanks you for your time on this so far and giving me the educational smacks on the head :) I appreciate this a lot. 

> You're doing ~9 times as many index lookups. A slowdown of 6x of this
> part of the query seems rather reasonable.

Again, very true. But why 9 times as much?

> >  Planning time: 16.438 ms
> >  Execution time: 663.583 ms
> >
> > A total of 660 ms instead of the previous 120 ms. The amount of rows returned are 4076, instead of the 448 rows of the previous query. Could this be the cause of the slow down? Or is it the fact that the integers in the ANY clause are further apart from one another and thus make for longer index searches?
> 
> Technically probably both, but the index node distance (assuming a
> btree index) should be insignificant.

It is indeed a b-tree index.

> The sorting of found candidate rows on rid probably contributes the
> most. I forgot how quicksort scales to the number of items to sort,
> probably something like 2log(n) or something like that.

Looked this up and you are right, a quicksort (in C) is of nlog(n) complexity ... and if I'm right this is "rather" linear. Less sorting is thus the message :)

> That said, you didn't happen to perform these tests (assuming >1)
> during a busier period of the database server? I ask because later on
> you seem to be getting fast results again after some more index
> changes and I don't see cause for the difference.

No, these test are on a local development machine with nothing else going on but the database cluster.

> > 2. Adding extra columns ignores indexes
> > ...
> > - a.recovery
> > - a.severity
> > - d.reason
> > - d.effectiveness
> > - d.duration
> 
> I have to admit that I'm a bit behind with the current state of the
> art of PostgreSQL, but last time I checked, the database needed to
> look at each row in the result-set for transaction visibility
> information. I recall that there was (at some point) much discussion
> whether that visibility information could be added to indexes and that
> there were strong arguments against doing so. Hence, I doubt that
> those new index-only scans skip that step.

I see.

> Unless I'm wrong there, adding non-queried fields to the index is only
> going to affect your performance adversely. Relevant fields for
> indices are those used in joins, those regularly used in conditions
> (where-clauses) and those that are regularly sorted over.
> 
> Other fields are available from the candidate rows of the result set.
> Putting them in the index just results in storing redundant
> information.

Storing redundant information and making for bloated indexes which the planner might choose to skip if I understand correctly. Good to know.

> > ...
> > Still fast enough ... but I was wondering why the planner would not use the new index and instead fall back on the "report_drugs_drug_idx" single column index.
> 
> I'm actually a bit suspicious of those numbers; those are different
> drug id's than those from the first explain too.

You are completely right, this was the wrong plan ... the correct plan is:

 Sort  (cost=31757.71..31765.48 rows=3107 width=76) (actual time=722.348..722.461 rows=4076 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 495kB
   ->  Nested Loop  (cost=26041.17..31577.48 rows=3107 width=76) (actual time=139.407..721.090 rows=4076 loops=1)
         ->  Merge Join  (cost=26040.61..27143.53 rows=3107 width=64) (actual time=139.396..170.446 rows=4076 loops=1)
               Merge Cond: (d.rid = a.rid)
               ->  Sort  (cost=16332.08..16671.61 rows=135810 width=35) (actual time=108.866..119.143 rows=135370 loops=1)
                     Sort Key: d.rid
                     Sort Method: quicksort  Memory: 13992kB
                     ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..4753.44 rows=135810 width=35) (actual time=0.038..53.758 rows=135532 loops=1)
                           Index Cond: (drug = ANY ('{9557,17848,17880,18223}'::integer[]))
               ->  Sort  (cost=9708.53..9904.94 rows=78565 width=29) (actual time=30.517..34.876 rows=77163 loops=1)
                     Sort Key: a.rid
                     Sort Method: quicksort  Memory: 6702kB
                     ->  Index Only Scan using report_adverses_adverse_rid_severity_recovery_idx on report_adverses a  (cost=0.56..3320.57 rows=78565 width=29) (actual time=1.005..16.135 rows=76972 loops=1)
                           Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
                           Heap Fetches: 0
         ->  Index Only Scan using reports_id_age_gender_created_idx on reports r  (cost=0.56..1.42 rows=1 width=20) (actual time=0.134..0.135 rows=1 loops=4076)
               Index Cond: (id = d.rid)
               Heap Fetches: 0
 Planning time: 29.415 ms
 Execution time: 723.545 ms

And this is now indeed much closer to the ~660 ms from before, it doesn't make much of a difference after all.

Cheers,
Tim

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general