Re: Slow join over three tables

David Rowley <david.rowley@xxxxxxxxxxxxxxx> · Wed, 27 Apr 2016 14:09:04 +1200

On 27 April 2016 at 11:27, Tim van der Linden <tim@xxxxxxxxx> wrote:
> Hi all
>
> I have asked this question in a somewhat different form on the DBA Stackexchange site, but without much luck (https://dba.stackexchange.com/questions/136423/postgresql-slow-join-on-three-large-tables). So I apologize for possible double posting, but I hope this might get a better response on the mailing list.
>
> I'm joining three fairly large tables together, and it is slow. The tables are:
>
> - "reports": 6 million rows
> - "report_drugs": 20 million rows
> - "report_adverses": 20 million rows
>
> The table "reports" holds main report data and has a primary key column "id". The other two tables have a foreign key to that table with "rid". It are those columns that I use to join them together.
>
> All tables have indexes on the "id"/"rid" columns and on the "drug"/"adverse" columns.
>
> The query:
>
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM reports r
> JOIN report_drugs d ON d.rid = r.id
> JOIN report_adverses a ON a.rid = r.id
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;
>
> The plan:
>
> Sort  (cost=105773.63..105774.46 rows=333 width=76) (actual time=5143.162..5143.185 rows=448 loops=1)
>    Sort Key: r.created
>    Sort Method: quicksort  Memory: 60kB
>    ->  Nested Loop  (cost=1.31..105759.68 rows=333 width=76) (actual time=54.784..5142.872 rows=448 loops=1)
>      Join Filter: (d.rid = a.rid)
>      ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual time=0.822..2038.952 rows=14199 loops=1)
>          ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 loops=1)
>                Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
>          ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
>                Index Cond: (id = d.rid)
>      ->  Index Scan using report_adverses_rid_idx on report_adverses a  (cost=0.44..0.78 rows=1 width=12) (actual time=0.218..0.218 rows=0 loops=14199)
>            Index Cond: (rid = r.id)
>            Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
>            Rows Removed by Filter: 5
> Planning time: 13.994 ms
> Execution time: 5143.235 ms
>
> This takes well over 5 seconds, which to me, feels much too slow.
> If I query each table directly with the same conditions, thus:
>
> SELECT reason
> FROM report_drugs
> WHERE drug = ANY (ARRAY[359, 360, 361, 362, 363]);
>
> I get:
>
> Index Scan using report_drugs_drug_idx on report_drugs  (cost=0.44..500.28 rows=14005 width=27) (actual time=0.621..4.510 rows=14199 loops=1)
>   Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
> Planning time: 6.939 ms
> Execution time: 4.759 ms
>
> Under 5 ms. The same goes for querying the "adverse" column in the "report_adverses" table: under 20 ms.

I'm not sure why you're comparing this to the join plan above. They're
very different, I can only imagine that it' because you've not quite
understood what the EXPLAIN output is saying.

>      ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual time=0.822..2038.952 rows=14199 loops=1)
>          ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 loops=1)
>                Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
>          ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
>                Index Cond: (id = d.rid)

This is a parameterised nested loop. This means that the inner side of
the loop (reports), is parameterised by the outerside of the loop, you
can see the d.rid in the Index Cond. This means that the
reports_id_key index is looked up 14199 times. You can see from the
plan that the nested loop here took 2038.952 milliseconds to complete,
so about 0.144 ms per lookup, not too bad, right?

The alternative to a parameterised nested loop plan is that a Hash
Join plan is generated. If you imagine what would happen here, likely
the matching report_drugs records would be hashed, then the outer side
of the join would then perform a SeqScan over the entire 6 million
reports records... probably not better. You can test this yourself
with; SET enable_nestloop = 0; you might need to SET enable_mergejoin
= 0; too. I imagine it'll be slower.

Likely the only way to speed this up would be to create indexes;

create index on reports (id, age, gender, created);

the above might allow an index only scan, which should speed up the
nested loop a bit. This will only be useful if you're never going to
need other report columns in the SELECT list.

and perhaps;

create index on report_adverses (rid, adverse);

this might speedup the 2nd join a bit.

On the other hand if you do find that SET enable_nestloop =0; to be
faster, then you may want to tweak some costs to encourage the planner
to choose that plan.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general