Hi all,

I have asked this question in a somewhat different form on the DBA Stack Exchange site, but without much luck (https://dba.stackexchange.com/questions/136423/postgresql-slow-join-on-three-large-tables), so I apologize for the possible double post. I hope it gets a better response here on the mailing list.

I'm joining three fairly large tables together, and it is slow. The tables are:

- "reports": 6 million rows
- "report_drugs": 20 million rows
- "report_adverses": 20 million rows

The table "reports" holds the main report data and has a primary key column "id". The other two tables reference it through a foreign key column "rid", and those are the columns I use to join them. All tables have indexes on the "id"/"rid" columns as well as on the "drug" and "adverse" columns.

The query:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
  AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
ORDER BY r.created;

The plan:

Sort  (cost=105773.63..105774.46 rows=333 width=76) (actual time=5143.162..5143.185 rows=448 loops=1)
  Sort Key: r.created
  Sort Method: quicksort  Memory: 60kB
  ->  Nested Loop  (cost=1.31..105759.68 rows=333 width=76) (actual time=54.784..5142.872 rows=448 loops=1)
        Join Filter: (d.rid = a.rid)
        ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual time=0.822..2038.952 rows=14199 loops=1)
              ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 loops=1)
                    Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
              ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
                    Index Cond: (id = d.rid)
        ->  Index Scan using report_adverses_rid_idx on report_adverses a  (cost=0.44..0.78 rows=1 width=12) (actual time=0.218..0.218 rows=0 loops=14199)
              Index Cond: (rid = r.id)
              Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
              Rows Removed by Filter: 5
Planning time: 13.994 ms
Execution time: 5143.235 ms

This takes well over 5 seconds, which feels much too slow to me. If I query each table directly with the same conditions, e.g.:

SELECT reason
FROM report_drugs
WHERE drug = ANY (ARRAY[359, 360, 361, 362, 363]);

I get:

Index Scan using report_drugs_drug_idx on report_drugs  (cost=0.44..500.28 rows=14005 width=27) (actual time=0.621..4.510 rows=14199 loops=1)
  Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
Planning time: 6.939 ms
Execution time: 4.759 ms

Under 5 ms. The same goes for querying the "adverse" column in the "report_adverses" table: under 20 ms. This tells me that the join itself is the major performance bottleneck.

I'm running the cluster from an SSD drive, as a traditional HDD could not even manage the query in under 5 minutes. The system has 24 GB of memory in total, runs on Debian and uses a 4 GHz 8-core i7-4790 processor.

Some relevant postgresql.conf settings:

- shared_buffers = 4GB
- work_mem = 64MB
- maintenance_work_mem = 1GB
- checkpoint_segments = 50
- checkpoint_completion_target = 0.9
- autovacuum = on

Is there something I am missing here? Any help on making this join faster is much appreciated.

Cheers,
Tim

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general