On 28 April 2016 at 08:36, Tim van der Linden <tim@xxxxxxxxx> wrote: > On Wed, 27 Apr 2016 13:48:06 +0200 > Alban Hertroys <haramrae@xxxxxxxxx> wrote: > >> In this case, you're using the values in adverse to filter relevant rid's for the FK join, so you might be better off with the inverse of above index: >> create index on report_adverses (adverse, rid); >> create index on report_drugs (drug, rid); > > Hmmmm, like I reported yesterday this achieved a huge performance boost. > > However, two (new) things I like to touch on while further experimenting with this query: > > 1. Altering Drug IDs or Adverse names effects the executing time negatively. > > In this example altering the drug IDs I search for makes the query 6 times slower again: > > SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug > FROM reports r > JOIN report_drugs d ON d.rid = r.id > JOIN report_adverses a ON a.rid = r.id > WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain']) > AND d.drug = ANY (ARRAY[9557, 17848, 17880, 18223]) ORDER BY r.created; > > Different drug ID ([9557, 17848, 17880, 18223]), but the rest is the same. Query plan: > > Sort (cost=31409.71..31417.48 rows=3107 width=41) (actual time=662.707..662.819 rows=4076 loops=1) > Sort Key: r.created > Sort Method: quicksort Memory: 415kB > -> Nested Loop (cost=25693.17..31229.48 rows=3107 width=41) (actual time=71.748..661.743 rows=4076 loops=1) > -> Merge Join (cost=25692.61..26795.53 rows=3107 width=29) (actual time=70.841..97.094 rows=4076 loops=1) > Merge Cond: (d.rid = a.rid) > -> Sort (cost=16332.08..16671.61 rows=135810 width=8) (actual time=48.946..58.623 rows=135370 loops=1) > Sort Key: d.rid > Sort Method: quicksort Memory: 12498kB > -> Index Scan using report_drugs_drug_idx on report_drugs d (cost=0.44..4753.44 rows=135810 width=8) (actual time=0.681..28.441 rows=135532 loops=1) > Index Cond: (drug = ANY ('{9557,17848,17880,18223}'::integer[])) You're doing ~9 times as many index lookups. A slowdown of 6x of this part of the query seems rather reasonable. > Planning time: 16.438 ms > Execution time: 663.583 ms > > A total of 660 ms instead of the previous 120 ms. The amount of rows returned are 4076, instead of the 448 rows of the previous query. Could this be the cause of the slow down? Or is it the fact that the integers in the ANY clause are further apart from one another and thus make for longer index searches? Technically probably both, but the index node distance (assuming a btree index) should be insignificant. The sorting of found candidate rows on rid probably contributes the most. I forgot how quicksort scales to the number of items to sort, probably something like 2log(n) or something like that. That said, you didn't happen to perform these tests (assuming >1) during a busier period of the database server? I ask because later on you seem to be getting fast results again after some more index changes and I don't see cause for the difference. > The same happens with changing the adverses, some adverses come back in ~120 ms, others can take up to one second. Likely the same issue > 2. Adding extra columns ignores indexes > > This one was kind of expected. Adding extra columns from both "report_drugs" and "report_adverses" slows the lot down again. Extra columns added: > > - a.recovery > - a.severity > - d.reason > - d.effectiveness > - d.duration I have to admit that I'm a bit behind with the current state of the art of PostgreSQL, but last time I checked, the database needed to look at each row in the result-set for transaction visibility information. I recall that there was (at some point) much discussion whether that visibility information could be added to indexes and that there were strong arguments against doing so. Hence, I doubt that those new index-only scans skip that step. Unless I'm wrong there, adding non-queried fields to the index is only going to affect your performance adversely. Relevant fields for indices are those used in joins, those regularly used in conditions (where-clauses) and those that are regularly sorted over. Other fields are available from the candidate rows of the result set. Putting them in the index just results in storing redundant information. That's how it always has been and I'm pretty sure that's still the case. > As I would expect, adding these columns would make the previous multi-column indexes useless. And they indeed were not used anymore. What adding those fields to the index did was drive up the cost of using that index. It requires more memory to cache, for example, and this will be evicted from caches sooner than a smaller index. I suspect the query planner also noticed that and chose for the smaller index as that indexes usage costs are now lower than those of your (unnecessarily) large index. > Still fast enough ... but I was wondering why the planner would not use the new index and instead fall back on the "report_drugs_drug_idx" single column index. I'm actually a bit suspicious of those numbers; those are different drug id's than those from the first explain too. -- If you can't see the forest for the trees, Cut the trees and you'll see there is no forest. -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general