
Re: Poor performance on 9.4.4 when matching multiple columns from left side of LATERAL join

On 14 November 2015 at 19:25, Steven Grimm <sgrimm@xxxxxxxxxxxxxx> wrote:

Execution plan for the IN version followed by the = version (for just one of the IDs):

---------------------------------------------
Nested Loop  (cost=5.39..8107.18 rows=285 width=4) (actual time=1.230..6456.567 rows=4499 loops=1)
   Join Filter: (settings.owner_id = ANY (ARRAY[mid.id1, mid.id2, mid.id3]))
   Rows Removed by Join Filter: 22495501
   ->  Seq Scan on multi_id mid  (cost=0.00..78.00 rows=5000 width=12) (actual time=0.010..1.385 rows=5000 loops=1)
   ->  Materialize  (cost=5.39..310.66 rows=95 width=4) (actual time=0.000..0.263 rows=4500 loops=5000)
         ->  Bitmap Heap Scan on settings  (cost=5.39..310.19 rows=95 width=4) (actual time=1.207..3.210 rows=4500 loops=1)
               Recheck Cond: ((setting_id = 1) AND (setting_value = 'common_1'::text))
               Heap Blocks: exact=1405
               ->  Bitmap Index Scan on settings_idx_setting_id  (cost=0.00..5.37 rows=95 width=0) (actual time=0.930..0.930 rows=4500 loops=1)
                     Index Cond: ((setting_id = 1) AND (setting_value = 'common_1'::text))
 Planning time: 0.178 ms
 Execution time: 6456.897 ms


 Hash Join  (cost=145.98..472.93 rows=103 width=4) (actual time=2.677..6.890 rows=4500 loops=1)
   Hash Cond: (settings.owner_id = mid.id1)
   ->  Bitmap Heap Scan on settings  (cost=5.48..330.50 rows=103 width=4) (actual time=1.194..3.477 rows=4500 loops=1)
         Recheck Cond: ((setting_id = 1) AND (setting_value = 'common_1'::text))
         Heap Blocks: exact=1405
         ->  Bitmap Index Scan on settings_idx_setting_id  (cost=0.00..5.45 rows=103 width=0) (actual time=0.854..0.854 rows=4500 loops=1)
               Index Cond: ((setting_id = 1) AND (setting_value = 'common_1'::text))
   ->  Hash  (cost=78.00..78.00 rows=5000 width=4) (actual time=1.463..1.463 rows=5000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 176kB
         ->  Seq Scan on multi_id mid  (cost=0.00..78.00 rows=5000 width=4) (actual time=0.007..0.717 rows=5000 loops=1)
 Planning time: 0.311 ms
 Execution time: 7.166 ms
---------------------------------------------

What am I doing wrong in the IN version of the query, if anything?


In order to allow hash or merge joins, there must be at least one common equality expression in the join condition. Your IN query does not have one.

Notice that with your IN() query, the IN() clause is evaluated in the nested loop's join filter:

Join Filter: (settings.owner_id = ANY (ARRAY[mid.id1, mid.id2, mid.id3]))

If you think of what a nested loop is: for each outer row, it must scan the inner rows until it finds a match. Since this is a semi join, it can stop as soon as the first match is found and move on to the next outer row, repeating until there are no outer rows left. This is why the IN version is so slow. In your case 22495501 + 4499 evaluations of that join condition were made, which is not unreasonable for 6.5 seconds.
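The nested-loop semi join can be sketched in Python. The row counts and IDs below are made up for illustration (much smaller than your 5000 x 4500 case), but the shape matches the plan above: multi_id is the outer side, the materialized settings rows are the inner side, and the ANY(ARRAY[...]) filter is the membership test.

```python
# Sketch of a nested-loop semi join with an OR-style match condition,
# as the executor runs it for the IN() query. Data is hypothetical.
def nested_loop_semi_join(outer_rows, inner_ids):
    """For each outer (id1, id2, id3) row, scan inner ids until the first match."""
    matches = []
    comparisons = 0
    for id1, id2, id3 in outer_rows:
        for owner_id in inner_ids:
            comparisons += 1
            if owner_id in (id1, id2, id3):  # the ANY(ARRAY[...]) join filter
                matches.append((id1, id2, id3))
                break                        # semi join: stop at first match
    return matches, comparisons

outer = [(i, i + 10000, i + 20000) for i in range(1000)]  # "multi_id" rows
inner = list(range(500))                                  # "settings" owner_ids
rows, cmps = nested_loop_semi_join(outer, inner)
# Every unmatched outer row costs a full scan of the inner side, which is
# where the tens of millions of filter evaluations in the plan come from.
```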

The problem is that the optimizer is unable to use a hash or merge join when the IN() condition is the join condition. The reason is that you're effectively asking to join on any of three conditions: settings.owner_id = mid.id1 OR settings.owner_id = mid.id2 OR settings.owner_id = mid.id3. If you think about how a hash join works, a single hash table is no good here: you effectively have three possible keys for the hash table, so the executor would have to build three tables to make that possible, but PostgreSQL only ever builds one. As you may know, a hash table is a very efficient data structure for key-value lookups, but there can only be a single key. Merge join has the same problem, because the input can only have a single sort order.
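The single-key limitation is easy to see in a sketch. Below, a toy hash join (hypothetical data, not PostgreSQL's implementation) builds its table keyed on one column, say mid.id1, then probes with settings.owner_id:

```python
# Sketch of why a hash join needs a single equality key: the build phase
# commits to ONE key column, and the probe phase can only look up by it.
def hash_join(build_rows, probe_keys):
    table = {}
    for row in build_rows:                     # build phase: key on row[0] (id1)
        table.setdefault(row[0], []).append(row)
    out = []
    for key in probe_keys:                     # probe phase: O(1) lookup per key
        out.extend(table.get(key, []))
    return out

# A table keyed on id1 answers owner_id = id1 instantly, but it cannot
# answer owner_id = id2 or owner_id = id3; those would need two further
# tables keyed on the other columns, which the executor never builds.
mids = [(10, 20, 30), (11, 21, 31)]            # hypothetical (id1, id2, id3) rows
matched = hash_join(mids, [11, 21])            # 21 would only match via id2
```

Here the probe for 21 finds nothing, even though 21 appears in an id2 column, which is exactly the match the OR condition would require.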

You might think all this sounds terrible, but the main reason you're hitting this issue is the design of the multi_id table. If you had a single ID column, perhaps another column to store the 1, 2, 3 (the need for which is not quite clear to me), and perhaps a PK referencing some other table, you could simply write the query as:

select id from multi_id where id in(select owner_id from settings where  setting_id = 1 and setting_value = 'common_1');

Quite possibly this would produce a plan with a hash join and execute very quickly.

With your current schema the only option that I can see is to build a query such as:

select id1 from multi_id where id1 in (select owner_id from settings where  setting_id = 1 and setting_value = 'common_1')
union
select id2 from multi_id where id2 in (select owner_id from settings where  setting_id = 1 and setting_value = 'common_1')
union
select id3 from multi_id where id3 in (select owner_id from settings where  setting_id = 1 and setting_value = 'common_1')

but then you may end up building the same hash table three times, plus the overhead of the UNION removing duplicate rows. You could perhaps improve it slightly with a common table expression such as:

with cte as (select owner_id from settings where setting_id = 1 and setting_value = 'common_1')
select id1 from multi_id where id1 in (select owner_id from cte)
union
select id2 from multi_id where id2 in (select owner_id from cte)
union
select id3 from multi_id where id3 in (select owner_id from cte);

but you still have the union overhead.
 
--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
