Table with many NULLS for indexed column yields strange query plan

greigwise <greigwise@xxxxxxxxxxx> · Thu, 5 Mar 2020 18:08:53 -0700 (MST)

I have a query like this:

SELECT  "table1".* FROM "table1"
INNER JOIN "table2" ON "table2"."table1_id" = "table1"."id"
INNER JOIN "table3" ON "table3"."id" = "table2"."table3_id" WHERE
"table3"."number" = 'xxxx'
AND ("table2"."type") IN ('Standard') ;

table2 has a large number of NULLS in the column table3_id.  There is an
index on this column.  Here is the result of explain analyze:

Merge Join  (cost=1001.20..4076.67 rows=17278 width=167) (actual
time=284.918..300.167 rows=2244 loops=1)
   Merge Cond: (table2.table3_id = table3.id)
   ->  Gather Merge  (cost=1000.93..787825.78 rows=621995 width=175) (actual
time=5.786..283.269 rows=64397 loops=1)
         Workers Planned: 4
         Workers Launched: 4
         ->  Nested Loop  (cost=0.87..712740.12 rows=155499 width=175)
(actual time=0.091..102.708 rows=13107 loops=5)
               ->  Parallel Index Scan using index_table2_on_table3_id on
table2  (cost=0.43..489653.08 rows=155499 width=16) (actual
time=0.027..22.327 rows=13107 loops=5)
                     Filter: ((type)::text = 'Standard'::text)
               ->  Index Scan using table1_pk on table1  (cost=0.44..1.43
rows=1 width=167) (actual time=0.005..0.005 rows=1 loops=65535)
                     Index Cond: (id = table2.table1_id)
   ->  Index Scan using table3_pkey on table3  (cost=0.27..53.40 rows=1
width=8) (actual time=0.041..0.048 rows=1 loops=1)
         Filter: ((number)::text = 'xxxx'::text)
         Rows Removed by Filter: 35
 Planning time: 0.450 ms
 Execution time: 310.230 ms

You can see the row estimate there is way off on the Parallel Index Scan.  
I suspect that this is because it's including the rows with null in the
selectivity estimate even though the table3_id can't possibly be null here
due to the inner join.  

If I modify the query like this:

SELECT  "table1".* FROM "table1"
INNER JOIN "table2" ON "table2"."table1_id" = "table1"."id"
INNER JOIN "table3" ON "table3"."id" = "table2"."table3_id" WHERE
"table3"."number" = 'xxxx'
AND ("table2"."type") IN ('Standard') and table3_id is not null;

Just adding in table3_id is not null at the end there, I get a much better
plan.

Nested Loop  (cost=1.14..290.04 rows=66 width=167) (actual
time=0.058..11.258 rows=2244 loops=1)
   ->  Nested Loop  (cost=0.70..64.46 rows=66 width=8) (actual
time=0.049..2.873 rows=2244 loops=1)
         ->  Index Scan using table3_pkey on table3  (cost=0.27..53.40
rows=1 width=8) (actual time=0.030..0.035 rows=1 loops=1)
               Filter: ((number)::text = 'xxxx'::text)
               Rows Removed by Filter: 35
         ->  Index Scan using index_table2_on_table3_id on table2 
(cost=0.43..11.05 rows=1 width=16) (actual time=0.017..2.102 rows=2244
loops=1)
               Index Cond: ((table3_id = table3.id) AND (table3_id IS NOT
NULL))
               Filter: ((type)::text = 'Standard'::text)
   ->  Index Scan using table1_pk on table1  (cost=0.44..3.42 rows=1
width=167) (actual time=0.003..0.003 rows=1 loops=2244)
         Index Cond: (id = table2. id)
 Planning time: 0.403 ms
 Execution time: 11.672 ms

Can I do anything statistics wise so that I get a better plan here or do I
have to modify the query.  It seems kinda hacky that I would have to specify
is not null on that column since like I said it can't possibly be null.

Thanks,

Greig Wise

--
Sent from: https://www.postgresql-archive.org/PostgreSQL-general-f1843780.html