Re: Inaccurate Rows estimate for "Bitmap And" causes Planner to choose wrong join

Jeff Janes <jeff.janes@xxxxxxxxx> · Wed, 6 May 2020 14:24:55 -0400

On Wed, May 6, 2020 at 12:20 PM Steve Pritchard <steve.pritchard@xxxxxxx> wrote:
Version: Postgres 9.6.3 production system (but also tested on Postgres 12)
For my query the Planner is sometimes choosing an execution plan that uses "Bitmap And" (depending on the parameters):

->  Bitmap Heap Scan on observation  (cost=484.92..488.93 rows=1 width=203) (actual time=233.129..330.886 rows=15636 loops=1)
  Recheck Cond: (((user_id)::text = 'USER123'::text) AND ((loc_id)::text = ANY ('{LOC12345678}'::text[])))

If you change " = ANY(array_of_one)" to " = scalar", does that change anything?  You might be able to fix this (in v12) using CREATE STATISTICS, but I don't know if that mechanism can see through the ANY(array_of_one) wrapper.

Note that in cases where the Planner selects a single Index Scan for this query (with different parameters), the Planner makes an accurate estimate of the number of rows and then makes sensible selections of joins (i.e. quick).
i.e. the issue seems to be with the "Bitmap And".

I don't know if this nitpick matters, but I don't think that that is how the planner works.  The row estimates work from the top down, not the bottom up.  The row estimate of 1 is based on what conditions the bitmap heap scan implements, it is not arrived at by combining the estimates from the index scans below it.  If it were to change to a different type of node but implemented the same conditions, I think it would have the same row estimate.  

I don't have an index with both user_id & loc_id, as this is one of several different combinations that can arise (it would require quite a few indexes to cover all the possible combinations). 

Are you actually experiencing problems with those other combinations as well?  If not, I wouldn't worry about solving hypothetical problems.  If those other combinations are actually problems and you go with CREATE STATISTICS, then you would have to be creating a lot of different statistics.  That would still be ugly, but at least the overhead for statistics is lower than for indexes.

However if I did have such an index, the planner would presumably be able to use the statistics for user_id and loc_id to estimate the number of rows.

Indexes on physical columns do not have statistics, so making that index would not help with the estimation.  (Expressional indexes do have statistics, but I don't see that helping you here).  

So while this node would execute faster with that index, it would still be kicking the unshown nested loop left join 15,636 times when it thinks it will be doing it once, and so would still be slow.  The most robust solution might be to make the outer part of that nested loop left join faster, so that your system would be more tolerant of statistics problems.  

So why can't it make an accurate estimate of the rows with a "Bitmap And" & "

Bitmap Heap Scan"? (as above)

In the absence of custom statistics, it assumes the selectivities of user_id = 'USER123', of loc_id = ANY ('{LOC12345678}'::text[]), and of taxa = 'Birds' are all independent of each other and can be multiplied to arrive at the overall selectivity.  But clearly that is not the case.  Bird watchers mostly watch near where they live, not in random other places.

Cheers,

Jeff