After some further investigation I think I might have actually got this the wrong way round. I now strongly suspect that what I am seeing is related to the presence of histogram_bounds for Table A (possibly a bug? or me being silly).
Looking into the code (https://github.com/postgres/postgres/blob/8c1144ba73478b818d9cebe8ecd64a14b7d45bde/src/backend/utils/adt/selfuncs.c):
- The matchingsel() function calls the generic_restriction_selectivity() function
- generic_restriction_selectivity() returns the default value (0.01) for data types that Postgres doesn't collect standard MCV and/or histogram statistics for. I think this is what happens with Table B, which (quite correctly) does not have these statistics in pg_stats.
- Table A does have histogram_bounds for some reason, so I suspect it gets further through the function, operating on garbage, down to these lines:
if (selec < 0.0001)
selec = 0.0001;
which ultimately ends up getting returned.
Therefore, Table B gets the 'correct' default value of 0.01 (which is actually 1%, silly me), while Table A gets an erroneous value of 0.0001 (0.01%).
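As a sanity check, those selectivities line up with the row estimates from my first message (give or take rounding, since the planner works from its own row-count estimate):
- Table A: 1,611,752 * 0.0001 ≈ 161 (planner estimated 159)
- Table B: 1,656,110 * 0.01 ≈ 16,561 (planner estimated 16,566)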
So my question now is: why is Postgres creating histogram_bounds on some jsonb columns but not others? I've looked at my CREATE TABLE scripts and they are identical. We have the database deployed in another environment and the same issue occurs there too.
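For reference, the difference is visible directly in pg_stats with a query along these lines (the jsonb column is named 'properties' on both tables):

SELECT tablename, attname,
       most_common_vals IS NOT NULL AS has_mcv,
       histogram_bounds IS NOT NULL AS has_histogram,
       correlation
FROM pg_stats
WHERE attname = 'properties'
ORDER BY tablename;

For Table A, histogram_bounds and correlation come back populated; for Table B they are NULL.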
Any help would be appreciated :)
Thanks
Joel
On Wed, Oct 20, 2021 at 3:56 PM Joel Perren <joel.perren@xxxxxxxxx> wrote:
Hi all

I have a Postgres/PostGIS database with two separate database tables each of which has the following column structure:
- identifier (text)
- geometry (geometryz, 27700)
- properties (jsonb)

I have created a GIN index (jsonb_path_ops) over the properties column on both tables.

As I understand it, Postgres is unable to calculate statistics on the contents of JSONB columns and so should use hard-coded estimates of selectivity when planning queries. However, despite both tables having identical structures, similar row counts, identical indexes, and both having been recently ANALYZEd, I am finding that the query planner is producing different row count estimates between the tables.

To expand: both tables have a 'description' field within their JSONB columns. I am executing the following query:

EXPLAIN
SELECT identifier, geometry, properties
FROM table
WHERE properties @@ '$.description == "test"'
ORDER BY identifier;

I am expecting that the selectivity value used for such queries should be 0.010 which is the default result of the matchingsel selectivity function which the @@ operator uses when operating on jsonb.

For both tables, the planner opts for a Bitmap Index Scan -> Sort -> Gather Merge. However, the estimated number of rows returned from the Index Scan node differs substantially.
- Table A (1,611,752 rows): 159 estimated rows (this is roughly 0.01% and makes sense);
- Table B (1,656,110 rows): 16566 estimated rows (roughly 1% - why??)

This difference is causing the planner to come up with some strange plans for queries on Table B which it does not do for Table A. I know that both estimates are just that and that neither is 'better', but I would really prefer it to be consistent and use a selectivity of 0.01 rather than 0.1 as it should be doing.

I know this is getting long now, so apologies. But one thing I did notice and wondered if it could be related is that the pg_stats table for Table A seems to have values for histogram_bounds and correlation, but nulls for these values for Table B.

Any ideas?

Thanks
Joel