Re: row estimate for partial index

Harmen <harmen@xxxxxxxxx> · Mon, 16 Jan 2023 09:55:23 +0100

On Sat, Jan 14, 2023 at 11:23:07AM -0500, Tom Lane wrote:
> Harmen <harmen@xxxxxxxxx> writes:
> > Works well enough. However, we now have an org_id which has > 10% of the rows,
> > but only a handful rows where "deleted is null" matches (so the org has a lot
> > of "deleted" contacts). The planner doesn't like this and it falls back to a
> > full table scan for the above query.
> 
> > I've added a dedicated index just for that org_id, to see if that helps:
> >     "org123" btree (id) WHERE deleted IS NULL AND org_id = 123
> > The planner seems to use it now, however the row estimate is way off:
> 
> Yeah, so that indicates that it isn't producing a good selectivity
> estimate for the combination of those two conditions: it will assume
> the org_id and deleted columns are independent, which per your statements
> they are not.
> 
> If you are running a reasonably recent PG version you should be able to
> fix that by setting up "extended statistics" on that pair of columns:
> 
> https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED
> 
> (I might be wrong, but I think that will help even when one of
> the troublesome conditions is a null-check.  If it doesn't, then
> we have something to improve there ...)

Thanks for your explanation, Tom.
I've setup a local test scenario, where I then add a "dependencies" stat, but
that doesn't give a better plan, unfortunately.

This is my test table (I use a boolean field for "deleted" to keep this test
case as simple as possible. In my real case this is a "timestamptz null"
field):

  DROP table if exists contactsbool;
  CREATE table contactsbool (id int not null, org_id int not null, deleted boolean not null, firstname text);
  CREATE index contactsbool_orgs on contactsbool (org_id, id) where not deleted;

Testdata has a very low number of "orgs", and one org has almost only deleted
contacts:

  WITH ids as (select * from generate_series(0, 10000000)) insert into contactsbool select ids.generate_series, mod(ids.generate_series, 7), false, 'hello world' from ids;
  UPDATE contactsbool set deleted = true where id > 100 and org_id = 5;
  ANALYZE contactsbool;

Now the new stats:

  CREATE STATISTICS dist4 (ndistinct) ON deleted, org_id FROM contactsbool;
  CREATE STATISTICS dist4b (ndistinct) ON org_id, deleted FROM contactsbool;
  ANALYZE contactsbool;

harmen=> explain (analyze) select id, firstname from contactsbool where org_id = 5 and not deleted order by id;
                                                                QUERY PLAN                                                                
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Gather Merge  (cost=181983.91..299104.42 rows=1003820 width=16) (actual time=448.244..454.770 rows=14 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=180983.88..182238.66 rows=501910 width=16) (actual time=413.761..413.762 rows=5 loops=3)
         Sort Key: id
         Sort Method: quicksort  Memory: 25kB
         Worker 0:  Sort Method: quicksort  Memory: 25kB
         Worker 1:  Sort Method: quicksort  Memory: 25kB
         ->  Parallel Seq Scan on contactsbool  (cost=0.00..124881.86 rows=501910 width=16) (actual time=267.318..413.673 rows=5 loops=3)
               Filter: ((NOT deleted) AND (org_id = 5))
               Rows Removed by Filter: 3333329
 Planning Time: 0.565 ms
 JIT:
   Functions: 12
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 2.444 ms, Inlining 0.000 ms, Optimization 1.163 ms,
Emission 13.288 ms, Total 16.895 ms
 Execution Time: 456.498 ms
(17 rows)

The "rows=501910" is what I don't expect. I expect/want/hope the plan to use
the contactsbool_orgs index.
(If I really (hard) delete the "deleted" contacts everything works perfectly
for all orgs.)

Any ideas?
Thanks again,
Harmen