On Mon, Jan 16, 2023 at 09:59:38AM -0500, Tom Lane wrote:
> Harmen <harmen@xxxxxxxxx> writes:
> > On Sat, Jan 14, 2023 at 11:23:07AM -0500, Tom Lane wrote:
> >> If you are running a reasonably recent PG version you should be able to
> >> fix that by setting up "extended statistics" on that pair of columns:
>
> > CREATE STATISTICS dist4 (ndistinct) ON deleted, org_id FROM contactsbool;
> > CREATE STATISTICS dist4b (ndistinct) ON org_id, deleted FROM contactsbool;

Hello Tom,

thanks again for your help, it's much appreciated.

> 1. ndistinct is not the correct stats type for this problem.
> (I think dependencies is, but generally speaking, it's not worth
> trying to be smarter than the system about which ones you need.
> Just create 'em all.)

I don't understand the "dependencies" stat here. The documentation gives zip
codes with city names, and month/day columns, as examples; those I follow.
But in this case I generally select a single "org_id", which can have an
arbitrary number of "deleted" rows. I don't understand how having the org_id
available gives a better prediction than taking the plain ratio of all
"deleted" rows in the whole table.

> 2. Per the CREATE STATISTICS man page, the order of the columns is
> not significant, so you're just doubling the amount of work for
> ANALYZE without gaining anything.

Noted. Are these statistics expensive to maintain, or do you expect them to
be basically noise?

> I think you will find that
>
> CREATE STATISTICS stats1 ON deleted, org_id FROM contactsbool;
>
> is enough to fix this.  It improved the estimate for me in
> v14 and HEAD, anyway.

It does indeed for me as well! I'm not 100% sure whether that's only because
my test data is too simple, but I'll keep testing.

Unfortunately, as you already remarked might be the case in your original
reply, I don't get the same good results if I change the "deleted" column
from a "boolean not null" to a "timestamptz null". Then it's back to (pg15.1):

  ->  Parallel Seq Scan on contacts  (cost=0.00..126284.02 rows=513913 width=16) (actual time=254.677..393.448 rows=5 loops=3)
        Filter: ((deleted IS NULL) AND (org_id = 5))
        Rows Removed by Filter: 3333329

Test table, in case that's convenient to have:

DROP table if exists contacts;
CREATE table contacts (id int not null, org_id int not null, deleted timestamptz null, firstname text);
CREATE index contacts_orgs on contacts (org_id, id) where deleted is null;
WITH ids as (select * from generate_series(0, 10000000))
    insert into contacts select ids.generate_series, mod(ids.generate_series, 7), null, 'hello world' from ids;
UPDATE contacts set deleted = now() where id > 100 and org_id = 5;
CREATE STATISTICS dist4 (ndistinct) ON deleted, org_id FROM contacts;
CREATE STATISTICS dist5 (dependencies) ON deleted, org_id FROM contacts;
ANALYZE contacts;
explain (analyze) select id, firstname from contacts where org_id = 5 and deleted is null order by id;

Again, this is only a problem when a single org_id covers such a large
percentage of the table that the planner switches to sequential scans, which
can only ever happen for a few org_ids. On production I've added an index
with the problematic org_id in the "where" part, and that avoids the problem
for now (milliseconds vs minutes), but it's not a great fix.

Thanks!
Harmen
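
For reference, the production workaround mentioned above would look roughly
like the partial index below, written against the test table rather than the
real schema; the index name and the hard-coded org_id = 5 are illustrative:

-- Partial index targeting the one org whose row count is large enough to
-- push the planner toward a sequential scan; "deleted is null" keeps the
-- index limited to live rows, matching the query's filter.
CREATE index contacts_org5_live on contacts (id)
    where org_id = 5 and deleted is null;

With the constant baked into the predicate the planner can use the index for
"where org_id = 5 and deleted is null order by id" regardless of how it
estimates that org's row count, at the cost of needing one such index per
problematic org.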