On Mon, Jan 16, 2023 at 09:59:38AM -0500, Tom Lane wrote:
> Harmen <harmen@xxxxxxxxx> writes:
> > On Sat, Jan 14, 2023 at 11:23:07AM -0500, Tom Lane wrote:
> >> If you are running a reasonably recent PG version you should be able to
> >> fix that by setting up "extended statistics" on that pair of columns:
>
> > CREATE STATISTICS dist4 (ndistinct) ON deleted, org_id FROM contactsbool;
> > CREATE STATISTICS dist4b (ndistinct) ON org_id, deleted FROM contactsbool;

Hello Tom,

thanks again for your help, it's much appreciated.

> 1. ndistinct is not the correct stats type for this problem.
> (I think dependencies is, but generally speaking, it's not worth
> trying to be smarter than the system about which ones you need.
> Just create 'em all.)

I don't understand the "dependencies" stat here. The documentation gives zip
codes with city names, and month/day columns, as examples; those I follow.
But in this case I generally select a single "org_id", which can have an
arbitrary number of "deleted" rows. I don't understand how having the org_id
available gives a better prediction than taking the plain ratio of all
"deleted" rows in the whole table.

> 2. Per the CREATE STATISTICS man page, the order of the columns is
> not significant, so you're just doubling the amount of work for
> ANALYZE without gaining anything.

Noted. Are these statistics expensive to maintain, or do you expect them to
be basically noise?

> I think you will find that
>
> CREATE STATISTICS stats1 ON deleted, org_id FROM contactsbool;
>
> is enough to fix this.  It improved the estimate for me in
> v14 and HEAD, anyway.

It does indeed for me as well! I'm not 100% sure whether that's only because
my test data is too simple, but I'll keep testing.

Unfortunately, as you already remarked might be the case in your original
reply, I don't get the same good results if I change the "deleted" column
from a "boolean not null" to a "timestamptz null". Then it's back to (pg15.1):

  ->  Parallel Seq Scan on contacts  (cost=0.00..126284.02 rows=513913 width=16) (actual time=254.677..393.448 rows=5 loops=3)
        Filter: ((deleted IS NULL) AND (org_id = 5))
        Rows Removed by Filter: 3333329

Test table, in case that's convenient to have:

DROP table if exists contacts;
CREATE table contacts (id int not null, org_id int not null, deleted timestamptz null, firstname text);
CREATE index contacts_orgs on contacts (org_id, id) where deleted is null;
WITH ids as (select * from generate_series(0, 10000000))
    insert into contacts select ids.generate_series, mod(ids.generate_series, 7), null, 'hello world' from ids;
UPDATE contacts set deleted = now() where id > 100 and org_id = 5;
CREATE STATISTICS dist4 (ndistinct) ON deleted, org_id FROM contacts;
CREATE STATISTICS dist5 (dependencies) ON deleted, org_id FROM contacts;
ANALYZE contacts;
explain (analyze) select id, firstname from contacts where org_id = 5 and deleted is null order by id;

Again, this is only a problem when a single org_id covers such a large
percentage of the table that the planner switches to sequential scans, which
can only ever happen for a few org_ids. On production I've added an index
with the problematic org_id in the "where" part, and that avoids the problem
for now (milliseconds vs minutes), but it's not a great fix.

Thanks!
Harmen
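
For reference, the production workaround mentioned above would look roughly
like the partial index below, written against the test table rather than the
real schema; the index name and the hard-coded org_id = 5 are illustrative:

-- Partial index targeting the one org whose row count is large enough to
-- push the planner toward a sequential scan; "deleted is null" keeps the
-- index limited to live rows, matching the query's filter.
CREATE index contacts_org5_live on contacts (id)
    where org_id = 5 and deleted is null;

With the constant baked into the predicate the planner can use the index for
"where org_id = 5 and deleted is null order by id" regardless of how it
estimates that org's row count, at the cost of needing one such index per
problematic org.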