On Sat, Jan 14, 2023 at 11:23:07AM -0500, Tom Lane wrote: > Harmen <harmen@xxxxxxxxx> writes: > > Works well enough. However, we now have an org_id which has > 10% of the rows, > > but only a handful rows where "deleted is null" matches (so the org has a lot > > of "deleted" contacts). The planner doesn't like this and it falls back to a > > full table scan for the above query. > > > I've added a dedicated index just for that org_id, to see if that helps: > > "org123" btree (id) WHERE deleted IS NULL AND org_id = 123 > > The planner seems to use it now, however the row estimate is way off: > > Yeah, so that indicates that it isn't producing a good selectivity > estimate for the combination of those two conditions: it will assume > the org_id and deleted columns are independent, which per your statements > they are not. > > If you are running a reasonably recent PG version you should be able to > fix that by setting up "extended statistics" on that pair of columns: > > https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED > > (I might be wrong, but I think that will help even when one of > the troublesome conditions is a null-check. If it doesn't, then > we have something to improve there ...) Thanks for your explanation, Tom. I've setup a local test scenario, where I then add a "dependencies" stat, but that doesn't give a better plan, unfortunately. This is my test table (I use a boolean field for "deleted" to keep this test case as simple as possible. In my real case this is a "timestamptz null" field): DROP table if exists contactsbool; CREATE table contactsbool (id int not null, org_id int not null, deleted boolean not null, firstname text); CREATE index contactsbool_orgs on contactsbool (org_id, id) where not deleted; Testdata has a very low number of "orgs", and one org has almost only deleted contacts: WITH ids as (select * from generate_series(0, 10000000)) insert into contactsbool select ids.generate_series, mod(ids.generate_series, 7), false, 'hello world' from ids; UPDATE contactsbool set deleted = true where id > 100 and org_id = 5; ANALYZE contactsbool; Now the new stats: CREATE STATISTICS dist4 (ndistinct) ON deleted, org_id FROM contactsbool; CREATE STATISTICS dist4b (ndistinct) ON org_id, deleted FROM contactsbool; ANALYZE contactsbool; harmen=> explain (analyze) select id, firstname from contactsbool where org_id = 5 and not deleted order by id; QUERY PLAN ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Gather Merge (cost=181983.91..299104.42 rows=1003820 width=16) (actual time=448.244..454.770 rows=14 loops=1) Workers Planned: 2 Workers Launched: 2 -> Sort (cost=180983.88..182238.66 rows=501910 width=16) (actual time=413.761..413.762 rows=5 loops=3) Sort Key: id Sort Method: quicksort Memory: 25kB Worker 0: Sort Method: quicksort Memory: 25kB Worker 1: Sort Method: quicksort Memory: 25kB -> Parallel Seq Scan on contactsbool (cost=0.00..124881.86 rows=501910 width=16) (actual time=267.318..413.673 rows=5 loops=3) Filter: ((NOT deleted) AND (org_id = 5)) Rows Removed by Filter: 3333329 Planning Time: 0.565 ms JIT: Functions: 12 Options: Inlining false, Optimization false, Expressions true, Deforming true Timing: Generation 2.444 ms, Inlining 0.000 ms, Optimization 1.163 ms, Emission 13.288 ms, Total 16.895 ms Execution Time: 456.498 ms (17 rows) The "rows=501910" is what I don't expect. I expect/want/hope the plan to use the contactsbool_orgs index. (If I really (hard) delete the "deleted" contacts everything works perfectly for all orgs.) Any ideas? Thanks again, Harmen