Re: Optimizing count(), but Explain estimates wildly off

Chema <chema@xxxxxxxxxxxxx> · Thu, 29 Feb 2024 17:15:42 -0600

> Reading around, seems many people are still using this 2005 snippet to obtain the

> row count estimate from Explain:

I recommend using FORMAT JSON and extracting the top row count from that.  It is

simpler and less error-prone.
Good tip, thanks Laurenze!

> Is this still the current best practice?  Any tips to increase precision?

> Currently it can estimate the actual number of rows for over or under a million,

> as seen on the sample queries (1,955,297 instead of 1,001,200; 162,080 instead

> of 1,292,010).

Looking at the samples you provided, I get the impression that the statistics for

the table are quite outdated.  That will affect the estimates.  Try running ANALYZE

and see if that improves the estimates.

No major changes after doing Analyze, and also Vacuum Analyze.  Seems something is seriously off.  I pimped my config thanks to Alvaro's prompting, set default statistics = 500 (suggested for warehouse dbs) but raised pertinent columns from 2,000 to 5,000 (will play with disabling JIT or raising cost later):

shared_buffers = 2GB                # ~0.25 * RAM, dedicated cache, hard allocation (requires restart)
effective_cache_size = 6GB          # 0.5-0.75 RAM (free -h: free + cache + shared_buffers)
work_mem = 128MB                    # RAM * 0.25 / max_connections.
maintenance_work_mem = 512MB
default_statistics_target = 500     # def 100, higher to make planner use indexes in big warehouse tables.
random_page_cost = 1.1              # Random reads in SSD cost almost as little as sequential ones

Analized again (1.5M samples instead of 600k):
"tenders": scanned 216632 of 216632 pages, containing 3815567 live rows and 0 dead rows; 1500000 rows in sample, 3815567 estimated total rows
"items": scanned 995023 of 995023 pages, containing 7865043 live rows and 0 dead rows; 1500000 rows in sample, 7865043 estimated total rows

but same deal:

-- After config pimp 1,959,657 instead of 1,001,200 45,341.654 ms
Gather  (cost=247031.70..1479393.82 rows=1959657 width=824) (actual time=8464.691..45257.435 rows=1001200 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Hash Join  (cost=246031.70..1282428.12 rows=816524 width=824) (actual time=8413.057..44614.153 rows=333733 loops=3)
        Hash Cond: (pricescope_items.tender_transaction_id = pricescope_tenders.transaction_id)
        ->  Parallel Seq Scan on pricescope_items  (cost=0.00..1027794.01 rows=3277101 width=522) (actual time=0.753..41654.507 rows=2621681 loops=3)
        ->  Parallel Hash  (cost=241080.20..241080.20 rows=396120 width=302) (actual time=995.247..995.250 rows=333733 loops=3)
              Buckets: 1048576  Batches: 1  Memory Usage: 219904kB
              ->  Parallel Bitmap Heap Scan on pricescope_tenders  (cost=17516.10..241080.20 rows=396120 width=302) (actual time=162.898..321.472 rows=333733 loops=3)
                    Recheck Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
                    Heap Blocks: exact=34722
                    ->  Bitmap Index Scan on pricescope_tenders_country_and_date_index  (cost=0.00..17278.43 rows=950688 width=0) (actual time=186.536..186.537 rows=1001200 loops=1)
                          Index Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
Planning Time: 11.310 ms
JIT:
  Functions: 33
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 8.608 ms, Inlining 213.375 ms, Optimization 557.351 ms, Emission 417.568 ms, Total 1196.902 ms
Execution Time: 45341.654 ms

BUT if I force the planner to ignore 'country' statistics:
-- Subselect country to hide constant from planner, so it doesn't use statistics
Explain Analyze
Select * from pricescope_tenders inner join pricescope_items on transaction_id = tender_transaction_id
	where country = (select 'Colombia')
		and "date" >= '2023-01-01' and "date" < '2024-01-01'
;

Then I get the same plan than if I filter for Mexico, with a similar run time:

-- Colombia in subselect 428,623 instead of 1,001,200 6674.860 ms
Gather  (cost=1001.00..570980.73 rows=428623 width=824) (actual time=166.785..6600.673 rows=1001200 loops=1)
  Workers Planned: 2
  Params Evaluated: $0
  Workers Launched: 2
  InitPlan 1 (returns $0)
    ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=166.031..166.033 rows=1 loops=1)
  ->  Nested Loop  (cost=0.99..527118.42 rows=178593 width=824) (actual time=200.511..5921.585 rows=333733 loops=3)
        ->  Parallel Index Scan using pricescope_tenders_country_and_date_index on pricescope_tenders  (cost=0.43..104391.64 rows=86641 width=302) (actual time=200.388..400.882 rows=333733 loops=3)
              Index Cond: ((country = $0) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
        ->  Index Scan using pricescope_items_tender_transaction_id_index on pricescope_items  (cost=0.56..4.83 rows=5 width=522) (actual time=0.016..0.016 rows=1 loops=1001200)
              Index Cond: (tender_transaction_id = pricescope_tenders.transaction_id)
Planning Time: 7.372 ms
JIT:
  Functions: 31
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 6.981 ms, Inlining 209.470 ms, Optimization 308.123 ms, Emission 248.176 ms, Total 772.750 ms
Execution Time: 6674.860 ms

So runtime is now decent; stats are still way off by -670k, tho I guess that's better than +1M.

1. Any tips to fix stats?
2. Or a better way of making the planner go for index scans for country?

Thanks again!