Is your transaction_id more or less monotonically increasing with the date? If so, something like the following can help:
with tenders_filtered as (
    select * from tenders
    where country = 'Mexico'
      and "date" >= '2023-01-01' and "date" < '2024-01-01'
)
Select * from tenders_filtered
inner join items on transaction_id = tender_transaction_id
where tender_transaction_id between (select min(transaction_id) from tenders_filtered)
                                and (select max(transaction_id) from tenders_filtered)
This assumes you have an index on items(tender_transaction_id) and that the transaction_id range selects a small subset of the table (say, less than 5%).
If your transaction_id is not monotonic, you can consider adding something monotonic, or even adding denormalized field(s) with country and/or date to your items table.
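For the denormalized route, a rough sketch (column and index names are just examples, and the backfill UPDATE can be heavy on a table that size):

alter table items add column country text;
alter table items add column "date" date;

update items i
set country = t.country, "date" = t."date"
from tenders t
where t.transaction_id = i.tender_transaction_id;

create index items_country_date_index on items (country, "date");

With those in place, the filter (and the count) on items could avoid the join entirely.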
Another option is to use a windowing function to get the count, e.g.
Select *,count(*) OVER () as cnt from tenders inner join items on transaction_id = tender_transaction_id
where country = 'Colombia'
and "date" >= '2023-01-01' and "date" < '2024-01-01'
This would at least save you from doing a second query just for the count.
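If the app pages through results with LIMIT/OFFSET (an assumption on my part; order by whatever column you actually sort on), the window count is computed before the LIMIT is applied, so it still reflects the full filtered row count, e.g.

Select *, count(*) OVER () as cnt
from tenders inner join items on transaction_id = tender_transaction_id
where country = 'Colombia'
and "date" >= '2023-01-01' and "date" < '2024-01-01'
order by transaction_id
limit 50 offset 0

and the first page already carries the total in cnt.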
Mon, 26 Feb 2024 at 16:26, Chema <chema@xxxxxxxxxxxxx> wrote:
Dear pgsqlers,

I'm trying to optimize simple queries on two tables (tenders & items) with a couple million records. Besides the resulting records, the app also displays the count of total results. Doing count() takes as much time as the other query (which can be 30+ secs), so it's an obvious target for optimization. I'm already caching count() results for the most common conditions (country & year) in a materialized table, which practically halves response time. The tables are updated sparingly, and only with bulk COPYs. Now I'm looking for ways to optimize queries with other conditions.

Reading around, it seems many people are still using this 2005 snippet to obtain the row count estimate from EXPLAIN:

CREATE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
    rec record;
    rows integer;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;
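For example, the function is called with the query text as a string (the query here is just an illustration; single quotes inside it need doubling):

select count_estimate('select * from tenders where country = ''Colombia'' and "date" >= ''2023-01-01'' and "date" < ''2024-01-01''');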
Is this still the current best practice? Any tips to increase precision? Currently it can over- or underestimate the actual number of rows by about a million, as seen in the sample queries (1,955,297 instead of 1,001,200; 162,080 instead of 1,292,010).

Any other tips to improve the query are welcome, of course. There's a big disparity between the two sample queries' plans even though only the filtered country changes. I already raised default_statistics_target up to 2k (the planner wasn't using indexes at all with low values). Gotta get it even higher? These are my custom settings:

shared_buffers = 256MB            # min 128kB
work_mem = 128MB                  # min 64kB
maintenance_work_mem = 254MB      # min 1MB
effective_cache_size = 2GB
default_statistics_target = 2000
random_page_cost = 1.0            # same scale as above

Sample query:
Explain Analyze
Select * from tenders inner join items on transaction_id = tender_transaction_id
where country = 'Colombia'
and "date" >= '2023-01-01' and "date" < '2024-01-01'QUERY PLAN Gather (cost=253837.99..1506524.32 rows=1955297 width=823) (actual time=51433.592..63239.809 rows=1001200 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Hash Join (cost=252837.99..1309994.62 rows=814707 width=823) (actual time=51361.920..61729.142 rows=333733 loops=3)
Hash Cond: (items.tender_transaction_id = tenders.transaction_id)
-> Parallel Seq Scan on items (cost=0.00..1048540.46 rows=3282346 width=522) (actual time=1.689..56887.108 rows=2621681 loops=3)
-> Parallel Hash (cost=247919.56..247919.56 rows=393475 width=301) (actual time=2137.473..2137.476 rows=333733 loops=3)
Buckets: 1048576 Batches: 1 Memory Usage: 219936kB
-> Parallel Bitmap Heap Scan on tenders (cost=16925.75..247919.56 rows=393475 width=301) (actual time=385.315..908.865 rows=333733 loops=3)
Recheck Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
Heap Blocks: exact=24350
-> Bitmap Index Scan on tenders_country_and_date_index (cost=0.00..16689.67 rows=944339 width=0) (actual time=423.213..423.214 rows=1001200 loops=1)
Index Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
Planning Time: 12.784 ms
JIT:
  Functions: 33
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 14.675 ms, Inlining 383.349 ms, Optimization 1023.521 ms, Emission 651.442 ms, Total 2072.987 ms
Execution Time: 63378.033 ms
Explain Analyze
Select * from tenders inner join items on transaction_id = tender_transaction_id
where country = 'Mexico'
and "date" >= '2023-01-01' and "date" < '2024-01-01'QUERY PLAN Gather (cost=1000.99..414258.70 rows=162080 width=823) (actual time=52.538..7006.128 rows=1292010 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Nested Loop (cost=0.99..397050.70 rows=67533 width=823) (actual time=40.211..4087.081 rows=430670 loops=3)
-> Parallel Index Scan using tenders_country_and_date_index on tenders (cost=0.43..45299.83 rows=32616 width=301) (actual time=4.376..59.760 rows=1218 loops=3)
Index Cond: ((country = 'Mexico'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
-> Index Scan using items_tender_transaction_id_index on items (cost=0.56..10.67 rows=11 width=522) (actual time=0.321..3.035 rows=353 loops=3655)
Index Cond: (tender_transaction_id = tenders.transaction_id)
Planning Time: 7.808 ms
JIT:
  Functions: 27
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 17.785 ms, Inlining 0.000 ms, Optimization 5.080 ms, Emission 93.274 ms, Total 116.138 ms
Execution Time: 7239.427 ms

Thanks in advance!