Hi,
Thank you very much for your answers and sorry for the delayed response.
> I checked and found that range types don't have "normal" statistics, and in
> particular seem to use a poor ndistinct estimate..
>
>     /* Estimate that non-null values are unique */
>     stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
I investigated this idea and played around with the n_distinct value, and you are absolutely right: the statistics do behave strangely with range types. Even creating extended statistics

    CREATE STATISTICS t_stats (mcv) ON daterange, spatial_feature_id FROM t;
    ANALYZE t;

doesn't change this behavior.
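Concretely, what I tried looked roughly like this (a sketch using the table t from your example; the n_distinct column option is documented for plain columns, and whether the range type's analyze routine honors it is exactly what did not seem to work here):

    -- What ANALYZE stored for the range column: n_distinct close to -1
    -- ("all non-null values distinct"), although only a handful of
    -- distinct ranges actually exist per partition.
    SELECT attname, n_distinct, null_frac
    FROM pg_stats
    WHERE tablename = 't' AND attname = 'daterange';

    -- Per-column override, then re-analyze; this helps for ordinary
    -- scalar columns but did not appear to fix the estimates here.
    ALTER TABLE t ALTER COLUMN daterange SET (n_distinct = 150);
    ANALYZE t;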
I do get that range types were designed with GiST and range comparisons in mind, but since they are a really neat way to describe not only a date but also its granularity (i.e. "this data represents exactly this week"), it would be really nice if these data types worked with primary keys, and thus B-tree indexes, too.
In my case, I replaced the daterange column with a BIGINT that holds the exact same information at the byte level. The value can be immutably converted back to a daterange and vice versa. This solved the problem for me.
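In case it is useful to someone else, the conversion I ended up with looks roughly like this (a sketch: the function names are mine, and it assumes canonical half-open [) bounds with both dates on or after 2000-01-01, so each day offset fits in 32 bits):

    -- Pack lower/upper bound (days since 2000-01-01) into one BIGINT.
    CREATE FUNCTION daterange_to_bigint(dr daterange) RETURNS bigint
    LANGUAGE sql IMMUTABLE AS $$
        SELECT ((lower($1) - date '2000-01-01')::bigint << 32)
             | ((upper($1) - date '2000-01-01')::bigint & 4294967295);
    $$;

    -- Inverse: unpack the two day offsets and rebuild the daterange.
    CREATE FUNCTION bigint_to_daterange(v bigint) RETURNS daterange
    LANGUAGE sql IMMUTABLE AS $$
        SELECT daterange(date '2000-01-01' + ($1 >> 32)::int,
                         date '2000-01-01' + ($1 & 4294967295)::int);
    $$;

Being IMMUTABLE, both functions can also be used in expression indexes or generated columns, and the BIGINT column gets ordinary scalar statistics.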
Thank you very much for your time and help.
Best Regards
From: Justin Pryzby <pryzby@xxxxxxxxxxxxx>
Sent: Wednesday, July 22, 2020 4:40 PM
To: Julian Wolf <julian.wolf@xxxxxxxxxxx>
Cc: pgsql-performance Postgres Mailing List <pgsql-performance@xxxxxxxxxxxxxxxxxxxx>
Subject: Re: Too few rows expected by Planner on partitioned tables

On Tue, Jul 21, 2020 at 01:09:22PM +0000, Julian Wolf wrote:
> Hello,
>
> A description of what you are trying to achieve and what results you expect:
> Our database is growing on a daily basis by about 2.5 million rows per table
> (2 at the moment). Because of that, we decided to partition the data,
> especially as we are pre-aggregating the data for weeks, months, quarters
> and years. Every aggregation is stored in a separate partition:
> ...
> Our problem is that the planner always predicts one row to be returned,
> although only a part of the primary key is queried. This problem exceeds
> feasibility of performance rapidly - a query only involving a few days
> already takes dozens of seconds. All tables are analyzed and pg_stats
> looks reasonable IMHO.
...
>     PRIMARY KEY ( daterange, spatial_feature_id, visitor_profile_id,
>                   activity_type_combination_id, activity_chain_id),
...
> ) PARTITION BY LIST (daterange);

>  schemaname         relname                         n_live_tup
>  mobility_insights  location_statistics_y2019m03d   23569853
>  mobility_insights  location_statistics_y2019m03w   19264373
>  mobility_insights  location_statistics_y2019m03    18105295

>  Aggregate  (cost=2.79..2.80 rows=1 width=8) (actual time=143.073..143.073 rows=1 loops=1)
>    Buffers: shared hit=67334
>    ->  Index Scan using location_statistics_y2019m03w_pkey on location_statistics_y2019m03w st  (cost=0.56..2.78 rows=1 width=8) (actual time=0.026..117.284 rows=516277 loops=1)
>          Index Cond: ((daterange = '[2019-03-04,2019-03-11)'::daterange) AND (spatial_feature_id = 12675))
>          Buffers: shared hit=67334

I guess this isn't actually the problem query, since it takes 143ms and not
dozens of seconds.

I don't know what is the problem query, but maybe it might help to create a
new index on spatial_feature_id, which could be scanned rather than scanning
the unique index.

Also, if daterange *and* spatial_feature_id are always *both* included, then
this might work:

postgres=# CREATE STATISTICS t_stats (mcv) ON daterange,spatial_feature_id FROM t ; ANALYZE t;

--
Justin