Re: Extended statistics for correlated columns, row estimates when values are not in MCVs list

Tomas Vondra <tomas.vondra@xxxxxxxxxxxxxxxx> · Mon, 7 Dec 2020 16:31:15 +0100

On 12/4/20 11:53 PM, Michael Lewis wrote:
> On version 12.5, I have a query similar to the below where I am getting
> rather unfortunate row estimates and a sub-optimal index choice as a result.
> 
> SELECT
>   id
> FROM
>   messages
> WHERE
>   client_id = 1234
>   and site_id = 889977
>   and message_type_id in ( 59, 62, 102, 162 )
>   and sent_on > NOW() - INTERVAL '3 days';
> 
> I created extended statistics (all types) on client_id and site_id,
> analyzed the columns, and in pg_stats_ext.dependencies (4 = client_id, 7 =
> site_id) have:
> {"4 => 7": 0.002997, "7 => 4": 0.896230}
> 
> When I check row estimates like below, I get results that indicate the
> columns are expected to be independent still.
> explain SELECT id FROM messages; --889594304
> explain SELECT id FROM messages where client_id = 1234; --133439
> explain SELECT id FROM messages where site_id = 889977; --28800
> explain SELECT id FROM messages where client_id = 1234 and site_id =
> 889977; --4
> 

I was a bit confused about this initially, because why would it still be
estimated as independent clauses, with the extended stats defined? But I
think the explanation is fairly simple - the combination of values is
simply rare enough not to be included in the MCV list, so it probably
gets estimated using the "regular" logic as if independent. But we still
mark the clauses as estimated, so the functional dependencies are not
really considered.

(I wonder if the code in PG14 would work better, but that's of little
use of course.)

> However, I pick a client & site ID pair which show up in the MCVs list,
> then I get the same estimate when querying for that site_id with or without
> including the client_id. That is great.
> 

I think this is consistent with the above explanation - in this case the
MCV actually kicks in, significantly improving the estimate.

> Is it reasonable to expect that if the correlation between two columns is
> rather high, then the optimizer might figure the columns are not
> independent and perhaps would give less weight to the value derived from
> independent column statistics? With table statistics, it is possible to set
> a static value or ratio for something like ndistinct. Any chance for
> something similar on dependency someday?
> 
> Perhaps I am expecting too much or have a poor understanding of what
> extended statistics can or someday might do. I deal with under estimates
> from correlations between client_id and sites or other similar
> dependent objects a fair bit and am hopeful to resolve some of those
> planning problems with extended stats, without the maintenance overhead of
> migrating everything to a single client per database to get more specific
> statistics.
> 

What you might try is defining the statistics with only the functional
dependencies. That should consider the column-level correlation even
when the combination of values is not in the MCV. It might make the
"good" estimate worse, but that's an inherent trade-off.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company