extended statistics n-distinct on multiple columns not used when join two tables

"James Pang (chaolpan)" <chaolpan@xxxxxxxxx> · Tue, 13 Jun 2023 09:21:25 +0000

Hi,
     When join two table on multiple columns equaljoin, rows estimation always use selectivity = multiplied by distinct multiple individual columns, possible to use  extended n-distinct statistics on multiple columns?

    PG v14.8-1, attached please check test case with details.

Thanks,

James 

testdb=> \d test1
                          Table "csdigital.test1"
     Column      |          Type          | Collation | Nullable | Default
-----------------+------------------------+-----------+----------+---------
 account_key     | character varying(36)  |           | not null |
 subscription_id | character varying(36)  |           | not null |
 site_key        | character varying(36)  |           | not null |
 site_id         | numeric(38,0)          |           |          |
 site_name       | character varying(128) |           |          |
Indexes:
    "test1_pk" PRIMARY KEY, btree (account_key, subscription_id, site_key)

testdb=> \d test2
                          Table "csdigital.test2"
     Column      |          Type          | Collation | Nullable | Default
-----------------+------------------------+-----------+----------+---------
 account_key     | character varying(36)  |           | not null |
 subscription_id | character varying(36)  |           | not null |
 offer_code      | character varying(128) |           | not null |
 offer_name      | character varying(128) |           |          |
 offer_type      | character varying(128) |           |          |
Indexes:
    "test2_pk" PRIMARY KEY, btree (account_key, subscription_id, offer_code)

testdb=>  explain select count(*) from test1 t1,test2 t2 where t1.account_key=t2.account_key and t1.subscription_id=t2.subscription_id;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=629339.31..629339.32 rows=1 width=8)
   ->  Merge Join  (cost=1.99..629339.26 rows=22 width=0)             
         Merge Cond: (((t1.account_key)::text = (t2.account_key)::text) AND ((t1.subscription_id)::text = (t2.subscription_id)::text))
         ->  Index Only Scan using test1_pk on test1 t1  (cost=0.56..263134.00 rows=4906772 width=61)
         ->  Index Only Scan using test2_pk on test2 t2  (cost=0.56..309937.54 rows=6346934 width=59)
(5 rows)

Time: 8.018 ms
testdb=> SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, tablename, attname, inherited, null_frac, n_distinct, array_length(most_common_vals,1) n_mcv, array_length(histogram_bounds,1) n_hist, correlation
testdb->  FROM pg_stats WHERE attname in ('account_key','subscription_id') AND tablename='test1' ORDER BY 1 DESC;
   frac_mcv    | tablename |     attname     | inherited | null_frac | n_distinct | n_mcv | n_hist | correlation
---------------+-----------+-----------------+-----------+-----------+------------+-------+--------+--------------
 0.00026666667 | test1     | account_key     | f         |         0 |     249474 |     1 |    101 | -0.008081022
 0.00026666667 | test1     | subscription_id | f         |         0 | -0.6047012 |     1 |    101 |   0.26008433
(2 rows)

Time: 7.760 ms
testdb=> SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, tablename, attname, inherited, null_frac, n_distinct, array_length(most_common_vals,1) n_mcv, array_length(histogram_bounds,1) n_hist, correlation
testdb->  FROM pg_stats WHERE attname in ('account_key','subscription_id') AND tablename='test2' ORDER BY 1 DESC;
 frac_mcv | tablename |     attname     | inherited | null_frac | n_distinct  | n_mcv | n_hist | correlation
----------+-----------+-----------------+-----------+-----------+-------------+-------+--------+-------------
          | test2     | account_key     | f         |         0 |      325385 |       |    101 | 0.006174604
          | test2     | subscription_id | f         |         0 | -0.68929785 |       |    101 | 0.123789854
(2 rows)

Time: 1.423 ms
testdb=> select relname,reltuples from pg_class where relname in ('test1','test2');
 relname |  reltuples
---------+--------------
 test1   | 4.906772e+06
 test2   | 6.346934e+06
(2 rows)

Time: 0.746 ms
testdb=> create statistics cs_contract_site_map_account_key_subscription_id(dependencies, ndistinct) on account_key, subscription_id from test1;
CREATE STATISTICS
Time: 7.918 ms
testdb=> create statistics cs_contract_account_key_subscription_id(dependencies, ndistinct) on account_key, subscription_id from test2;
CREATE STATISTICS
Time: 1.558 ms
testdb=> analyze test1;
aANALYZE
Time: 745.952 ms
testdb=> analyze test2;
ANALYZE
Time: 606.735 ms
testdb=> select st.stxrelid,st.stxname,st.stxstattarget,st.stxkeys,std.stxdndistinct,std.stxddependencies
testdb-> from pg_statistic_ext st join pg_statistic_ext_data std on st.oid=std.stxoid;
ERROR:  permission denied for table pg_statistic_ext_data
Time: 2.464 ms

testdb=# select st.stxrelid,st.stxname,st.stxstattarget,st.stxkeys,std.stxdndistinct,std.stxddependencies
testdb-# from pg_statistic_ext st join pg_statistic_ext_data std on st.oid=std.stxoid;
 stxrelid |                     stxname                      | stxstattarget | stxkeys |   stxdndistinct   |             stxddependencies
----------+--------------------------------------------------+---------------+---------+-------------------+------------------------------------------
  5920481 | cs_contract_site_map_account_key_subscription_id |            -1 | 1 2     | {"1, 2": 3552572} | {"1 => 2": 0.891333, "2 => 1": 0.997467}
  5920675 | cs_contract_account_key_subscription_id          |            -1 | 1 2     | {"1, 2": 6173026} | {"1 => 2": 0.917700, "2 => 1": 0.997400}
(2 rows)

testdb=> explain select count(*) from test1 t1,test2 t2 where t1.account_key=t2.account_key and t1.subscription_id=t2.subscription_id;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=629329.20..629329.21 rows=1 width=8)
   ->  Merge Join  (cost=1.28..629329.15 rows=23 width=0)  <<< here we expect extended ndistinct used to rows = much more value, but still use distinct(account_key)*distinct(subscription_id)
         Merge Cond: (((t1.account_key)::text = (t2.account_key)::text) AND ((t1.subscription_id)::text = (t2.subscription_id)::text))
         ->  Index Only Scan using test1_pk on test1 t1  (cost=0.56..263134.59 rows=4906790 width=61)
         ->  Index Only Scan using test2_pk on test2 t2  (cost=0.56..309935.05 rows=6346754 width=59)
(5 rows)

Time: 4.307 ms