I intend to understand further on PostgreSQL Index behavior on a "SELECT" statement.
We have a situation where-in Index on unique column is not being picked up as expected when used with-in the WHERE clause with other non-unique columns using AND operator.
explain SELECT tv.short_code, tv.chn as pkg_subscription_chn,
tv.vert as pkg_vert, ubs.campaign_id as campaign, 'none'::varchar as referer,
CAST('CAMPAIGNWISE_SUBSCRIBER_BASE' AS VARCHAR) as vn, count(tv.msisdn) as n_count, '0'::numeric AS tot_revenue
FROM campaign_base ubs
JOIN tab_current_day_v2 tv
ON ubs.ubs_seq_id = tv.ubs_seq_id
AND tv.dt = CAST('2012-09-08' AS DATE)
GROUP BY tv.short_code, tv.vert, tv.chn, ubs.campaign_id, vn;
QUERY PLAN
----------------------------------------------------------------------------------------------------
HashAggregate (cost=77754.57..77754.58 rows=1 width=38)
-> Nested Loop (cost=0.00..77754.56 rows=1 width=38)
-> Seq Scan on tab_current_day_v2 tv (cost=0.00..77746.26 rows=1 width=39)
Filter: (dt = '2012-09-08'::date)
-> Index Scan using cb_ubs_id_idx on campaign_base ubs (cost=0.00..8.28 rows=1 width=15)
Index Cond: (ubs.ubs_seq_id = tv.ubs_seq_id)
(6 rows)
The above plan shows "seq scan" on tab_current_day_v2 table, though there is an index on "ubs_seq_id" column which is an unique column.
Can anyone please help us understand, why PostgreSQL optimizer is not prioritizing the unique column and hitting ubs_seq_id_idx Index here ?
Later -
We have created composite Index on "dt" (one distinct value) and "ubs_seq_id" (no duplicate values) and the index has been picked up.
Below is the scenario where-in the same query's plan picking up the composite Index.
prod-db=# create index concurrently tab_dt_ubs_seq_id_idx on tab_current_day_v2(dt,ubs_seq_id);
CREATE INDEX
prod-db=# explain SELECT tv.short_code, tv.chn as pkg_subscription_chn,
tv.vert as pkg_vert, ubs.campaign_id as campaign, 'none'::varchar as referer,
CAST('CAMPAIGNWISE_SUBSCRIBER_BASE' AS VARCHAR) as vn, count(tv.msisdn) as n_count, '0'::numeric AS tot_revenue
FROM campaign_base ubs
JOIN tab_current_day_v2 tv
ON ubs.ubs_seq_id = tv.ubs_seq_id
AND tv.dt = CAST('2012-09-08' AS DATE)
GROUP BY tv.short_code, tv.vert, tv.chn, ubs.campaign_id, vn;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=16.88..16.89 rows=1 width=38)
-> Nested Loop (cost=0.00..16.86 rows=1 width=38)
-> Index Scan using tab_dt_ubs_seq_id_idx on tab_current_day_v2 tv (cost=0.00..8.57 rows=1 width=39)
Index Cond: (dt = '2012-09-08'::date)
-> Index Scan using cb_ubs_id_idx on campaign_base ubs (cost=0.00..8.28 rows=1 width=15)
Index Cond: (ubs.ubs_seq_id = tv.ubs_seq_id)
(6 rows)
I was expecting the above behavior without a composite Index. A column with most unique values must be picked up when multiple columns are used in WHERE clause using AND operator. Any thoughts ?
prod-db# \d tab_current_day_v2
Table "public.tab_current_day_v2"
Column | Type | Modifiers
--------------------------+--------------------------+-----------
dt | date |
chn | character varying(10) |
vert | character varying(20) |
isdn | character varying |
bc | character varying(40) |
status | text |
is_rene | boolean |
age_in_sys | integer |
age_in_grace | integer |
has_prof | boolean |
short_code | character varying |
sub_vert | character varying(30) |
mode | character varying |
ubs_seq_id | bigint |
pkg_name | character varying(200) |
pkg_id | integer |
subs_charge | money |
subs_time | timestamp with time zone |
ulq_seq_id | bigint |
valid_till_time | timestamp with time zone |
valid_from_time | timestamp with time zone |
latest_ube_seq_id | bigint |
latest_pkg_id | integer |
price | integer |
Indexes:
"tab_dt_ubs_seq_id_idx" btree (dt, ubs_seq_id)
"tab_isdn_idx" btree (msisdn)
"tab_status_idx" btree (status)
"ubs_seq_id_idx" btree (ubs_seq_id)
Below is the table structure and the uniqueness of each of the columns.
airtel_user_data_oltp=# select attname, n_distinct from pg_Stats where tablename='tab_current_day_v2';
attname | n_distinct
--------------------------+------------
dt | 1
chn | 7
vert | 94
isdn | -0.727331
bc | 4
status | 3
is_rene | 2
age_in_sys | 1018
age_in_grac | 369
has_prof | 2
short_code | 23
sub_vert | 5
mode | 0
ubs_seq_id | -1
pkg_name | 461
pkg_id | 461
subs_charge | 7
subs_time | -1
ulq_seq_id | 122887
valid_till_time | -0.966585
valid_from_time | -0.962563
latest_ube_seq_id | -1
latest_pkg_id | 475
price | 18
(24 rows)
This is not an issue, but, would like to understand how PostgreSQL optimizer picks up Indexes in SELECT queries.
In an other scenario, we had used 4 columns in WHERE clause with AND operator with an Index on the column with most unique values -- The Index was picked up.
Looking forward for your help !
Regards,
VB
DISCLAIMER:
Please note that this message and any attachments may contain confidential and proprietary material and information and are intended only for the use of the intended recipient(s). If you are not the intended recipient, you are hereby notified that any review, use, disclosure, dissemination, distribution or copying of this message and any attachments is strictly prohibited. If you have received this email in error, please immediately notify the sender and delete this e-mail , whether electronic or printed. Please also note that any views, opinions, conclusions or commitments expressed in this message are those of the individual sender and do not necessarily reflect the views of Ver sé Innovation Pvt Ltd.