: PostgreSQL Index behavior

Venkat Balaji <venkat.balaji@xxxxxxxx> · Mon, 10 Sep 2012 18:09:35 +0530

Hello Community,
I intend to understand further on PostgreSQL Index behavior on a "SELECT" statement.

We have a situation where-in Index on unique column is not being picked up as expected when used with-in the WHERE clause with other non-unique columns using AND operator.

explain  SELECT tv.short_code, tv.chn as pkg_subscription_chn,
               tv.vert as pkg_vert, ubs.campaign_id as campaign, 'none'::varchar as referer,
               CAST('CAMPAIGNWISE_SUBSCRIBER_BASE' AS VARCHAR) as vn, count(tv.msisdn) as n_count, '0'::numeric AS tot_revenue
               FROM campaign_base ubs
               JOIN tab_current_day_v2 tv
               ON ubs.ubs_seq_id = tv.ubs_seq_id
             AND tv.dt = CAST('2012-09-08' AS DATE)
             GROUP BY tv.short_code, tv.vert, tv.chn, ubs.campaign_id, vn;

                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 HashAggregate  (cost=77754.57..77754.58 rows=1 width=38)
   ->  Nested Loop  (cost=0.00..77754.56 rows=1 width=38)
         ->  Seq Scan on tab_current_day_v2 tv  (cost=0.00..77746.26 rows=1 width=39)
               Filter: (dt = '2012-09-08'::date)
         ->  Index Scan using cb_ubs_id_idx on campaign_base ubs  (cost=0.00..8.28 rows=1 width=15)
               Index Cond: (ubs.ubs_seq_id = tv.ubs_seq_id)
(6 rows)

The above plan shows "seq scan" on tab_current_day_v2 table, though there is an index on "ubs_seq_id" column which is an unique column. 

Can anyone please help us understand, why PostgreSQL optimizer is not prioritizing the unique column and hitting ubs_seq_id_idx Index here ?

Later -

We have created composite Index on "dt" (one distinct value) and "ubs_seq_id" (no duplicate values) and the index has been picked up.

Below is the scenario where-in the same query's plan picking up the composite Index.

prod-db=# create index concurrently tab_dt_ubs_seq_id_idx on tab_current_day_v2(dt,ubs_seq_id);
CREATE INDEX

prod-db=# explain  SELECT tv.short_code, tv.chn as pkg_subscription_chn,
               tv.vert as pkg_vert, ubs.campaign_id as campaign, 'none'::varchar as referer,
               CAST('CAMPAIGNWISE_SUBSCRIBER_BASE' AS VARCHAR) as vn, count(tv.msisdn) as n_count, '0'::numeric AS tot_revenue
            FROM campaign_base ubs
            JOIN tab_current_day_v2 tv
              ON ubs.ubs_seq_id = tv.ubs_seq_id
             AND tv.dt = CAST('2012-09-08' AS DATE)
             GROUP BY tv.short_code, tv.vert, tv.chn, ubs.campaign_id, vn;

                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------

 HashAggregate  (cost=16.88..16.89 rows=1 width=38)
   ->  Nested Loop  (cost=0.00..16.86 rows=1 width=38)
         ->  Index Scan using tab_dt_ubs_seq_id_idx on tab_current_day_v2 tv  (cost=0.00..8.57 rows=1 width=39)
               Index Cond: (dt = '2012-09-08'::date)
         ->  Index Scan using cb_ubs_id_idx on campaign_base ubs  (cost=0.00..8.28 rows=1 width=15)
               Index Cond: (ubs.ubs_seq_id = tv.ubs_seq_id)
(6 rows)

I was expecting the above behavior without a composite Index. A column with most unique values must be picked up when multiple columns are used in WHERE clause using AND operator. Any thoughts ?

prod-db# \d tab_current_day_v2

         Table "public.tab_current_day_v2"
          Column          |           Type           | Modifiers
--------------------------+--------------------------+-----------
 dt                       | date                     |
 chn                      | character varying(10)    |
 vert                     | character varying(20)    |
 isdn           	  | character varying        |
 bc		          | character varying(40)    |
 status                   | text                     |
 is_rene                  | boolean                  |
 age_in_sys               | integer                  |
 age_in_grace             | integer                  |
 has_prof                 | boolean                  |
 short_code               | character varying        |
 sub_vert                 | character varying(30)    |
 mode                     | character varying        |
 ubs_seq_id               | bigint                   |
 pkg_name                 | character varying(200)   |
 pkg_id                   | integer                  |
 subs_charge              | money                    |
 subs_time                | timestamp with time zone |
 ulq_seq_id               | bigint                   |
 valid_till_time          | timestamp with time zone |
 valid_from_time          | timestamp with time zone |
 latest_ube_seq_id        | bigint                   |
 latest_pkg_id            | integer                  |
 price                    | integer                  |
Indexes:
    "tab_dt_ubs_seq_id_idx" btree (dt, ubs_seq_id)
    "tab_isdn_idx" btree (msisdn)
    "tab_status_idx" btree (status)
    "ubs_seq_id_idx" btree (ubs_seq_id)

Below is the table structure and the uniqueness of each of the columns.

airtel_user_data_oltp=# select attname, n_distinct from pg_Stats where tablename='tab_current_day_v2';

         attname          | n_distinct
--------------------------+------------
 dt                       |          1
 chn                      |          7
 vert                     |         94
 isdn                     |  -0.727331
 bc                       |          4
 status                   |          3
 is_rene                  |          2
 age_in_sys               |       1018
 age_in_grac              |        369
 has_prof                 |          2
 short_code               |         23
 sub_vert                 |          5
 mode                     |          0
 ubs_seq_id               |         -1
 pkg_name                 |        461
 pkg_id                   |        461
 subs_charge              |          7
 subs_time                |         -1
 ulq_seq_id               |     122887
 valid_till_time          |  -0.966585
 valid_from_time          |  -0.962563
 latest_ube_seq_id        |         -1
 latest_pkg_id            |        475
 price                    |         18

(24 rows)

This is not an issue, but, would like to understand how PostgreSQL optimizer picks up Indexes in SELECT queries.

In an other scenario, we had used 4 columns in WHERE clause with AND operator with an Index on the column with most unique values -- The Index was picked up.

Looking forward for your help !

Regards,
VB

DISCLAIMER:

Please note that this message and
any attachments may contain confidential and proprietary material and
information and are intended only for the use of the intended recipient(s). If
you are not the intended recipient, you are hereby notified that any review,
use, disclosure, dissemination, distribution or copying of this message and any
attachments is strictly prohibited. If you have received this email in error,
please immediately notify the sender and delete this e-mail , whether
electronic or printed. Please also note that any views, opinions, conclusions
or commitments expressed in this message are those of the individual sender and
do not necessarily reflect the views of Ver sé Innovation Pvt Ltd.