I'm not a programmer so understanding the optimizer code is WAY beyond my limits.
My question, that I haven't seen answered elsewhere, is WHAT things can affect the choice of an index scan over a sequence scan. I understand that sometimes a sequence scan is faster and that you still have to get the data from the disk but my question relates to an issue we had pop up today.
We have 2 tables, which we'll refer to as laaf and laaf_new. The first table has 220M rows and the second table has 4M rows. What were basically doing is aggregating the records from the first table into the second one at which point we're going to drop the first one. This is the same table I mentioned previously in my post about pg_dump.
laaf_new has one less column than laaf and both were freshly vacuum analyzed after having an index added on a single column (other than the primary key).
The query we were doing was as follows:
select main_account_status_dim_id, count(*)
from cla_dw.loan_account_agg_fact_new
group by main_account_status_dim_id
order by main_account_status_dim_id;
One of our problems is that we don't have any PGSQL dbas here. All of our guys are DB2 (we're still looking though).
Now I've been told by our DBA that we should have been able to wholy satisfy that query via the indexes.
We did regular EXPLAINS on the query with seqscan enabled and disabled and even in our own tests actually running the queries, the results WERE faster with a seq scan than an index scan but the question we were discussing is WHY did it choose the index scan and why is the index scan slower than the sequence scan? He's telling me that DB2 would have been able to do the whole thing with indexes.
EXPLAINS:
(the reason for the random_page_cost was that we had the default of 4 in the .conf file and were planning on changing it to 2 anyway to match our other server)
set random_page_cost=2;
set enable_seqscan=on;
explain select main_account_status_dim_id, count(*)
from cla_dw.loan_account_agg_fact
group by main_account_status_dim_id
order by main_account_status_dim_id;
"Sort (cost=8774054.54..8774054.66 rows=48 width=4)"
" Sort Key: main_account_status_dim_id"
" -> HashAggregate (cost=8774052.60..8774053.20 rows=48 width=4)"
" -> Seq Scan on loan_account_agg_fact (cost=0.00..7609745.40
rows=232861440 width=4)"
set random_page_cost=2;
set enable_seqscan=off;
explain select main_account_status_dim_id, count(*)
from cla_dw.loan_account_agg_fact
group by main_account_status_dim_id
order by main_account_status_dim_id;
"Sort (cost=108774054.54..108774054.66 rows=48 width=4)"
" Sort Key: main_account_status_dim_id"
" -> HashAggregate (cost=108774052.60..108774053.20
rows=48 width=4)"
" -> Seq Scan on loan_account_agg_fact (cost=100000000.00..107609745.40 rows=232861440 width=4)"
Here's the DDL for the table laaf:
When the system is not busy again, I'll run a verbose version. The query was run against each of the tables to compare the results of aggregation change with the new table.
CREATE TABLE cla_dw.loan_account_agg_fact
(
loan_account_agg_fact_id int8 NOT NULL DEFAULT nextval('loan_account_agg_fact_loan_account_agg_fact_id_seq'::regclass),
dw_load_date_id int4 NOT NULL DEFAULT 0,
servicer_branch_dim_id int4 NOT NULL DEFAULT 0,
main_account_status_dim_id int4 NOT NULL DEFAULT 0,
product_dim_id int4 NOT NULL DEFAULT 0,
next_due_date_id int4 NOT NULL DEFAULT 0,
account_balance numeric(15,6) NOT NULL DEFAULT 0,
loan_count int4 NOT NULL DEFAULT 0,
principal numeric(15,6) NOT NULL DEFAULT 0,
interest numeric(15,6) NOT NULL DEFAULT 0,
fees numeric(15,6) NOT NULL DEFAULT 0,
gl_principal numeric(15,6) NOT NULL DEFAULT 0,
gl_interest numeric(15,6) NOT NULL DEFAULT 0,
accruable_principal numeric(15,6) NOT NULL DEFAULT 0,
unaccruable_principal numeric(15,6) NOT NULL DEFAULT 0,
calculated_principal numeric(15,6) DEFAULT 0,
current_interest numeric(15,6) NOT NULL DEFAULT 0,
past_due_interest numeric(16,5) NOT NULL DEFAULT 0,
cash_available numeric(15,6) DEFAULT 0,
cash_collected numeric(15,6) DEFAULT 0,
cash_collected_date_id int4 DEFAULT 0,
dw_agg_load_dt timestamp(0) DEFAULT ('now'::text)::timestamp(6) with time zone,
cash_available_principal numeric(15,6) DEFAULT 0,
cash_available_current numeric(15,6) DEFAULT 0,
cash_available_last numeric(15,6) DEFAULT 0,
cash_available_interest numeric(15,6) DEFAULT 0,
cash_available_fees numeric(15,6) DEFAULT 0,
cash_not_collected numeric(15,6) DEFAULT 0,
number_contacts_total int4 DEFAULT 0,
number_broken_commitments int4 DEFAULT 0,
loc_current_due_total numeric(15,6) DEFAULT 0,
loc_current_due_principal numeric(15,6) DEFAULT 0,
loc_current_due_interest numeric(15,6) DEFAULT 0,
loc_current_due_fees numeric(15,6) DEFAULT 0,
loc_past_due_last numeric(15,6) DEFAULT 0,
loc_past_due_total numeric(15,6) DEFAULT 0,
number_made_commitments int4 DEFAULT 0,
CONSTRAINT loan_account_agg_fact_pkey PRIMARY KEY (loan_account_agg_fact_id)
)
WITH OIDS;
CREATE INDEX loan_account_agg_fact_main_account_status_dim_id
ON cla_dw.loan_account_agg_fact
USING btree
(main_account_status_dim_id)
TABLESPACE fact_idx_part1_ts;
Here's the DDL for the table laaf_new:
CREATE TABLE cla_dw.loan_account_agg_fact_new
(
loan_account_agg_fact_id bigserial NOT NULL,
dw_load_date_id int4 NOT NULL,
servicer_branch_dim_id int4 NOT NULL,
main_account_status_dim_id int4 NOT NULL,
product_dim_id int4 NOT NULL,
dw_agg_load_dt timestamp,
account_balance numeric(15,6) NOT NULL DEFAULT 0,
loan_count int4 NOT NULL DEFAULT 0,
principal numeric(15,6) NOT NULL DEFAULT 0,
interest numeric(15,6) NOT NULL DEFAULT 0,
fees numeric(15,6) NOT NULL DEFAULT 0,
gl_principal numeric(15,6) NOT NULL DEFAULT 0,
gl_interest numeric(15,6) NOT NULL DEFAULT 0,
accruable_principal numeric(15,6) DEFAULT 0,
unaccruable_principal numeric(15,6) DEFAULT 0,
calculated_principal numeric(15,6) DEFAULT 0,
current_interest numeric(15,6) DEFAULT 0,
past_due_interest numeric(15,6) DEFAULT 0,
cash_available numeric(15,6) DEFAULT 0,
cash_collected numeric(15,6) DEFAULT 0,
cash_available_principal numeric(15,6) DEFAULT 0,
cash_available_current numeric(15,6) DEFAULT 0,
cash_available_last numeric(15,6) DEFAULT 0,
cash_available_interest numeric(15,6) DEFAULT 0,
cash_available_fees numeric(15,6) DEFAULT 0,
cash_not_collected numeric(15,6) DEFAULT 0,
number_contacts_total int4 DEFAULT 0,
number_broken_commitments int4 DEFAULT 0,
loc_current_due_total numeric(15,6) DEFAULT 0,
loc_current_due_principal numeric(15,6) DEFAULT 0,
loc_current_due_interest numeric(15,6) DEFAULT 0,
loc_current_due_fees numeric(15,6) DEFAULT 0,
loc_past_due_last numeric(15,6) DEFAULT 0,
loc_past_due_total numeric(15,6) DEFAULT 0,
number_made_commitments int4 DEFAULT 0,
CONSTRAINT loan_account_agg_fact_pkey_new PRIMARY KEY (loan_account_agg_fact_id) USING INDEX TABLESPACE default_ts
)
WITH OIDS TABLESPACE fact_data_part1_ts;
CREATE INDEX laafn_main_account_status_dim
ON cla_dw.loan_account_agg_fact_new
USING btree
(main_account_status_dim_id)
TABLESPACE fact_idx_part2_ts;