From: Kevin Galligan
[mailto:kgalligan@xxxxxxxxx]
Sent: Wednesday, October 29, 2008 4:34 PM
To: Dann Corbit
Cc: pgsql-general@xxxxxxxxxxxxxx
Subject: Re: FW: Slow query performance
Sorry for the lack of detail. There are indexes on both state and age, but
not a clustered index on both, as the queries are fairly arbitrary (that's
the short answer; the long answer is that, at least with those columns,
something like that may be an option later, but I don't know enough yet).
I don't have the GUI admin set up, as I'm doing this over ssh. Full table
definition follows below. The short answer is simple btree indexes on the
columns being searched. I was applying simple indexes on all the columns,
since there will be queries like "where [col] is not null", although in
retrospect that's fairly pointless unless the column has very little data.
Even then, maybe.
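For the "is not null" case, a partial index would at least keep the index
small, since only populated rows get entries. A sketch (the column here is
just an example, untested on this table):

```sql
-- Partial index: only rows where the column is non-null are indexed,
-- so the index stays small even on a table this size.
-- ("pcuser" is just an example column.)
CREATE INDEX idx_jpcuser_notnull ON bigdatatable (pcuser)
    WHERE pcuser IS NOT NULL;
```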
Anyway, looking at the output, the time jumps from 6727.848 to 387159.175
during the bitmap heap scan (I was misreading the Aggregate line earlier).
Considering the size involved, is this something that Postgres has decided is
too big to be done in memory? That would be my wild guess.
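If memory is the limit, one thing worth trying is raising work_mem for the
session before running the query; among other things it caps how large an
in-memory bitmap can get before it goes lossy. The value below is a guess
and would need tuning against the machine's actual RAM:

```sql
-- Per-operation sort/hash/bitmap memory, for this session only.
-- 256MB is an illustrative guess, not a recommendation.
SET work_mem = '256MB';
```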
Ran another query, this one even simpler. Still quite long...
explain analyze select count(*) from bigdatatable
  where age between 22 and 23 and state = 'NY';

                                  QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate  (cost=37.41..37.42 rows=1 width=0) (actual time=217998.706..217998.707 rows=1 loops=1)
   ->  Index Scan using idx_jage on bigdatatable  (cost=0.00..37.41 rows=1 width=0) (actual time=247.209..217988.584 rows=10303 loops=1)
         Index Cond: ((age >= 22) AND (age <= 23))
         Filter: ((state)::text = 'NY'::text)
 Total runtime: 217998.800 ms
Abbreviated schema below. The table is huge. Originally I had a design with
a main "anchor" table that held all records, with most of those columns in
other tables I would join for the search. That didn't perform very well, so
I decided to go the other way and see how it worked. I did a count on all
the data: any column populated in fewer than 5 percent of the rows went into
its own table, and everything else is in this big one. The indexing strategy
is a joke right now; I just applied an index to each column. This is still
in the testing phase.
I had set this up on MySQL, and the joins on the full-size db turned out to
be terrible. I'm currently setting up the "one large table" design on MySQL
to see how that works.
The obvious answer would be that the table is huge, so when the query runs
it grabs all that data and uses very little of it. True. However, I'm not
sure how to approach the design now. It's rarely going to need data from
more than a few columns, but joining across tens or hundreds of millions of
records didn't seem like much fun either. Thoughts?
CREATE TABLE bigdatatable (
account integer,
city character varying(20),
zip character(5),
dincome character(1),
sex character(1),
mob boolean,
religion character(1),
groupcd character(1),
lastdata character varying(4),
countycd character varying(3),
state character varying(2),
dutype character varying(1),
orders integer,
countysz character varying(1),
language character varying(2),
cbsacode character varying(5),
cbsatype character varying(1),
age smallint,
dob date,
ccard boolean,
lor integer,
bankcard boolean,
lastord date,
total integer,
lmob boolean,
homeown character varying(1),
ord1st date,
ordlast date,
married boolean,
deptcard boolean,
ordtotm smallint,
ordlastm date,
ord1stm date,
orddolm smallint,
pcuser boolean,
homeval character varying(1),
mailresp boolean,
lhomepc boolean,
dirrspby boolean,
mgift boolean,
lebuyer boolean,
payother smallint,
lhomdecr boolean,
driver boolean,
ordtote smallint,
ord1ste date,
ordlaste date,
orddole smallint,
mhmdecor boolean,
oddsnend smallint,
aptot smallint,
apwk smallint,
apdol smallint,
payccrd smallint,
landval smallint,
mfapparl boolean,
mgengift boolean,
homeblt smallint,
homebydt date,
educate character varying(1),
children boolean,
payvisa smallint,
hmfr smallint,
maghlth smallint,
homebypx integer,
gfhol smallint,
mbeauty boolean,
apwmtot smallint,
apwmwk smallint,
apwmdol smallint,
travlseg integer,
lhealth boolean,
lcharity boolean,
moutdoor boolean,
occupatn character varying(4),
fundrais boolean,
msports boolean,
hg smallint,
magfam smallint,
melectrc boolean,
lelectrc boolean,
bankcrd1 boolean,
lfoodck boolean,
mfood boolean,
finance boolean,
hmfrntot smallint,
hmfrnwk smallint,
hmfrndol smallint,
paymastr smallint,
lgarden boolean,
lartantq boolean,
martantq boolean,
hb smallint,
mmaleap boolean,
lhwrkshp boolean,
hghmctot smallint,
hmhmcwk smallint,
hghmcdol smallint,
paycash smallint,
studntln boolean,
lwelness boolean,
opportun smallint,
gftot smallint,
gfwk smallint,
giftinst smallint,
gfdol smallint,
mchildrn boolean,
chtot smallint,
chwk smallint,
chdol smallint,
hbhltot smallint,
hbhltwk smallint,
lifestyl smallint,
hmhwrtot smallint,
hmhwrwk smallint,
lsports boolean,
hmhwrdol smallint,
hbhltdol smallint,
mfemplus boolean,
mhightkt boolean,
apmntot smallint,
apmnwk smallint,
sltot smallint,
slwk smallint,
hmlintot smallint,
hmlinwk smallint,
hmlindol smallint,
sldol smallint,
genmwk smallint,
genmtot smallint,
genmdol smallint,
apmndol smallint,
chprods smallint,
lfitness boolean,
lculturl boolean
);
CREATE INDEX idx_jage ON bigdatatable USING btree (age);
CREATE INDEX idx_japdol ON bigdatatable USING btree (apdol);
CREATE INDEX idx_japtot ON bigdatatable USING btree (aptot);
CREATE INDEX idx_japwk ON bigdatatable USING btree (apwk);
CREATE INDEX idx_japwmtot ON bigdatatable USING btree (apwmtot);
CREATE INDEX idx_jbankcard ON bigdatatable USING btree (bankcard);
CREATE INDEX idx_jcbsacode ON bigdatatable USING btree (cbsacode);
CREATE INDEX idx_jcbsatype ON bigdatatable USING btree (cbsatype);
CREATE INDEX idx_jccard ON bigdatatable USING btree (ccard);
CREATE INDEX idx_jchildren ON bigdatatable USING btree (children);
CREATE INDEX idx_jcountycd ON bigdatatable USING btree (countycd);
CREATE INDEX idx_jcountysz ON bigdatatable USING btree (countysz);
CREATE INDEX idx_jdeptcard ON bigdatatable USING btree (deptcard);
CREATE INDEX idx_jdirrspby ON bigdatatable USING btree (dirrspby);
CREATE INDEX idx_jdob ON bigdatatable USING btree (dob);
CREATE INDEX idx_jdriver ON bigdatatable USING btree (driver);
CREATE INDEX idx_jdutype ON bigdatatable USING btree (dutype);
CREATE INDEX idx_jeducate ON bigdatatable USING btree (educate);
CREATE INDEX idx_jfundrais ON bigdatatable USING btree (fundrais);
CREATE INDEX idx_jgfhol ON bigdatatable USING btree (gfhol);
CREATE INDEX idx_jhmfr ON bigdatatable USING btree (hmfr);
CREATE INDEX idx_jhomeblt ON bigdatatable USING btree (homeblt);
CREATE INDEX idx_jhomebydt ON bigdatatable USING btree (homebydt);
CREATE INDEX idx_jhomeown ON bigdatatable USING btree (homeown);
CREATE INDEX idx_jhomeval ON bigdatatable USING btree (homeval);
CREATE INDEX idx_jlandval ON bigdatatable USING btree (landval);
CREATE INDEX idx_jlanguage ON bigdatatable USING btree (language);
CREATE INDEX idx_jlastord ON bigdatatable USING btree (lastord);
CREATE INDEX idx_jlebuyer ON bigdatatable USING btree (lebuyer);
CREATE INDEX idx_jlhealth ON bigdatatable USING btree (lhealth);
CREATE INDEX idx_jlhomdecr ON bigdatatable USING btree (lhomdecr);
CREATE INDEX idx_jlhomepc ON bigdatatable USING btree (lhomepc);
CREATE INDEX idx_jlmob ON bigdatatable USING btree (lmob);
CREATE INDEX idx_jlor ON bigdatatable USING btree (lor);
CREATE INDEX idx_jmaghlth ON bigdatatable USING btree (maghlth);
CREATE INDEX idx_jmailresp ON bigdatatable USING btree (mailresp);
(a bunch more in here...)
CREATE INDEX idx_jstate ON bigdatatable USING btree (state);
CREATE INDEX idx_jtotal ON bigdatatable USING btree (total);
>>
These indexes are not going to be very selective. For instance, if you use
the state index, you will scan 1/50th of the data on average, and for big
states like NY, CA, and TX a lot more than 1/50th. Similarly for age,
especially when you ask for a range. Wading through even a small percentage
of the data via an index is surprisingly expensive when neither the data nor
the index is clustered. Indexes come into their own when you trim the data
by several orders of magnitude with a specific sort of query.
If you have an index on (state, age) (for instance), then you will filter
through perhaps 1/50th as much data as you would with a single-column index.
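Concretely, something like this (the index name is just an example) would
let the earlier query satisfy both the state equality and the age range
from the index condition, instead of filtering state row by row:

```sql
-- Composite index: equality column (state) first, range column (age)
-- second, so a query like "state = 'NY' AND age BETWEEN 22 AND 23"
-- touches only the matching slice of the index.
CREATE INDEX idx_state_age ON bigdatatable (state, age);
```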
Having a zillion indexes like
that will make updates really slow.
It might be worth considering
a column database model.
<<