Re: 10+hrs vs 15min because of just one index

Tom Lane <tgl@xxxxxxxxxxxxx> · Sun, 12 Feb 2006 10:54:34 -0500

Aaron Turner <synfinatic@xxxxxxxxx> writes:
> Well before I go about re-architecting things, it would be good to
> have a strong understanding of just what is going on.  Obviously, the
> unique index on the char(48) is the killer.  What I don't know is:

You have another unique index on the integer primary key, so it's not
the mere fact of a unique index that's hurting you.

> 1) Is this because the column is so long?

Possibly.  Allowing for 12 bytes index-entry overhead, the char keys
would be 60 bytes vs 16 for the integer column, so this index is
physically almost 4x larger than the other.  You might say "but that
should only cause 4x more I/O" but it's not necessarily so.  What's
hard to tell is whether you are running out of RAM disk cache space,
resulting in re-reads of pages that could have stayed in memory when
dealing with one-fifth as much index data.  You did not show us the
iostat numbers for the two cases, but it'd be interesting to look at
the proportion of writes to reads on the data drive in both cases.

> 2) Is this because PG is not optimized for char(48) (maybe it wants
> powers of 2? or doesn't like even numbers... I don't know, just
> throwing it out there)

Are the key values really all 48 chars long?  If not, you made a
bad datatype choice: varchar(n) (or even text) would be a lot
smarter.  char(n) wastes space on blank-padding.

Another thing to think about is whether this is C locale or not.
String comparisons in non-C locales can be horrendously expensive
... though I'd expect that to cost CPU not I/O.  (Hmm ... is it
possible your libc is hitting locale config files constantly?
Might be worth strace'ing to confirm exactly where the I/O is
going.)

> 4) Does decoding the data (currently base64) and storing the binary
> data improve the distribution of the index, thereby masking it more
> efficent?

No, but it'd reduce the size of the index, which you certainly want.
Storing as bytea would also eliminate any questions about wasteful
locale-dependent comparisons.

The only one of these effects that looks to me like it could result in
worse-than-linear degradation of I/O demand is maxing out the available
RAM for disk cache.  So while improving the datatype choice would
probably be worth your while, you should first see if fooling with
shared_buffers helps, and if not it's time to buy RAM not disk.

			regards, tom lane