On Wed, May 4, 2011 at 2:25 AM, Greg Smith <greg@xxxxxxxxxxxxxxx> wrote:
> David Johnston wrote:
>>
>> Are there any rules of thumb on the performance of a PK as a function of
>> key length? I like using varchar-based identifiers since I tend to query
>> tables directly, and writing where clauses is much easier if you can avoid
>> the joins. I'm likely better off creating views and querying those, but am
>> still curious on any basic thoughts on having a 100+ character primary key.
>>
>
> The shorter the better, but it may not be as bad as you fear. The way
> B-tree indexes are built, it isn't that expensive to hold a longer key so
> long as the unique part doesn't average out to be that long. So if you
> insert "123456666666666666666" and "12345777777777777777", that's not going
> to be much different than navigating "123456" and "123457", because once
> you get that far you've already reached a unique prefix. But if your
> entries have a really long common prefix, like "111111111111111112" and
> "111111111111111113", that's going to be more expensive to deal with--even
> though the strings are the same length.
>
> If your identifiers become unique after only a few characters, it may not
> be so bad. But if they go many characters before you can distinguish
> between any two entries, you're probably not going to be happy with the
> performance or size of the indexes, relative to simple integer keys.

Yeah. The number of comparisons should be basically the same, but
situational things are going to make or break you. As the in-house
'performance guy', it might interest you to work through them all --
indexing strategies are the key to good database performance, and,
modeling concepts and religious debates aside, this is an interesting
discussion from a strictly performance point of view.

One reason natural keys work much better than expected is that you get
much better index utilization and potentially *much* better tuple/page
efficiency on certain very common classes of lookups/scans, especially
if you cluster. Speaking of clustering, you no longer have to agonize
over cases like whether to cluster on 'email' or 'email_id': the
identifying and ordering/searching criteria live in the same index,
which can be an enormous win in some cases. Do not underestimate the
value of this when the table is large and dependent scans are common.
If you key on email and the query coming from another table doesn't
need any other email properties, you just saved yourself a join without
having to denormalize in the classic sense. You also get to cut out
many sorts on the same principle.

OTOH, updates, as noted, tend to suck. In some cases hilariously so.
The larger index is going to cause more cache pressure, which is a
point against; sometimes the large index just isn't worth the cost for
what you get. Also, you have a hard limit on key sizes imposed by
postgres. It's rare to hit that in the real world, but it should be
noted. Another negative point is that the postgres stats system
doesn't deal well with composite keys for range scans. I have a pretty
good idea of how to fix this, but I haven't gotten around to it yet.

merlin
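
P.S. To make the clustering point concrete, here is a rough sketch. The
table and column names are invented for illustration; all it is meant to
show is a natural key letting a dependent scan skip both the join and
the sort:

    -- parent table keyed on the natural key rather than a surrogate id
    CREATE TABLE account (
        email     text PRIMARY KEY,
        full_name text NOT NULL
    );

    -- child table carries the natural key; its PK doubles as the
    -- search/ordering index
    CREATE TABLE login_event (
        email     text NOT NULL REFERENCES account (email),
        logged_at timestamptz NOT NULL,
        PRIMARY KEY (email, logged_at)
    );

    -- physically order the heap by that index (the pkey name here is
    -- just the usual postgres default)
    CLUSTER login_event USING login_event_pkey;

    -- if the query needs nothing from account beyond the email itself,
    -- there is no reason to touch account at all: no join, and no sort
    -- either, since the index order already matches the ORDER BY
    SELECT logged_at
      FROM login_event
     WHERE email = 'someone@example.com'
     ORDER BY logged_at;

The flip side is the update problem above: if an email ever changes, the
referencing rows in login_event and its comparatively fat index have to
be rewritten as well.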