On Wed, May 4, 2011 at 2:25 AM, Greg Smith <greg@xxxxxxxxxxxxxxx> wrote:
> David Johnston wrote:
>>
>> Are there any rules of thumb on the performance of a PK as a function of
>> key length? I like using varchar-based identifiers since I tend to query
>> tables directly, and writing where clauses is much easier if you can avoid
>> the joins. I'm likely better off creating views and querying those, but am
>> still curious on any basic thoughts on having a 100+ character primary key.
>>
>
> The shorter the better, but it may not be as bad as you fear. The way
> B-tree indexes are built, it isn't that expensive to hold a longer key so
> long as the unique part doesn't average out to be that long. So if you
> insert "123456666666666666666" and "12345777777777777777", that's not going
> to be much different than navigating "123456" and "123457", because once
> you get that far you've already reached a unique prefix. But if your
> entries have a really long common prefix, like "111111111111111112" and
> "111111111111111113", that's going to be more expensive to deal with--even
> though the strings are the same length.
>
> If your identifiers become unique after only a few characters, it may not
> be so bad. But if they go many characters before you can distinguish
> between any two entries, you're probably not going to be happy with the
> performance or size of the indexes, relative to simple integer keys.

Yeah. The number of comparisons should be basically the same, but
situational things are going to make or break you. As the in-house
'performance guy', it might interest you to work through them all --
indexing strategies are the key to good database performance, and,
modeling concepts and religious debates aside, this is an interesting
discussion from a strictly performance point of view.

One reason natural keys work much better than expected is that you get
much better index utilization and potentially *much* better tuple/page
efficiency on certain very common classes of lookups/scans, especially
if you cluster. Speaking of clustering, you no longer have to agonize
over cases like whether to cluster on 'email' or 'email_id': the
identifying and ordering/searching criteria live in the same index,
which can be an enormous win in some cases. Do not underestimate the
value of this when the table is large and dependent scans are common.
If you key on email and the query coming from another table doesn't
need any other email properties, you just saved yourself a join without
having to denormalize in the classic sense. You also get to cut out
many sorts on the same principle.

OTOH, updates, as noted, tend to suck. In some cases hilariously so.
The larger index is going to cause more cache pressure, which is a
point against; sometimes the large index just isn't worth the cost for
what you get. Also, you have a hard limit on key sizes imposed by
postgres. It's rare to hit that in the real world, but it should be
noted. Another negative point is that the postgres stats system
doesn't deal well with composite keys for range scans. I have a pretty
good idea of how to fix this, but I haven't gotten around to it yet.

merlin
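
P.S. To make the clustering point concrete, here is a rough sketch. The
table and column names are invented for illustration; all it is meant to
show is a natural key letting a dependent scan skip both the join and
the sort:

    -- parent table keyed on the natural key rather than a surrogate id
    CREATE TABLE account (
        email     text PRIMARY KEY,
        full_name text NOT NULL
    );

    -- child table carries the natural key; its PK doubles as the
    -- search/ordering index
    CREATE TABLE login_event (
        email     text NOT NULL REFERENCES account (email),
        logged_at timestamptz NOT NULL,
        PRIMARY KEY (email, logged_at)
    );

    -- physically order the heap by that index (the pkey name here is
    -- just the usual postgres default)
    CLUSTER login_event USING login_event_pkey;

    -- if the query needs nothing from account beyond the email itself,
    -- there is no reason to touch account at all: no join, and no sort
    -- either, since the index order already matches the ORDER BY
    SELECT logged_at
      FROM login_event
     WHERE email = 'someone@example.com'
     ORDER BY logged_at;

The flip side is the update problem above: if an email ever changes, the
referencing rows in login_event and its comparatively fat index have to
be rewritten as well.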