Re: Using varchar primary keys.

Gavan Schneider <pg-gts@xxxxxxxxxxx> · Mon, 1 Apr 2013 13:46:20 +1100

On 1/4/13 at 10:35 AM, Tim Uckun wrote:

Consider the following scenario.

Since I don't consider myself an expert I am talking in basic 
terms below. Please don't misunderstand this plodding as 
anything other than me attempting to be as clear as possible.

These comments reflect my understanding of Joe Celko's view as 
expressed in his books. I fully understand that many respectable 
SQL masters (respectfully) disagree with Celko in this. My take 
is to use his approach as a starting point and only deviate when 
a different approach is obviously much easier for me. (I doubt 
it's better. :). YMMV.

I have a typical tagging structure. There is a table called tags, there is
a table called taggings. The taggings table acts as a many to many join
table between the taggers and the tags.

The tags table has two fields id and tag. id is a serial data type.

Celko would suggest the table name is the plural of its primary 
key, so a table named tags has tag as the name of its primary 
key. And this is a big hint: the sequential index is not needed. 
And he further (very strongly) suggests that primary keys assume 
their natural values and not be referenced by arbitrary 
(sequential) values, i.e., avoid pseudo-keys, as neither SQL nor 
modern data bases need them.

The taggings has a tag_id field along with the context, tagger_id etc.

So this would instead have two columns named "tag" and "tagger" 
(each referencing the column of the same name in tables "tags" 
and "taggers") and whatever is implied by the "etc". if the pair 
(tag,tagger) is meant to be unique then this is a good candidate 
for primary key, otherwise a further term(s) needs to be dragged 
into the composite index (e.g., a timestamp). Once again there 
is no logical need for a sequential integer to be stored as well.

....

Given that you can still set foreign keys and cascade commands to adjust
child records either way the tables would be properly normalized so I am
wondering what I am gaining by using these serial ID fields.

Is this false economy?

I think so.

If the tags table has both sequential integer and the unique tag 
value then there is likely storage and I/O associated with two 
indexes, along with storage of both tag and associated integer. 
If the list of tags is short enough it might be suitable to use 
an ENUM to enforce both integrity and brevity.

If the list of tags is longer and/or might need updating then a 
single column table will allow for this. In general, I think 
lists of things (e.g., tag values) should be kept within the dB 
so they can be used for integrity checking, etc. I don't like 
the idea of externalising this job to the interface application. 
Also, as you have mentioned, you can elegantly handle table 
renovation, eg., if a given tag needs its name changed, it can 
be done via a foreign key constraint with cascade. (Elegant here 
refers to the syntax, the actual I/O implications may be very 
ugly depending on the scale of data that needs rewriting. :)

In another situation the tags table might well contain columns 
such as: tag, tag_full_name, tag_origin, 
tag_authority_certificate/expiry/whatever. In such an instance 
the tag is better if it's human readable/meaningful so reports 
can be useful without always back joining the full name. 
Sequential integers rarely fulfil this role as implied by the 
original question.

Regards
Gavan Schneider

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general