On 1/4/13 at 10:35 AM, Tim Uckun wrote:
Consider the following scenario.
Since I don't consider myself an expert I am talking in basic
terms below. Please don't misunderstand this plodding as
anything other than me attempting to be as clear as possible.
These comments reflect my understanding of Joe Celko's view as
expressed in his books. I fully understand that many respectable
SQL masters (respectfully) disagree with Celko in this. My take
is to use his approach as a starting point and only deviate when
a different approach is obviously much easier for me. (I doubt
it's better. :). YMMV.
I have a typical tagging structure. There is a table called tags, there is
a table called taggings. The taggings table acts as a many to many join
table between the taggers and the tags.
The tags table has two fields id and tag. id is a serial data type.
Celko would suggest the table name is the plural of its primary
key, so a table named tags has tag as the name of its primary
key. And this is a big hint: the sequential index is not needed.
And he further (very strongly) suggests that primary keys assume
their natural values and not be referenced by arbitrary
(sequential) values, i.e., avoid pseudo-keys, as neither SQL nor
modern data bases need them.
The taggings has a tag_id field along with the context, tagger_id etc.
So this would instead have two columns named "tag" and "tagger"
(each referencing the column of the same name in tables "tags"
and "taggers") and whatever is implied by the "etc". if the pair
(tag,tagger) is meant to be unique then this is a good candidate
for primary key, otherwise a further term(s) needs to be dragged
into the composite index (e.g., a timestamp). Once again there
is no logical need for a sequential integer to be stored as well.
....
Given that you can still set foreign keys and cascade commands to adjust
child records either way the tables would be properly normalized so I am
wondering what I am gaining by using these serial ID fields.
Is this false economy?
I think so.
If the tags table has both sequential integer and the unique tag
value then there is likely storage and I/O associated with two
indexes, along with storage of both tag and associated integer.
If the list of tags is short enough it might be suitable to use
an ENUM to enforce both integrity and brevity.
If the list of tags is longer and/or might need updating then a
single column table will allow for this. In general, I think
lists of things (e.g., tag values) should be kept within the dB
so they can be used for integrity checking, etc. I don't like
the idea of externalising this job to the interface application.
Also, as you have mentioned, you can elegantly handle table
renovation, eg., if a given tag needs its name changed, it can
be done via a foreign key constraint with cascade. (Elegant here
refers to the syntax, the actual I/O implications may be very
ugly depending on the scale of data that needs rewriting. :)
In another situation the tags table might well contain columns
such as: tag, tag_full_name, tag_origin,
tag_authority_certificate/expiry/whatever. In such an instance
the tag is better if it's human readable/meaningful so reports
can be useful without always back joining the full name.
Sequential integers rarely fulfil this role as implied by the
original question.
Regards
Gavan Schneider
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general