Re: Primary Key

Sam Mason <sam@xxxxxxxxxxxxx> · Sun, 18 Nov 2007 17:51:23 +0000

On Sun, Nov 18, 2007 at 04:19:01AM +0000, Ron Johnson wrote:
> If the PK was synthetic and generated by the engine, then a (buggy)
> app could insert duplicate tolls and the system wouldn't utter a
> peep.  But the customer sure would when he saw the duplicate entries.

You'd just need to put a UNIQUE constraint on an appropriate set of
columns and the database would complain in all the same places.

One big benefit that I've been able to discern with having natural,
over synthetic, keys is that you need to pull in every table up the
hierarchy when you want to find out something lower down.  An example
would probably help; say we were recording samples from locations, with
these samples being collected in groups on specific visits.  I've been
using synthetic keys more recently, so I'll start with them:

  CREATE TABLE locations (
    id SERIAL NOT NULL PRIMARY KEY,
    locntype TEXT
  );
  CREATE TABLE visits (
    id SERIAL PRIMARY KEY,
    locnid   INTEGER NOT NULL REFERENCES locations,
    visitnum INTEGER NOT NULL,
      UNIQUE (locnid,visitnum),
    visitdate DATE
  );
  CREATE TABLE samples (
    id SERIAL PRIMARY KEY,
    visitid   INTEGER NOT NULL REFERENCES visits,
    samplenum INTEGER NOT NULL,
      UNIQUE (visitid,samplenum),
    barcode TEXT UNIQUE
  );

Say I wanted to summarise the number of samples per location type, I'd
be forced to join onto the visits table; like this:

  SELECT l.locntype, COUNT(*)
  FROM locations l, visits v, samples s
  WHERE l.id = v.locnid
    AND v.id = s.visitid
  GROUP BY l.locntype;

If we now organise the tables using natural keys (note that the location
id is still pretty artificial as it has to map back to something in the
real world):

  CREATE TABLE locations (
    id TEXT NOT NULL PRIMARY KEY,
    locntype TEXT
  );
  CREATE TABLE visits (
    locnid   TEXT    NOT NULL REFERENCES locations,
    visitnum INTEGER NOT NULL,
      PRIMARY KEY (locnid,visitnum),
    visitdate DATE
  );
  CREATE TABLE samples (
    locnid    TEXT    NOT NULL,
    visitnum  INTEGER NOT NULL,
    samplenum INTEGER NOT NULL,
      PRIMARY KEY (locnid,visitnum,samplenum),
      FOREIGN KEY (locnid,visitnum) REFERENCES visits,
    barcode TEXT UNIQUE
  );

This has a nice simplifying effect on the above query; I can directly
refer to the location for each sample:

  SELECT l.locntype, COUNT(*) AS samples
  FROM locations l, samples s
  WHERE l.id = s.locnid
  GROUP BY l.locntype;

Is that a reasonable example of the different ways of laying things out
in the two styles?

The main advantage of synthetic keys is that it introduces another
layer of abstraction into the database, potentially allowing greater
flexibility (of course this can also be a burden).

Natural keys allow the above simplification of queries and ease the
manual fix-up of the data should something go horribly wrong.  It's
reasonably easy to fix things if you have a sample referencing a
non-existent visit with natural keys, but if you've got synthetic keys
you're probably going to have to dump the sample as well.  Of course,
if you've got yourself into this state you've already failed, but some
recourse is nice.

In summary; I see various advantages to either approach, but I don't
see either as being fundamentally "better".  You can express the same
constraints in either style (I'm sure someone will think of a good
counter example though), it's the context in which it's used that will
determine which is more suitable to the task at hand.  Using one style
exclusively is almost certainly bad, but having a preference for one or
the other is probably good as it'll make the database as a whole more
cohesive and subsequently ease maintenance.

Comments?

  Sam

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend