Re: pervasiveness of surrogate (also called synthetic) keys

Merlin Moncure <mmoncure@xxxxxxxxx> · Tue, 3 May 2011 08:42:02 -0500

On Mon, May 2, 2011 at 11:53 PM, Craig Ringer
<craig@xxxxxxxxxxxxxxxxxxxxx> wrote:
> On 03/05/11 11:07, Greg Smith wrote:
>
>> That doesn't mean you can't use
>> them as a sort of foreign key indexing the data; it just means you can't
>> make them the sole unique identifier for a particular entity, where that
>> entity is a person, company, or part.
>
> Classic case: a database here has several tables indexed by MAC address.
> It's used for asset reporting and software inventory.
>
> Problem: VMs generate random MAC addresses by default. They're not
> guaranteed to be globally unique. Collisions have happened and will
> probably happen again. In this case, it wasn't a big deal, but it just
> goes to show that even the "obviously" globally unique isn't necessarily so.

It's precisely pathological cases like this where uniqueness
constraints are important and should be used.  By the way, we aren't
debating the use of natural case but whether to define uniqueness
constraints.  My main gripe with surrogates is that their use often
leads directly to lazy schema design where uniqueness constraints are
not defined which leads to data problems exactly like the case you
described above.

In a purely surrogate table with no unique on the mac, suppose you
have two records with the same value for the address, and there are no
other interesting fields on the table or elsewhere in the database:
*) who/what made the decision to place a second record on the table?
*) is that decision verifiable?
*) Is that decision repeatable?
*) are there other routes of data entry into the database that bypass
that decision?  will there ever be?
*) what happens when the code that represents that decision has or
develops a bug?
*) why would you not want information supporting that decision in the database?
*) how do you know the tables keying to your mac table are pointing to
the correct record?
*) what are the consequences for ad hoc queries that join directly
against the mac?  DISTINCT?

If your data modeler that made the the assumptions that a MAC is
unique (a mistake obviously) at least the other tables are protected
from violations of that assumption because the database would reject
them with an error, which is a Good Thing.  Without a uniqueness
constraint you now have ambiguous data which is a Very Bad Thing.
Without proper unique constraints, a generated key is effectively
saying "well, I cant' figure this out right now...I'll deal with it
later".  That unmanaged complexity is now in the application and all
the queries that touch the database...you've created your own bug
factory.  With a uniqueness constraint, you have a rigorous definition
of what your record represents, and other entities in the database can
now rely on that definition.

Natural key designs are good for a lot of reasons, but #1 on the least
is that they force you to deal with problems in your data model up
front because they force you to define unqiueness.  If the MAC turns
out not to be unique and the problem is not in fact coming from the
input data or the application, yes, you do have to correct the model
but at least the data inside the database is clean, and can be
unambiguously mapped to the new model.  I'll take schema changes over
bad data.

Correcting the model means you have to figure out whatever information
is used to distinguish identical MACs #1 and #2 is stored in the
database because now your data and the corresponding decisions are
verifiable, repeatable, unambiguous, etc.  What extra field you have
to add to your 'mac' table to make it unique would depend on certain
things, but it's certainly a solvable problem, and when solved would
give you a more robust database.

merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general