Re: Database migration and redesign

"Brandon Aiken" <BAiken@xxxxxxxxxxxxxxx> · Mon, 11 Sep 2006 14:55:41 -0400

Excellent, I managed to find one of the major sticking points all by
myself!  This is exactly what I was looking for.  Thanks!

There seems to be a lot of that in the DB world.  Practical vs
theoretical.  Or pragmatic vs strict.  It seems to be whether you came
from a math background -- in which case you're concerned with logical
data sets in the most effective theory possible -- or from a software
engineering background -- in which case you're concerned about the
usability and performance of the database software, particularly in how
other applications reference and access the database.

There are advantages to both schools.  A highly theoretical database can
also be highly normalized and therefore have very agile data models
built on top of them.  However, the practical limitation of computerized
relational databases means that they can't implement all features of the
relational database model very well or very efficiently.  Additionally,
while the logical models built on top of the theoretical DB structure
can be very agile, designing those models is very complex due to the
complex nature of the highly normalized relational structure involved.
If you have to do a multiple nested join of 10 tables just to run what
will be a query of the most basic object in the final app, you've
probably normalized way too far.

I think I can draw some lines of distinction now, though.  There are
some places where the index is just a means to relate the otherwise
unrelated tables (if that makes sense), and sometimes where using
natural keys will save several table joins on many queries.  This was
what I was seeing, and I was wondering if it was acceptable to mix it up
and do both where it seems to make the most sense.

My next task is a bit more difficult simply to define what the
relationship needs to be.  The app is business management software that
manages, among other things, Jobs and Quotes.  The relationship between
Jobs and Quotes is a little odd.  Quotes are sent to customers, and if
those customers accept the Quote then a Job is created and the Quote is
assigned to it.  So Quotes exist without Jobs.  Additionally, customers
will often request additional services after the original Quote and
these services get Quoted in the same way, so multiple Quotes can be
assigned to a single Job.  

It gets worse.  We use Internal Jobs for cost tracking, and these Jobs
do not have Quotes at all.   Also, there are some Jobs that get 'verbal'
quotes (that is, quotes given outside the normal quoting system).  So
Jobs exist without Quotes.

So, one Job can be associated with many Quotes, so Quote is the child of
the relationship even though it gets created first.  There can be 0 or 1
parents, and 0, 1, or many children in any combination.  I can simply
define foreign keys normally and use NULL values where no relationship
exists, but isn't there a better way to do this?  Allowing NULLs is one
of the problems with many of these tables.

Should I create another table with two fields, one for the Quote number
and one for the Job number, and then have those two fields comprise a
compound primary key (and Quote having a unique constraint for itself as
well)?  That way I don't think I can get logically orphaned children
since both tables are parents to this third table.  Is that how it's
supposed to be done?

There are several places across the DB where this style relationship
occurs, and I'd like to try to conform to best practices (especially
since the last guy managed to miss just about every single one).

--
Brandon Aiken
CS/IT Systems Engineer

-----Original Message-----
From: Merlin Moncure [mailto:mmoncure@xxxxxxxxx] 
Sent: Monday, September 11, 2006 1:08 PM
To: Brandon Aiken
Cc: pgsql-general@xxxxxxxxxxxxxx
Subject: Re: [GENERAL] Database migration and redesign

On 9/11/06, Brandon Aiken <BAiken@xxxxxxxxxxxxxxx> wrote:
> My question relates to primary keys.  The vast majority of tables have
a
> primary key on a single char or varchar field.  Is it considered
better
> practice to create a serial type id key to use as the primary key for
the
> table, and then create a unique index on the char and varchar fields?
> Should foreign keys reference the new primary id or the old unique
key?
> What about compound [primary] keys?

This is a somewhat controversial topic.  I personally am not a big fan
of Ambler or his approach to database design.  This is more or less
the surrogate/natural key debate which is a perennial flamewar in the
database world.

I tend to think natural keys are generally superior but surrogates are
useful in some situations.  Automatic use of ID column in every table
leads to lazy thinking and overcomplicated designs.  However at times
they can really optimize your database.  So I will lay down some
extremes and leave it up to you to find the middle ground.

A trivial case of when not to use a serial key is like this:
create table sex(id serial, sex char(1));
insert into sex(sex) select 'M' union all select 'F';

The reason is obvious, adding a join for no reason whenever you need
to know the sex, albeit a simple one, and expanding the 1 character
type to an int type.

A somewhat more interesting case is:
create table email(id serial, email text);

In this case, while it may seem like a waste to store the full email
in every table that references the email, you are optimizing the join
out in such cases, which can be a big win and since there is no other
properties of the email the email table only serves the purpose of
maintaining relational integrity, iow no duplicates.  In the event the
email changes, we allow RI to cleanup the other tables...an integer
proxy would be (at least to me) an meaningless abstraction of the
email.

A case of when not to use a natural key for relating is a bit more
complex, some times you  just get sick and tired of writing the key
fields over and over, or you have measured and determined the natural
key to be wasetful in terms of index performance.  Another reason is
if the p-key data changes frequently and the RI mechism is too
expensive.  In this case I would advise you to strictly key on the
natural and make a candidate, serial key.

create table foo
(
  nat_key1 text,
  nat_key2 int,
  foo_id serial,
  [...]
  primary key(nat_key1, nat_key2),
  unique(foo_id)  -- candidate unique key
);

This is my middle ground: there highly situational cases where an id
column is a win on pracital reasons.  however, a strong design around
natural keys tends to make you think the problem through much more
carefully and lead to a tighter database.

> Also, any suggestions for good DB design books would be appreciated.
I no
> longer have any good DB design books, and I don't know what's good
anymore.

http://www.amazon.com/exec/obidos/ASIN/0201485559/databasede095-20?creat
ive=327641&camp=14573&adid=07TEH0J3FS9SYN309QMS&link_code=as1

merlin