On Mon, May 2, 2011 at 10:10 AM, Greg Smith <greg@xxxxxxxxxxxxxxx> wrote:
> On 05/01/2011 06:12 PM, Karsten Hilbert wrote:
>>
>> Good to know since I'm only a lowly medical doctor not
>> having much schooling in database matters beyond this list,
>> the PostgreSQL docs, and the Celko book.
>>
>
> This debate exists at all levels of experience, and the only thing that
> changes as you get more experienced people involved is an increase in
> anecdotes on each side. The sole time I ever found myself arguing with
> Joe Celko was over an article he wrote recommending natural keys, using
> an example from the automotive industry. The problem was, the specific
> example he gave was flat out wrong. I was working in automotive MIS at
> the time, and the thing he said would never change did, in fact, change
> every year--in only a fraction of a percent of cases, in an extremely
> subtle way that snuck up on people and wreaked much confusion. That's
> typical of an early natural key design: you get it working fine in V1.0,
> only to discover months or years down the road that there's a case you
> never considered and don't model correctly, and it may take some sort of
> conversion to fix.
>
> The reason there's a strong preference for surrogate keys is that they
> always work, and you can avoid ever needing to come up with a better
> design if you just use them and forget about it. The position Merlin has
> advocated here--that there should always be a natural key available if
> you know the data well enough--may be true. But few people are good
> enough designers to be sure they've made that decision correctly, and
> the downside of being wrong can be a long, painful conversion process.
> It's easier for most people to eliminate the possibility of making a
> mistake by using auto-generated surrogate keys, where the primary
> problem you'll run into is merely using more space/resources than you
> might otherwise need.
> It minimizes the worst case--a mistake made in the model, requiring an
> expensive re-design--by adding overhead that makes the average case more
> expensive. Software design usually has enough risks that any time you
> can eliminate one just by throwing some resources at it, that's normally
> the right thing to do.

There are many practical arguments advocating the use of surrogate keys:
faster updates, easier schema changes, performance, lower maintenance
costs down the line, better tools integration, etc. These arguments
basically involve trade-offs that can justifiably be used to push you one
way or the other. That's all well and good.

However, I tend to disagree with arguments that you are better off not
identifying natural keys at all. To my mind, any database that has a table
with no discernible key, but whose records are referred to via another
table's foreign key, has a schema that is in a State of Error. A surrogate
key is just added information in the database -- why is it that record X,
out of N identical candidates, should be the one mapped? Is that logic
repeatable? What are the ramifications for joins that do not flow through
the ID columns?

Typically what it means is that the rules that guard against duplicate
information entry are not, in fact, in the database at all but in the
application, and bad data can now get into your database through a much
broader array of causes. The last and best defense against a nasty and
common class of data errors has been removed. The more complex your
database is, the more it tends to be used by a large array of clients,
possibly even spanning multiple computer languages -- thus the need for a
root system of constraint checking that is declarative and easily
understood. Sure, requirements change, and models change, but at any
particular point in time a model with as few outside inputs as possible
(read: none) should be able to provably demonstrate verifiable facts.
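To make the point concrete, here is a minimal sketch of keeping the
duplicate-rejection rule in the database itself. It uses SQLite via
Python's sqlite3 module purely as a stand-in for PostgreSQL, and the
`vehicle` table and sample VIN are hypothetical: the surrogate `id` is
there for joins and referential integrity, while a UNIQUE constraint on
the natural key (`vin`) means no application code has to remember to
check for duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Surrogate key (id) for RI/joins, plus a declared natural key (vin).
# The UNIQUE constraint puts the "no duplicate vehicles" rule in the
# database, not in every client application.
cur.execute("""
    CREATE TABLE vehicle (
        id   INTEGER PRIMARY KEY,   -- surrogate key
        vin  TEXT NOT NULL UNIQUE,  -- natural key, enforced declaratively
        make TEXT NOT NULL
    )
""")

cur.execute("INSERT INTO vehicle (vin, make) VALUES (?, ?)",
            ("1HGCM82633A004352", "Honda"))

# A second row with the same VIN is rejected by the database itself,
# regardless of which client or language attempted the insert.
try:
    cur.execute("INSERT INTO vehicle (vin, make) VALUES (?, ?)",
                ("1HGCM82633A004352", "Honda"))
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The same shape in PostgreSQL would be a SERIAL/identity primary key plus
a UNIQUE NOT NULL column; other tables can reference either column, so
the natural key still guards the data even if RI flows through the id.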
With a natural key database (or a surrogate key database with defined keys
that are not used for RI) you have inherent constraint checking that a
purely surrogate database simply doesn't have. Whatever the software
maintenance costs are -- itself a complex and debatable topic -- I'll go
with the strategy that gives a better defense against bad or ambiguous
data.

merlin

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general