Re: Air-traffic benchmark

Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> · Fri, 08 Jan 2010 11:45:00 +0800

On 8/01/2010 2:11 AM, Nikolas Everett wrote:
This table is totally unnormalized.  Normalize it and try again.  You'll
probably see a huge speedup.  Maybe even 10x.  My mantra has always been
less data stored means less data to scan means faster scans.

Sometimes one intentionally denormalizes storage, though. JOIN costs can 
be considerable too, and if most of the time you're interested in all 
the data for a record not just a subset of it, storing it denormalized 
is often faster and cheaper than JOINing for it or using subqueries to 
fetch it.

Normalization or any other splitting of record into multiple separately 
stored records also has costs in complexity, management, the need for 
additional indexes, storage of foreign key references, all the extra 
tuple headers you need to store, etc.

It's still generally the right thing to do, but it should be thought 
about, not just tackled blindly. I only tend to view it as a no-brainer 
if the alternative is storing numbered fields ("field0", "field1", 
"field2", etc) ... and even then there are exceptions. One of my schema 
at the moment has address_line_1 through address_line_4 in a `contact' 
entity, and there's absolutely *no* way I'm splitting that into a 
separate table of address_lines accessed by join and sort! (Arguably I 
should be using a single `text' field with embedded newlines instead, 
though).

Sometimes it's even better to hold your nose and embed an array in a 
record rather than join to an external table. Purism can be taken too far.

Note that Pg's TOAST mechanism plays a part here, too. If you have a big 
`text' field, it's probably going to get stored out-of-line (TOASTed) 
anyway, and TOAST is going to be cleverer about fetching it than you 
will be using a JOIN. So storing it in-line is likely to be the right 
way to go. You can even force out-of-line storage if you're worried.

In the case of this benchmark, even if they split much of this data out 
into other tables by reference, it's likely to be slower rather than 
faster if they still want the data they've split out for most of their 
queries.

--
Craig Ringer

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance