On 8/01/2010 2:11 AM, Nikolas Everett wrote:
This table is totally unnormalized. Normalize it and try again. You'll probably see a huge speedup. Maybe even 10x. My mantra has always been less data stored means less data to scan means faster scans.
Sometimes one intentionally denormalizes storage, though. JOIN costs can be considerable too, and if most of the time you're interested in all the data for a record not just a subset of it, storing it denormalized is often faster and cheaper than JOINing for it or using subqueries to fetch it.
Normalization or any other splitting of record into multiple separately stored records also has costs in complexity, management, the need for additional indexes, storage of foreign key references, all the extra tuple headers you need to store, etc.
It's still generally the right thing to do, but it should be thought about, not just tackled blindly. I only tend to view it as a no-brainer if the alternative is storing numbered fields ("field0", "field1", "field2", etc) ... and even then there are exceptions. One of my schema at the moment has address_line_1 through address_line_4 in a `contact' entity, and there's absolutely *no* way I'm splitting that into a separate table of address_lines accessed by join and sort! (Arguably I should be using a single `text' field with embedded newlines instead, though).
Sometimes it's even better to hold your nose and embed an array in a record rather than join to an external table. Purism can be taken too far.
Note that Pg's TOAST mechanism plays a part here, too. If you have a big `text' field, it's probably going to get stored out-of-line (TOASTed) anyway, and TOAST is going to be cleverer about fetching it than you will be using a JOIN. So storing it in-line is likely to be the right way to go. You can even force out-of-line storage if you're worried.
In the case of this benchmark, even if they split much of this data out into other tables by reference, it's likely to be slower rather than faster if they still want the data they've split out for most of their queries.
-- Craig Ringer -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance