Lee Hachadoorian wrote:
On 01/05/2012 06:18 PM, Tom Lane wrote:
Are there really 23000 populated values in each row? I hesitate to
suggest an EAV approach, but it kinda seems like you need to go in that
direction. You're never going to get decent performance out of a schema
that requires 100-way joins, even if you avoid bumping up against hard
limits.
Many of the smaller geographies, e.g. census tracts, do in fact have
data for the vast majority of the columns. I am trying to combine it all
into one table to avoid the slowness of multiple JOINs (even though in
practice I'm never joining all the tables at once). EAV sounds correct
in terms of normalization, but isn't it usually better performance-wise
to store write-once/read-many data in a denormalized (i.e. flattened)
fashion? One of these days I'll have to try to benchmark some different
approaches, but for now planning on using array columns, with each
"sequence" (in the Census sense, not the Postgres sense) of 200+
variables in its own array rather than its own table.
EAV is not necessarily more correct than what you're doing.
The most correct solution is one where your database schema defines, and the
DBMS enforces, all of the constraints or business rules on your data, so that
you can not put something in the database that violates the business rules.
Traditional EAV, if you're talking about the common binary table of
unconstrained field-name,field-value pairs, is not an improvement.
A more correct solution is to use different columns for things with different
business rules or data types. If the DBMS can't handle this then that is
grounds for improving the DBMS.
There's no reason that joins *have* to be slow, and in some DBMS designs you can
join in linear time, its all about how you implement.
This all being said, 23K values per row just sounds wrong, and I can't imagine
any census forms having that many details.
Do you, by chance, have multiple values of the same type that are in different
fields, eg telephone_1, telephone_2 or child_1, child_2 etc? You should take
any of those and collect them into array-typed fields, or separate tables with
just telephone or child columns. Or do you say have a set of coordinates in
separate fields? Or you may have other kinds of redundancy within single rows
that are best normalized into separate rows.
With 23K values, these probably have many mutual associations, and you could
split that table into a bunch of other ones where columns that relate more
closely together are collected.
What I said in the last couple paragraphs is probably your earliest best thing
to fix, so you both have a better design and it performs together on the DBMS
you have.
-- Darren Duncan
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general