
Re: JOIN column maximum


 



Lee Hachadoorian wrote:
On 01/05/2012 06:18 PM, Tom Lane wrote:
Are there really 23000 populated values in each row?  I hesitate to
suggest an EAV approach, but it kinda seems like you need to go in that
direction.  You're never going to get decent performance out of a schema
that requires 100-way joins, even if you avoid bumping up against hard
limits.
Many of the smaller geographies, e.g. census tracts, do in fact have data for the vast majority of the columns. I am trying to combine it all into one table to avoid the slowness of multiple JOINs (even though in practice I'm never joining all the tables at once). EAV sounds correct in terms of normalization, but isn't it usually better performance-wise to store write-once/read-many data in a denormalized (i.e. flattened) fashion? One of these days I'll have to try to benchmark some different approaches, but for now planning on using array columns, with each "sequence" (in the Census sense, not the Postgres sense) of 200+ variables in its own array rather than its own table.
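The array-column plan described above might look something like this (a hypothetical sketch; the table and column names are illustrative, not from the thread):

```sql
-- One array column per Census "sequence" instead of one table per sequence.
CREATE TABLE acs_tract (
    geoid   varchar(20) PRIMARY KEY,
    seq0001 numeric[],  -- the ~200 estimates from sequence 1
    seq0002 numeric[]   -- ... and so on for each sequence
);

-- A single variable is then addressed by its position within its sequence:
SELECT geoid, seq0001[5] AS some_estimate
FROM acs_tract;
```

This keeps each row well under the column limit while avoiding the 100-way join, at the cost of addressing variables by array position rather than by name.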

EAV is not necessarily more correct than what you're doing.

The most correct solution is one where your database schema defines, and the DBMS enforces, all of the constraints or business rules on your data, so that you can not put something in the database that violates the business rules.

Traditional EAV, if you're talking about the common binary table of unconstrained field-name,field-value pairs, is not an improvement.
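For illustration, the "traditional EAV" shape being criticized is essentially this (a hypothetical sketch):

```sql
-- Nothing here stops a misspelled field name or a malformed value:
-- the DBMS can enforce no type, range, or existence constraints.
CREATE TABLE eav (
    entity_id  integer,
    field_name text,  -- unconstrained: any spelling is accepted
    field_val  text   -- unconstrained: everything is stored as text
);
```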

A more correct solution is to use different columns for things with different business rules or data types. If the DBMS can't handle this then that is grounds for improving the DBMS.

There's no reason that joins *have* to be slow, and in some DBMS designs you can join in linear time; it's all about how you implement.

This all being said, 23K values per row just sounds wrong, and I can't imagine any census form having that many details.

Do you, by chance, have multiple values of the same type that are in different fields, e.g. telephone_1, telephone_2 or child_1, child_2, etc.? You should take any of those and collect them into array-typed fields, or into separate tables with just telephone or child columns. Or do you, say, have a set of coordinates in separate fields? Or you may have other kinds of redundancy within single rows that are best normalized into separate rows.
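As a hypothetical before/after for repeated same-type fields (the person and telephone names are made up for illustration):

```sql
-- Before: telephone_1, telephone_2, ... spread across one wide row.

-- After, the array form:
ALTER TABLE person ADD COLUMN telephones text[];

-- Or after, the separate-table form, one row per value:
CREATE TABLE person_telephone (
    person_id integer REFERENCES person (id),
    telephone text NOT NULL,
    PRIMARY KEY (person_id, telephone)
);
```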

With 23K values, these probably have many mutual associations, and you could split that table into a bunch of other ones where columns that relate more closely together are collected.
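A vertical split along those mutual associations might be sketched like this (hypothetical names; the real groupings would follow the data's own structure):

```sql
-- Columns that relate closely live together; the shared key
-- lets you join a group back in only when you actually need it.
CREATE TABLE tract_population (
    geoid      varchar(20) PRIMARY KEY,
    total_pop  integer,
    median_age numeric
);

CREATE TABLE tract_housing (
    geoid       varchar(20) PRIMARY KEY REFERENCES tract_population (geoid),
    total_units integer,
    median_rent numeric
);
```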

What I said in the last couple of paragraphs is probably the first thing to fix, so that you end up with both a better design and better performance on the DBMS you have.

-- Darren Duncan

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

