On Fri, Oct 23, 2009 at 5:30 PM, Greg Smith <gsmith@xxxxxxxxxxxxx> wrote:
>
> Your thinking this is a business/research distinction isn't right, it comes
> down to the size of the data set and how fast/expressive the queries against
> them are. In the research realm, you can watch this struggle play out in
> things like bioinformatics, where there are giant stacks of very similar
> data they need to sort through. What often ends up happening is you have
> scientists start with a simple EAV system, then watch it completely fall
> down under real-world load and complicated queries once it gets past
> prototype. Then they try to patch it for a while by reinventing SQL query
> and storage concepts (badly).

I've got little idea what the OP is really dealing with, but there is a
fundamental difference between the research world and the more traditional
business world. On the research side the data models are not well known in
advance, there are few established best practices, and even the data types
are not always well known.

In a conventional schema we would have about 2,000+ tables with 20,000+
columns. This is a 6-year-old system, with 75% of our metadata having been
added over the last 3 years. I expect growth to slow over the next 3 years,
but 25% growth would not be unreasonable. The current physical schema is
some 40 tables with maybe 300 columns (we also render about 6,000 customized
presentations of the data and manage workflow and validation with this
schema). Performance tuning can be a pain, but we've mostly got that
mastered at this stage in the game: it's a trade-off, but using a
conventional schema would have consumed far more resources than we've spent
over the last 6 years. The first two versions of the system used a
conventional schema, and new research protocols were taking 6 months to a
year to add. We can now add simple ones in a day or two and complex ones in
a couple of months (the largest having some 60,000+ pieces of metadata,
including validation rules and workflow).

> Eventually someone with a traditional database background comes along,
> notes that the data you really need is for the most part predetermined,
> rewrites that into something more amenable to standard SQL, and then the
> whole thing performs better for that subset. But now you've got stuff like
> ALTER TABLE to add a column every time you want to track something new, and
> people don't like that when the schema is still fluid. So they start
> putting stuff into EAV stores for their next project... and the cycle
> begins anew.
>
> Ultimately you can be really flexible in how you store your data, or you
> can get good query performance, but it's quite hard to do both.

Not much argument there! However, it's a three-way trade-off: such a schema
can also get good performance for a small number of users and/or a small
amount of data (we're about 300 users and 45 GB of total data).

--
Peter Hunsberger
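
[Editor's note: for readers unfamiliar with the EAV pattern being debated
above, here is a minimal, hypothetical sketch of an EAV-style layout in
PostgreSQL. The table and column names are invented for illustration and are
not the poster's actual 40-table schema.]

    -- Hypothetical EAV-style layout, for illustration only.
    CREATE TABLE entity (
        entity_id   serial PRIMARY KEY,
        entity_type text NOT NULL
    );

    CREATE TABLE attribute (
        attribute_id serial PRIMARY KEY,
        name         text NOT NULL,
        data_type    text NOT NULL       -- e.g. 'text', 'numeric', 'date'
    );

    CREATE TABLE value (
        entity_id    integer NOT NULL REFERENCES entity,
        attribute_id integer NOT NULL REFERENCES attribute,
        value_text   text,               -- stored as text, cast on read
        PRIMARY KEY (entity_id, attribute_id)
    );

    -- Adding a new "column" is just a row in attribute, which is why the
    -- schema can stay fluid without ALTER TABLE -- and also why complex
    -- queries end up reinventing SQL concepts on top of these three tables.
    INSERT INTO attribute (name, data_type) VALUES ('sample_weight', 'numeric');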