From: Dev Nop
Sent: Friday, September 23, 2016 3:12 AM

My question is whether to put all documents into a single huge table or to partition them by document. The documents are independent, so it's purely a performance question. There are too many tables for PostgreSQL's partitioning support, and I wouldn't get any benefit from a master table and its constraints; handling the partitioning in application logic is effectively zero cost.

I know that thousands of tables are regarded as an anti-pattern, but I can only see performance and maintenance benefits in one table per independent document, e.g. fast per-table vacuum, incremental schema updates, and easy future sharding. A monster table would require additional key columns and indexes that have no value beyond allowing the documents to sit in the same table.

The only downside seems to be the system-level per-table overhead, but I only see that as a problem if I end up with a very long tail of tiny documents. I'd rather solve that problem if it occurs than manage an all-eggs-in-one-basket monster table.
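To make the two layouts concrete, here is a rough sketch (the table and column names are made up for illustration, and the payload column is just a placeholder for the real document columns):

-- Option A: one "monster" table. Every row carries a document key, and the
-- composite index exists only so that documents can share the table.
CREATE TABLE doc_rows (
    doc_id   bigint  NOT NULL,    -- extra key column with no meaning within a document
    row_num  integer NOT NULL,
    payload  text    NOT NULL,    -- placeholder for the real row columns
    PRIMARY KEY (doc_id, row_num)
);

-- Option B: one table per independent document, named and created from
-- application logic. No per-row document key or cross-document index is
-- needed, and maintenance stays per-table, e.g. VACUUM ANALYZE doc_rows_00042;
CREATE TABLE doc_rows_00042 (
    row_num  integer NOT NULL PRIMARY KEY,
    payload  text    NOT NULL
);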
The one factor I haven't fully resolved is how much a caching layer in front of the database changes things.

Thanks for your help.

---------------------------------

This is, to me, a very standard, almost classic, relational pattern, and one that a relational engine handles extremely well, especially the consistency and locking needed to support lots of updates. Inserts are irrelevant unless the parent record must be locked to insert a child, which would be a bad design.

Imagine a normal parent-child table pair, 1:M, with the 20k rows per parent document in the child table. Unless there is something very bizarre about the access patterns against that child table, those 20k rows per document would not normally all be in play for every user on every access (it's too much data to show on a web page, for instance). Even so, at "100s" of large queries per day the load is trivial, unless each child row contains a large JSON blob, which doesn't jibe with your table description. So with proper indexing, I can't see where there would be a performance issue. Worst case, you create a few partitions based on some category, but the row counts you're describing don't yet warrant it.

I'm running a few hundred million rows in a new "child" table on a dev server (4 cores / 16 GB RAM) with large JSON documents in each row, and normal queries are still fast enough for web pages, using a paging model (say, 20 full rows per web page request).

The critical pieces, hardware-wise, are memory (buy as much as you can afford) and SSDs (required, IMO); it's much harder to create a measurable load on the CPUs. Amazon has memory-optimized EC2 instances (with SSD storage) that support this pattern.

Are there other issues or requirements creating performance concerns that aren't obvious in your initial post?

Mike Sofen (Synthetic Genomics)
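P.S. To make the pattern concrete, here is a minimal sketch of the parent-child pair and the kind of paged query I have in mind (the names, the jsonb payload column, and the 20-row page size are illustrative assumptions, not details from your post):

-- Parent-child, 1:M: one parent row per document, ~20k child rows each.
CREATE TABLE documents (
    doc_id  bigserial PRIMARY KEY,
    title   text      NOT NULL
);

CREATE TABLE document_rows (
    doc_id   bigint  NOT NULL REFERENCES documents (doc_id),
    row_num  integer NOT NULL,
    payload  jsonb   NOT NULL,        -- assumed payload shape
    PRIMARY KEY (doc_id, row_num)     -- composite index that drives the paged reads
);

-- Paging model: one web page request fetches the next 20 child rows of a
-- document, resuming after the last row already shown (keyset paging).
SELECT row_num, payload
FROM   document_rows
WHERE  doc_id = $1
  AND  row_num > $2                   -- last row_num from the previous page, 0 to start
ORDER  BY row_num
LIMIT  20;

Each page request resolves to a single range scan on the composite primary key, which is why the cost per request stays roughly flat even with hundreds of millions of rows in the child table.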