Re: Use Postgres as a column store by creating one table per column


 



On Thu, May 23, 2019 at 01:08:42AM -0400, George Neuner wrote:
On Tue, 21 May 2019 21:28:07 -0700, Lev Kokotov <lev.kokotov@xxxxxxxxx> wrote:

Is it efficient to use Postgres as a column store by creating one table per column?

I would query it with something like `[...] UNION SELECT value AS <table>
FROM <table> WHERE value = <value> UNION [...]` to build a row.

I think you mean JOIN.

You'd need more than that: PostgreSQL uses MVCC for concurrency, so whenever you update any row in a table, the physical ordering of the rows within the table can change.  And the JOIN operation is inherently unordered - you need to sort the result deliberately to control ordering.

To emulate a column-store, at the very least you need a way to associate values from different "columns" that belong to the same "row" of the virtual table.  IOW, every value in every "column" needs an explicit "row" identifier.  E.g.,

   col1 = { rowid, value1 }, col2 = { rowid, value2 }, ...

For performance you would need to have indexes on at least the rowid in each of the "column" tables.
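To make the rowid scheme above concrete, here is a minimal sketch of reassembling a "row" from per-column tables.  It uses Python's stdlib sqlite3 as a stand-in for Postgres (the SQL is the same shape); the table and column names are invented for illustration.

```python
import sqlite3

# Hypothetical sketch: emulate a two-column "virtual table" with one
# table per column, associated by an explicit row identifier.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE col1 (row_id INTEGER PRIMARY KEY, value TEXT)")
cur.execute("CREATE TABLE col2 (row_id INTEGER PRIMARY KEY, value INTEGER)")
cur.executemany("INSERT INTO col1 VALUES (?, ?)", [(1, "alice"), (2, "bob")])
cur.executemany("INSERT INTO col2 VALUES (?, ?)", [(1, 30), (2, 25)])

# Reassemble a logical row with a JOIN on the row identifier (not UNION),
# and sort explicitly since the JOIN itself guarantees no ordering.
rows = cur.execute(
    "SELECT c1.value, c2.value FROM col1 c1 "
    "JOIN col2 c2 ON c1.row_id = c2.row_id "
    "ORDER BY c1.row_id"
).fetchall()
print(rows)  # [('alice', 30), ('bob', 25)]
```

In Postgres you'd also want an index on row_id in each "column" table, as noted above; here the PRIMARY KEY serves that role.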

This is a bare minimum and can only work if the columns of your virtual table and the queries against it are application controlled or statically known.  If you want to do something more flexible that will support ad hoc table modifications, elastically sized values (strings, bytes, arrays, JSON, XML), etc., this example does not suffice, and the implementation can get very complicated very quickly.


Justin Pryzby was not joking when he said the performance could be awful ... at least as compared to a more normal row-oriented structure.  Performance of a query that involves more than a handful of "columns" will, in general, be horrible.  It is up to you to decide whether some (maybe little) increase in performance when processing *single* columns will offset the likely MASSIVE loss of performance when processing multiple columns.


Maybe take a look at this paper:

   http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf

which essentially compares this approach to a "real" column store.

It certainly won't give you performance comparable to column store, it
adds quite a bit of overhead (disk space because of row headers, CPU
because of extra joins, etc.).

And it can't give you the column-store benefits - compression and/or
more efficient execution.
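The disk-space overhead mentioned above can be estimated with back-of-the-envelope arithmetic.  This sketch assumes PostgreSQL's roughly 23-byte heap tuple header plus a 4-byte line pointer per stored row, and invented figures (50 columns, 8-byte values, bigint rowids) purely for illustration:

```python
# Rough per-row storage overhead, ignoring padding/alignment and page headers.
TUPLE_OVERHEAD = 23 + 4  # ~23-byte heap tuple header + 4-byte line pointer
N_COLS = 50              # columns in the virtual table (assumed)
ROWID = 8                # bigint row identifier per "column" table
VALUE = 8                # an 8-byte value, e.g. bigint

# One wide row in a normal table vs. N_COLS single-value rows, one per table.
row_store = TUPLE_OVERHEAD + N_COLS * VALUE
col_store = N_COLS * (TUPLE_OVERHEAD + ROWID + VALUE)
print(row_store, col_store)  # 427 vs 2150 bytes per logical row
```

Under these assumptions the one-table-per-column layout stores roughly 5x the bytes for the same logical data, before counting the extra indexes on the rowids.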


I'm thinking since Postgres stores tables in continuous blocks of 16MB each (I think that's the default page size?)

Default page size is 8 KB.  You'd have to recompile to change that, and it might break something - a whole lot of code depends on knowing the size of storage pages.



Right. And the largest page size is 32kB. But 8kB is a pretty good
trade-off, in most cases.
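For the record, changing the page size means a source build; this is a sketch of the relevant configure flag (the block size is given in kilobytes, 8 being the default), and a running server reports its compiled-in value via SHOW block_size:

```shell
# Sketch, assuming a PostgreSQL source tree; requires initdb of a fresh
# cluster afterwards - existing data directories are not compatible.
./configure --with-blocksize=32
make && make install
```
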

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



