Re: Postgres as In-Memory Database?

Stefan Keller <sfkeller@xxxxxxxxx> · Mon, 31 Mar 2014 02:06:11 +0200

Hi Jeff

2013/11/20 Jeff Janes <jeff.janes@xxxxxxxxx>

On Mon, Nov 18, 2013 at 5:39 PM, Stefan Keller <sfkeller@xxxxxxxxx> wrote:

Hi Jeff and Martin

On 18. November 2013 17:44 Jeff Janes <jeff.janes@xxxxxxxxx> wrote:

> I rather doubt that.  All the bottlenecks I know about for well cached read-only workloads are around 

> locking for in-memory concurrency protection, and have little or nothing to do with secondary storage.  

Interesting point. But I think this is only partially the case - as Stonebraker asserts [1]. While I don't see how to speed-up locking (and latching), AFAIK there is quite some room for enhancement in buffer pooling (see also [2]). Especially in GIS environments there are heavy calculations and random access operations - so buffer pool will play a role.

I don't know what you mean about enhancements in the buffer pool.  For an in-memory database, there shouldn't be a buffer pool in the first place, as it is *all* in memory.  

You are right: In-memory DBs are making buffer-pooling obsolete - except for making data persistent (see below).

But anyway, it seems to me like the imported database size for the OpenStreetMap is going to be about 500GB (extrapolating from Antarctica, the only file I could download within a reasonable time), and none of the servers listed in the benchmark had anywhere near that amount of memory, so that has little chance of working as an in-memory database. 

Improvements made for just in-memory situations suddenly become worthless if your data grows 10% larger and no longer fits in memory, while improvements that work for everyone work for everyone.  There is a place for both, but it is not surprising that a project to make general-purpose tools spends more time on the latter than the former.

To Martin: Stonebraker explicitly supports my hypothesis that in-memory databases become prevalent in the future and that "elephants" will be challenged if they don't adapt to new architectures, like in-memory and column stores.

This would be more persuasive if there were impressive osm2vertica or osm2oracle benchmarking results to show.  Otherwise the claims just look like commercial marketing material to me.  I'm not saying there are no improvements to be made, but the way to make them is to figure out where the bottlenecks are, not read other people's advertisements and chase them.

The specific use case here is a PostGIS query of an OpenStreetMap data of the whole world (see [3]).

On 2013/11/18 Jeff Janes <jeff.janes@xxxxxxxxx> wrote:

>> On Sun, Nov 17, 2013 at 4:02 PM, Stefan Keller <sfkeller@xxxxxxxxx> wrote:

>> BTW: Having said (to Martijn) that using Postgres is probably more efficient, than programming an in-memory 

> database in a decent language: OpenStreetMap has a very, very large Node table which is heavily 
> used by other tables (like ways) - and becomes rather slow in Postgres.
> 
> Do you know why it is slow?  I'd give high odds that it would be a specific implementation detail in 

> the code that is suboptimal, or maybe a design decision of PostGIS, rather than some high level 
> architectural decision of PostgreSQL.

Referring to the application is something you can always say - but shouldn't prevent on enhancing Postgres.

Postgres has been enhanced.  Now we need to change osm2pgsql to take advantage of them.  It defines indexes on the tables that are going to be bulk loaded with COPY, which defeats some recent optimizations made to COPY.  The creation of the indexes should be delayed until after the bulk load is done.  

A further enhancement to Postgres would be would be to automatically defer creation of the indexes when a table is truncated or created within a transaction, so that users get the benefit of the improvement even without taking special steps.  But that would be quite a bit of work, and would probably also change user-visible behavior, in that violations of unique constraints would not be detected until later in the process.

These enhancements to osm2pgsql seem to be reasonable to me. I hope somebody has time to care about.

In the meantime I discussed with HANA users an thought about what makes in-memory dbs special and how to configure Postgres to be an in-memory db.

There seem to be two main things which make in-memory dbs special: 
1. Index: Having all data in memory there is no need for a default index. A full-table scan "suddenly" becomes the default.
2. Persistence: All updates/changes to the db are streamed to disk.
3. An increase of data simply needs to be compensated with more memory (since memory became cheap).

AFAIK one can configure Postgres for all these properties: One can force the use of full-table scan(?) and replicate to disk(?).

What do you think about this? Are there any show cases out there?

-S.