On Mon, Nov 18, 2013 at 5:39 PM, Stefan Keller <sfkeller@xxxxxxxxx> wrote:
Hi Jeff and MartinOn 18. November 2013 17:44 Jeff Janes <jeff.janes@xxxxxxxxx> wrote:> I rather doubt that. All the bottlenecks I know about for well cached read-only workloads are around> locking for in-memory concurrency protection, and have little or nothing to do with secondary storage.Interesting point. But I think this is only partially the case - as Stonebraker asserts [1]. While I don't see how to speed-up locking (and latching), AFAIK there is quite some room for enhancement in buffer pooling (see also [2]). Especially in GIS environments there are heavy calculations and random access operations - so buffer pool will play a role.
I don't know what you mean about enhancements in the buffer pool. For an in-memory database, there shouldn't be a buffer pool in the first place, as it is *all* in memory. But anyway, it seems to me like the imported database size for the OpenStreetMap is going to be about 500GB (extrapolating from Antarctica, the only file I could download within a reasonable time), and none of the servers listed in the benchmark had anywhere near that amount of memory, so that has little chance of working as an in-memory database.
Improvements made for just in-memory situations suddenly become worthless if your data grows 10% larger and no longer fits in memory, while improvements that work for everyone work for everyone. There is a place for both, but it is not surprising that a project to make general-purpose tools spends more time on the latter than the former.
To Martin: Stonebraker explicitly supports my hypothesis that in-memory databases become prevalent in the future and that "elephants" will be challenged if they don't adapt to new architectures, like in-memory and column stores.
This would be more persuasive if there were impressive osm2vertica or osm2oracle benchmarking results to show. Otherwise the claims just look like commercial marketing material to me. I'm not saying there are no improvements to be made, but the way to make them is to figure out where the bottlenecks are, not read other people's advertisements and chase them.
The specific use case here is a PostGIS query of an OpenStreetMap data of the whole world (see [3]).
>> BTW: Having said (to Martijn) that using Postgres is probably more efficient, than programming an in-memory> database in a decent language: OpenStreetMap has a very, very large Node table which is heavily> used by other tables (like ways) - and becomes rather slow in Postgres.>> Do you know why it is slow? I'd give high odds that it would be a specific implementation detail in> the code that is suboptimal, or maybe a design decision of PostGIS, rather than some high level> architectural decision of PostgreSQL.Referring to the application is something you can always say - but shouldn't prevent on enhancing Postgres.
Postgres has been enhanced. Now we need to change osm2pgsql to take advantage of them. It defines indexes on the tables that are going to be bulk loaded with COPY, which defeats some recent optimizations made to COPY. The creation of the indexes should be delayed until after the bulk load is done.
A further enhancement to Postgres would be would be to automatically defer creation of the indexes when a table is truncated or created within a transaction, so that users get the benefit of the improvement even without taking special steps. But that would be quite a bit of work, and would probably also change user-visible behavior, in that violations of unique constraints would not be detected until later in the process.
Cheers,
Jeff