Re: High-Concurrency GiST in postgreSQL

"C. Mundi" <cmundi@xxxxxxxxx> · Mon, 5 Dec 2011 14:34:00 -0700

On Mon, Dec 5, 2011 at 12:26 PM, Andy Colson <andy@xxxxxxxxxxxxxxx> wrote:

On 12/5/2011 12:31 PM, C. Mundi wrote:

Hello.  This is my first post.  As such, feedback on style and choice of

venue are especially welcome.

I am a regular but not especially expert user of a variety of databases,

including postgreSQL.

I have only modest experience with spatial databases.

I have a new project[1] in which GiST could be very useful, provided I

can achieve high concurrency.<SNIP>

concurrency here can mean different things.  One application hitting PG which then uses multiple threads? (Not currently possible)  Or one app with multiple threads each having a database connection?  (Which is really the same as) Multiple app's each having a database connection?

PG limits one database connection to one cpu.  Multiple connections will use multiple cpu.

OR, by concurrency, do you mean, non-blocking?  And if you mean non-blocking, is that for read's, write's, or both?

In PG you can do non-blocking, multiple connections (ie multiple cpu), reads as much as you want.

Extending to indexes: many connections can read a gist index at the same time.  Is that what you need?

-Andy

Thanks, Andy.  You're quite right of course.  I'm thinking of concurrent clients.  Lots of them.  I envision thousands of actors (which could be threads within a process or separate processes) behaving as clients, each with its own connection to a single-threaded database server.  So concurrency here simply means that the database has to be able to handle a lot of "simultaneous" connections coming at it fast and asynchronously.  (I'm prepared to deploy a thin queuing layer if it turns out that I saturate the server.)  The compute nodes are doing heavy physics which depends on spatially organized data, and it is very difficult to predict which rows an actor will need next.  (In fact, knowing that would presuppose that the problem to be solved could be factored at great savings in computation.) 

So what I really need is minimal locking, as in [Karnacker and Banks 1995]. The whole database can be pre-loaded before the start of the calculation.  Now about 80% of the data in the database will never change during a run.  But about 20% will change via "feedback" from the compute nodes.  And the nature of the problem is that we do not know a priori which data is in the 80% and which is in the 20%.  If we did, we could split the database to ensure no block-on-write impact on reads for the 80%.  Alas, we have to assume that reads and writes are mixed with statistics yet unknown.

The 20% which changes changes spatially, not just in content.  This can lead to the need to rebalance on inserts.  And since splitting nodes in "naive" R* trees is kind of expensive [1], I am wondering to what extent the "sibling links" approach described by Karnacker and Banks has -- as anticipated by Hellerstein et al. -- already been implemented in GiST in postgreSQL.  If it has, then I win just by using the 'cube' contrib extension.  Hellerstein notes that sibling pointers have long been in common use even in B-trees; so I am optimistic for GiST.

So that's my concern.  I'm doing 80% reads which are all non-blocking with 20% writes mixed in, and I need to avoid the effect of writes blocking queries which do not need to traverse branches affected by the write.  

Thanks!
Carlos

[1] Even if inserts were not potentially expensive for the database, the prospect that an insert triggered by one compute node could occasionally cause *all* the compute to stall when not logivally necessary is horrifying.