Re: Postgres for a "data warehouse", 5-10 TB

Robert Klemme <shortcutter@xxxxxxxxxxxxxx> · Mon, 12 Sep 2011 23:26:10 +0200

On 12.09.2011 19:22, Andy Colson wrote:
On 9/12/2011 12:15 PM, Robert Klemme wrote:
On 11.09.2011 19:02, Marti Raudsepp wrote:
On Sun, Sep 11, 2011 at 17:23, Andy Colson<andy@xxxxxxxxxxxxxxx> wrote:
On 09/11/2011 08:59 AM, Igor Chudov wrote:
By the way, does that INSERT UPDATE functionality or something like
this exist in Postgres?
You have two options:
1) write a function like:
create function doinsert(_id integer, _value text) returns void as
2) use two sql statements:

Unfortunately both of these options have caveats. Depending on your
I/O speed, you might need to use multiple loader threads to saturate
the write bandwidth.

However, neither option is safe from race conditions. If you need to
load data from multiple threads at the same time, they won't see each
other's inserts (until commit) and thus cause unique violations. If
you could somehow partition their operation by some key, so threads
are guaranteed not to conflict each other, then that would be perfect.
The 2nd option given by Andy is probably faster.

You *could* code a race-condition-safe function, but that would be a
no-go on a data warehouse, since each call needs a separate
subtransaction which involves allocating a transaction ID.

Wouldn't it be sufficient to reverse order for race condition safety?
Pseudo code:

begin
insert ...
catch
update ...
if not found error
end

Speed is another matter though...

No, I dont think so, if you had two loaders, both would start a
transaction, then neither could see what the other was doing.

It depends.  But the point is that not both INSERTS can succeed.  The 
one which fails will attempt the UPDATE and - depending on isolation 
level and DB implementation - will be blocked or fail.

In the case of PG this particular example will work:

1. TX inserts new PK row
2. TX tries to insert same PK row => blocks
1. TX commits
2. TX fails with PK violation
2. TX does the update (if the error is caught)

There are
transaction isolation levels, but they are like playing with fire. (in
my opinion).

You make them sound like witchcraft.  But they are clearly defined - 
even standardized.  Granted, different RDBMS might implement them in 
different ways - here's PG's view of TX isolation:

http://www.postgresql.org/docs/8.4/interactive/transaction-iso.html

In my opinion anybody working with RDBMS should make himself familiar 
with this concept - at least know about it - because it is one of the 
fundamental features of RDBMS and certainly needs consideration in 
applications with highly concurrent DB activity.

Kind regards

	robert

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance