Re: is there a select for update insert if not exist type command?

Edson Richter <edsonrichter@xxxxxxxxxxx> · Sat, 9 Jun 2012 23:59:57 -0300

One of the possible strategies that comes to my mind is:

1) Log your "syslog stream" into PostgreSQL database (no need to record 
all message, just a simple table the the event key and a flag 
"processed" field)
2) When problem event arrives, the first server should "select for 
update" on the event of syslog, then update a "processed" field (from 
"0" to "1" or false to true, something like that).

When second server try to get the "select for update" on the events 
table, it will fails, and then it can move ahead in the log looking for 
other errors that need attention.
I cannot say how adversely this strategy would affect your system (or if 
there are other contention involved, like page locks), but seems very 
logical for me.
I've used this to distribute processing on users desktops for a massive 
processing system with success (but using MySQL for storing data).

Regards,

Edson

Em 09/06/2012 23:40, Evan Rempel escreveu:
Thanks for the input. Dave also replied indicating that without more details it
is difficult the really help. I was intentionally vague to see if there was some
SQL standard way like to mysql "insert ... on duplicate update ... " syntax, or
the proposed MSSQL merge command.

Since not, I'll give a lot more detail without writing a book.

We are working on a project where a syslog stream arrives and is parsed in
real time. Every log message is considered to be an "event". Events can usually
be ignored because they are "normal behaviour". Some events indicate a problem
and should create an incident. Repetitions of the same event should not create a new
incident if the current incident is not resolved.

For redundancy, two independent systems will consume and process these events.

Now for the part where postgresql comes in.

When an event that should create an incident is encountered, only one incident
should be created. The incident details are not known until such time as the event
occurs, so no pre-population  of tables can occur. So only one  of the two server should
perform the insert, and then update it with details of successive events, possibly ticketing
system identification, date/time of last pager message that went out.

Once the incident is placed into postgresql, everything is easy, "select for update",
determine all that should take place like paging, updating tickets, recording date/time
of last alert sent to administrators.

It is just that first insert that is the challenge. One system does the insert, only to have the
other do the "select for update". I would like to have an insert that is locked AND visible to
other sessions.

Exclusive lock on the table is an idea, but it serializes ALL new incident creation, and we
only NEED to serialize for the same incident identifier. Since both (all) of the systems
that will be processing the live log stream, in all likelihood all of the servers will always be
working on the same data and thus the same incident, so will always be locking the same
piece of data anyway, so the full table lock may not be any worse.

I was thinking of using and advisory lock, which would also serialize everything, just like a table lock,
but again, that may not be a problem since all processes work on the same data at the same time
anyways.

I could also use a custom "my_locks" table that just has rows with unique values that I do a
"select for update" on to serialize everything. Again, no functional difference from table or advisory locks.

A follow on questions;

Is there anything inherently atomic about a stored procedure?
Does the stored procedure simply run within the transaction context of where it is called from?

begin transaction
   - select storedProc1
   - select storedProc2
commit

Would the actions of both stored procedures would be a single atomic action?

Thanks again for lending me your experience, it can, and is, saving me days.

Evan.

________________________________________
From: Bill Moran [wmoran@xxxxxxxxxxxxxxxxx]
Sent: Saturday, June 09, 2012 4:35 PM
To: Evan Rempel
Cc: pgsql-general@xxxxxxxxxxxxxx
Subject: Re:  is there a select for update insert if not exist type command?

On Sat, 9 Jun 2012 15:41:34 -0700 Evan Rempel<erempel@xxxxxxx>  wrote:

I have a project where I will have two clients essentially doing the
same things at the same time. The idea is that if one has already done the
work, then the second one does not need to do it.

I was hoping that adding a task related unique identifier to a table
could be used to coordinate these client, something like a primary key and using
select for update.

The challenge I have is during the initial insert. One of the two clients will cause postgresql
to log an error, which I would rather avoid (just seems dirty).

Here is the time line;

Both clients A and B becomes aware to do a task

Client A or client B issues the "select for update ... if not exist do insert" type command
The other client gets blocked on the "select for update.

First client finishes insert/updates to record that it has delt with the task

second client gets unblocked and reads the record realizing that the first client delt with the task already.

It is the "select for update ... if not exist do insert" type command that I am ignorant of how to code.

Anyone care to school me?
It's amazing to me how often I have this conversation ...

How would you expect SELECT FOR UPDATE to work when you're checking to see
if you can insert a row?  If the row doesn't exist, there's nothing to
lock against, and thus it doesn't help anything.  FOR UPDATE is only
useful if you're UPDATING a row.

That being given, there are a number of ways to solve your problem.  Which
one you use depends on a number of factors.

If it's x number of processes all contending for one piece of work, you could
just exclusive lock the entire table, and do the check/insert with the
table locked.  This essentially creates a wait queue.

If the processes need to coordinate around doing several pieces of work, you
can put a row in for each piece of work with a boolean field indicating
whether a process is currently working on it.  Then you can SELECT FOR
UPDATE a particular row representing work to be done, and if the boolean
isn't already true, set it to true and start working.  In my experience,
you'll benefit from going a few steps forward and storing some information
about what's being done on it (like the PID of the process working on it,
and the time it started processing) -- it just makes problems easier to
debug later.

There are other approaches as well, but those are the two that come to
mind.

Not sure what your experience level is, but I'll point out that these
kinds of things only work well if you're transaction management is
correct.  I have seen people struggle to get these kind of things working
because they didn't really understand how transactions and locking interact,
or they were using some sort of abstraction layer that does transaction
stuff in such an opaque way that they couldn't figure out what was actually
happening.

Hope this helps.

--
Bill Moran<wmoran@xxxxxxxxxxxxxxxxx>

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general