Reid Thompson wrote:
Each server process claims a jobq record by selecting for update a
jobq record where the pid column is null, then rewrites the record with
the pid set in the pid column.
The "distilled" sql select statement is:
* SELECT J.*, C.name, C.client_id, C.priority
* FROM jobq J, campaign C
* WHERE J.pid IS NULL
* AND 'my_YYYYMMDDhhmmss'>=J.due_date
* AND J.campaign_id=C.id
* ORDER BY C.priority,J.due_date,J.item_id
* LIMIT 1 FOR UPDATE;
What we are seeing is:
- We dump hundreds of records into the jobq table, expecting that a
server process will each claim a record (and process it), but some of
the processes call the claim function but get nothing returned by the
select; they sleep for a short time (eg 10 seconds - this sleep
differs on a per process basis) then attempt to claim again.
Sometimes this claim returns a jobq record, other times it may take 4
or 5 attempted claims before a record is actually returned by the
select.
Our expectation was that with hundreds of selectable records available,
that each server process would 'get' the next available record --
Is this a false expectation on our part?
Is the currently 'active' select for update limit 1 blocking the others?
If not, can someone give us some direction..??
The problem is that the order of events isn't quite what you expect.
You can think of the SELECT ... LIMIT 1 as identifying a set containing
one row. Since it's FOR UPDATE, the backend tries to get a lock, and if
necessary waits until it can. Of course, all the other processors are
doing the same thing.
When another process updates that row, the backend re-checks the
conditions and you end up with a set of zero rows, which it helpfully
gains a lock on and then returns to you!
So - what to do? Well, there are several options.
1. Keep the lock/update period short by just marking the row as
"claimed" and committing. You still need to loop trying to get a job to
process, but the process is smoother because the possible window for
conflicts is short.
2. Select "random" jobs from the list if you don't care about order.
Again, you're just making things smoother, not removing the requirement
to loop checking.
3. Have a "master" process that allocates jobs to processors. Or
simulate such a thing by having processors call an update_jobs()
function. That way you can pre-allocate jobs to processors (at perhaps a
cost in efficiency if the number of jobs in-queue falls below the number
of processors).
--
Richard Huxton
Archonet Ltd