Hello all,
I am in the process of migrating a system from Postgresql 8.3 to 9.0, and
have run into a problem with the task queue systems I am using.
The task queue controls the allocation of tasks between about 1000
processes working in parallel, and is essentially a table of
record_id (unique)
project_id
task_description_id
state (idle, started, finished)
Each project currently have about 2 million entries. My plan is to
increase that significantly the next few months.
To avoid having the processes trample each other's queries (the first
attempt was to select the first matching entries of the table, which
caused one to block all other transactions), one of the steps I took was
to select a set of idle rows at a random offset into the table from the
project, mark them for update, then update each record's state as started.
SELECT record_id FROM queue WHERE project_id = my_project AND state =
idle LIMIT n OFFSET i FOR UPDATE
At present "n" is 100-150, "i" is a random value in the range 0-10000.
There is, intentionally, no ordering specified, since that would just slow
down the query, and is not necessary.
For reference, the above query is sent through Django's cursor.execute()
call in a manual transaction block.
What I've discovered when using Postgres 9.0 is that the processes are now
blocking every other query into this table, apparently reducing the task
processing speed by at least a factor of 10, and increasing the load on
the server by a similar factor, compared to when Postgres 8.3 was used.
The problem is apparent just after starting, with only 50-100 processes
active (startup is staggered).
Reducing "n" (and looping), or increasing the "i" range did not work.
The reason seems to be this new part of
http://www.postgresql.org/docs/9.0/static/sql-select.html (towards the end
of the FOR UPDATE section):
If a LIMIT is used, locking stops once enough rows have been returned
to satisfy the limit
(but note that rows skipped over by OFFSET will get locked). Similarly,
if FOR UPDATE or
FOR SHARE is used in a cursor's query, only rows actually fetched or
stepped past by the
cursor will be locked.
I can't find similar text in the 8.3 or 8.4 documentation.
AFAICT, and assuming I have not misunderstood this part of the
documentation this means that if one of my processing nodes selects a
block of 100 entries at offset 8000 in the resulting table, then every
other node will be blocked while the block is being processed, not just
the nodes that would have selected the rows in the range 0 to 7999, but
also >=8100, because they cannot gain access to the rows.
Also, using FOR SHARE does not seem to solve the problem.
IMO, as a database non-expert, locking rows that were not returned as a
result of the query is a bug. As an example, if a query selects the X last
items in the matching rows, that is equivalent to locking the table, or
the relevant part of it, even if the requester have no intention to modify
those other rows.
Is there any way to avoid this problem? Or do I have to add a random
batch_id field to the queue table in order to separate the processes'
queries so that they do not block each other (as frequently)?
Is it possible to disable the source code causing this (that is, reverting
the patch that introduced the problem, or changing a configuration switch)?
--
Sincerely,
Yngve N. Pettersen
********************************************************************
Senior Developer Email: yngve@xxxxxxxxx
Opera Software ASA http://www.opera.com/
Phone: +47 23 69 32 60 Fax: +47 23 69 24 01
********************************************************************
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general