Vance Maverick wrote:
> Another approach would be to queue the log entries in a "staging" table,
> so that a single process could move them into the log. This is fairly
> heavyweight, but it would guarantee the consistent sequencing of the log
> as seen by a reader (even if the order of entries in the log didn't
> always reflect the true commit sequence in the staging table).
The way I see it, one way or another you are going to have to serialize
writers; otherwise you'll always be faced with commit-order races.
I do have one other idea, but it's not pretty. Write a C/Python/whatever
procedural function, say send_log_record(....), that uses an appropriate
inter-process communication mechanism to send the log message to another
process. Your writers use this function to generate log records. The
outside process receiving log records has a single connection to the DB
open and it is the only writer to the log table, thus avoiding the
problems with commit races with multiple writers. Your C function
bypasses transactional isolation by communicating with another process
that modifies the DB, thereby eliminating the need to hold the whole
transaction up to ensure predictable log write ordering.
However, it *does* mean that you'll get a log entry even if the
transaction then aborts. You might be able to get around that by doing
your logging with a deferred trigger, but there's always a risk that a
later deferred trigger will fail and abort the transaction.
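A rough sketch of what that function might look like in PL/PythonU,
assuming an external log-writer daemon listening on a Unix datagram
socket (the socket path and message format are invented; untested):

CREATE OR REPLACE FUNCTION send_log_record(text) RETURNS void AS $$
    import socket
    s = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    # Sent immediately, outside transactional isolation, so the
    # record goes out even if the calling transaction later aborts.
    s.sendto(args[0], '/var/run/log-writer.sock')
    s.close()
$$ LANGUAGE plpythonu VOLATILE;

Writers just SELECT send_log_record('...'), and the daemon - holding
the only connection that ever writes to the log table - does the
actual INSERTs in whatever order the messages arrive.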
Doing it via a staging table is a *lot* nicer, and a lot lighter
weight, than having your logging code force serialization of
transactions that could otherwise (log ordering aside) run
concurrently - say, by using an ID generation table that each
transaction locks to ensure IDs are generated and committed in order.
It's also properly transactional, so you won't get any log records for
aborted transactions.
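For concreteness, the staging variant might look something like this
(all names are invented):

CREATE TABLE log_staging (
    id  serial PRIMARY KEY,  -- arrival order, not strict commit order
    msg text NOT NULL
);

-- Writers log from inside their normal transactions:
INSERT INTO log_staging (msg) VALUES ('did some work');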
It's a pity PostgreSQL's RETURNING extension doesn't appear to support
INSERT INTO ... DELETE FROM ... RETURNING
because that'd make your log record mover a rather efficient one-liner.
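In its absence, the single mover can do it in two statements inside
one transaction - something like this, using the hypothetical tables
above (untested):

BEGIN ISOLATION LEVEL SERIALIZABLE;
-- Under SERIALIZABLE both statements see the same snapshot, so the
-- DELETE can't remove staging rows committed after the INSERT ran.
INSERT INTO logtable (msg)
    SELECT msg FROM log_staging ORDER BY id;
DELETE FROM log_staging;
COMMIT;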
The only other alternative I can think of is to have multiple writers
inserting records into the same table the readers are reading from, but
have the writers insert records with a timestamp field (say `visible')
set to null.
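That is, writers would do something like (column names invented):

INSERT INTO logtable (msg, visible)
VALUES ('something happened', NULL);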
A single helper (there must only ever be one) can then repeatedly run a
command sequence like:
BEGIN;
UPDATE logtable
   SET visible = current_timestamp
 WHERE visible IS NULL;
COMMIT;
to ensure that records become visible in timestamp order. No race
here; as in the staging table approach, you're using a single write
transaction on the table being used by the readers.
Readers would filter for records with `visible > last_seen_visible',
where last_seen_visible is a literal: the greatest value of `visible'
seen in their previous query for log records. They could then trust
that no records would ever be missed.
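A reader's polling query might then look like (timestamp literal
invented for illustration):

SELECT id, msg, visible
  FROM logtable
 WHERE visible > '2008-05-01 10:23:54.123456'  -- last_seen_visible
 ORDER BY visible;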
Unfortunately, with this approach you're incurring the cost of a dead
row for every UPDATE. You can avoid that with 8.3 (using HOT) only if
you have no index on `visible' - but having no index means a sequential
scan of the log table for every UPDATE making rows visible and for every
SELECT looking for the latest rows. Ouch.
That's basically just another way to write your log staging approach,
anyway. It stays within a single table but it's not otherwise much
different. I haven't any idea whether it'd perform better or worse than
using a separate log staging table.
If you really want to make somebody cry, I guess you could do it with
dblink - connect back to your own database from dblink and use a short
transaction to commit a log record, using table-based (rather than
sequence-based) ID generation to ensure that records were inserted in
ID order. That'd restrict the "critical section" in which your various
transactions were unable to run concurrently to a much shorter period,
but would result in a log message being saved even if the transaction
later aborted. It'd also be eye-bleedingly horrible, to the point where
even the "send a message from a C function" approach would be nicer.
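For the morbidly curious, it might look something like this (the
connection string, table names, and multi-statement dblink_exec call
are all assumptions; untested):

-- One-off setup: a second connection back into our own database.
SELECT dblink_connect('logconn', 'dbname=mydb');

-- Then, in each transaction that wants to log:
SELECT dblink_exec('logconn',
    'BEGIN;
     UPDATE log_id_gen SET last_id = last_id + 1;
     INSERT INTO logtable (id, msg)
         SELECT last_id, ''something happened'' FROM log_id_gen;
     COMMIT;');

The row lock on log_id_gen is only held for the short remote
transaction rather than for the life of the main transaction - but,
as noted, the record survives even if the main transaction aborts.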
--
Craig Ringer