I've been working with Matt on this. Thanks for the suggestions.
Greg Stark wrote:
On Thu, Jun 11, 2009 at 11:25 AM, Matt Amos<zerebubuth@xxxxxxxxx> wrote:
what problems are we going to create for ourselves if we create a
btree index on xmin casted to int4? would it be as efficient to use a
hash index, create a temporary table of txids that we're querying with
a hash index and do an explicit join? have i missed the point
entirely?
Wow, I'm quite shocked that we don't already detect attempts to create
an index on xmin or xmax. There's no way that'll work properly since
those fields can change spontaneously when, for example vacuum runs or
for xmax when things like UPDATE or SELET FOR SHARE or SELECT FOR
UPDATE are used.
I had just assumed that the index would be updated along with the xmin
value. Are you saying that the index would remain set to the original
value? I think I read that it only gets set after 100 million
transactions (by default) which would be okay for our needs, we'd have
long ago replicated the changes by then. If we really do need to
re-generate replication files after that long we could just do it using
a timestamp.
Incidentally the reason there's no btree opclass is because xids don't
monotonically increase. They wrap around. So periodically you would
lose updates or see them repeatedly whenever the xid wrapped around
and the old transactions appear to be in the future.
Yep, understood. We'd be converting to a signed integer which means
we'd have two special points to deal with, the overflow point from 2^31
to -2^31, and the wraparound back to 0 with 0-2 to be ignored. Would it
work if we did something along the lines of:
(xmin >= startX AND xmin <= (2^31 - 1) OR (xmin >= -2^31 AND xmin <=
finishTx)
One other thing I should mention is that we'd be adding a new function
to the db (eg. xid_to_int4) that would convert the xid to an int4 and
wrap values greater than 2^31 around to negative values thus avoiding
the database raising overflow errors.
If you never run updates and are never interested in tuples that are
old enough to be frozen then perhaps you could mostly get away with
it. But I really don't think it's a good idea.
Much better would be to store a user-space column with somethin like
txid or a timestamp and use that directly. That way you have control
over the behaviour of the column.
We already have a timestamp but with long running transactions some of
the rows don't become visible until after we've replicated that time
interval. The existing implementation queries for changes based on
timestamp and runs with a 5 minute lag, but we see some transactions
taking up to half an hour which means we miss the data committed as part
of those transactions.
When you suggest to add a txid column, are you suggesting to have
something like an int8 column populated from a global monotonically
increasing sequence? That sounds like an elegant approach. The
downside is that between an int8 column and the index we'd be talking
approximately 16 bytes per row which when multipled by 500 million rows
(I'm not sure exactly how many there are but that's ballpark) comes out
at 8GB of additional disk usage. It's approximately 4 times the size of
an int4 index on xmin. Disk consumption isn't critical, but it is a
consideration.
Another option to consider would be including a boolean column
"dumped" defaulted to false. Then you could have a partial index on
the primary key or date "WHERE NOT dumped". Then when you dump you can
"SELECT FOR UPDATE * WHERE NOT dumped" and then when you're done
"UPDATE SET dumped = 't' ". Alternately you could use "UPDATE SET
dumped='t' WHERE NOT dumped RETURNING *" which is basically
equivalent.
I have a couple of hesitations with using this approach:
1. We can only run the replicator once.
2. We can only run a single replicator.
3. It requires write access to the db.
1 is perhaps the biggest issue. It means that we only get one shot at
reading changes, and if something goes wrong we lose the results. It's
nice being able to re-generate when something goes wrong.
We could live with 2, although it makes it impossible to test new
replication mechanisms without adding additional columns for each.
3 is also a major consideration, it makes everybody's life easier if we
can avoid updates being made to the db by the replicator.
That would create twice as much traffic in the table which would make
vacuums much more important. But it would mean you could quickly acces
undumped records using the index and know that your process doesn't
depend on a following a strict clock schedule.
If we do use range queries then I don't think we have any clock
dependencies because we'd remove the timestamp clause from the queries.
I hope I don't sound too negative. My gut also tells me that what we're
doing is not the "right" solution and I've had fairly similar arguments
with Matt already :-) But having spent some time playing with it I
can't find any reason why it won't work, and from a performance point of
view I suspect it will win ...
Brett
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general