
Re: queries on xmin


 



I've been working with Matt on this.  Thanks for the suggestions.

Greg Stark wrote:
> On Thu, Jun 11, 2009 at 11:25 AM, Matt Amos <zerebubuth@xxxxxxxxx> wrote:
>> what problems are we going to create for ourselves if we create a
>> btree index on xmin casted to int4? would it be as efficient to use a
>> hash index, create a temporary table of txids that we're querying with
>> a hash index and do an explicit join? have i missed the point
>> entirely?

> Wow, I'm quite shocked that we don't already detect attempts to create
> an index on xmin or xmax. There's no way that'll work properly since
> those fields can change spontaneously when, for example, vacuum runs or,
> for xmax, when things like UPDATE or SELECT FOR SHARE or SELECT FOR
> UPDATE are used.
I had just assumed that the index would be updated along with the xmin value. Are you saying that the index would remain set to the original value? I think I read that xmin only gets frozen after 100 million transactions (by default), which would be okay for our needs; we'd have long ago replicated the changes by then. If we really do need to re-generate replication files after that long, we could just do it using a timestamp.
> Incidentally the reason there's no btree opclass is because xids don't
> monotonically increase. They wrap around. So periodically you would
> lose updates or see them repeatedly whenever the xid wrapped around
> and the old transactions appear to be in the future.
Yep, understood. We'd be converting to a signed integer, which means we'd have two special points to deal with: the overflow point from 2^31 - 1 to -2^31, and the wraparound back to 0, with the special xids 0-2 to be ignored. Would it work if we did something along the lines of: (xmin >= startTx AND xmin <= 2^31 - 1) OR (xmin >= -2^31 AND xmin <= finishTx)?
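For concreteness, the wrapped case would look something like this (a sketch only, against a hypothetical changes table, with :start_tx and :finish_tx as placeholder parameters and xid_to_int4 being the conversion function I describe below):

    -- Sketch: rows whose converted xmin falls in a replication window that
    -- straddles the signed int4 overflow point.  The non-wrapped case is a
    -- plain BETWEEN, and the special xids 0-2 would be filtered out as well.
    SELECT *
      FROM changes
     WHERE (xid_to_int4(xmin) >= :start_tx AND xid_to_int4(xmin) <= 2147483647)
        OR (xid_to_int4(xmin) >= -2147483648 AND xid_to_int4(xmin) <= :finish_tx);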

One other thing I should mention is that we'd be adding a new function to the db (e.g. xid_to_int4) that would convert the xid to an int4 and wrap values of 2^31 and above around to negative values, thus avoiding the database raising overflow errors.
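Roughly what I have in mind for that function is the following (again just a sketch: it leans on the xid -> text -> bigint cast, and whether the server will actually let us build, and keep, an index over a system column like xmin is of course exactly what's in question here):

    -- Sketch: wrap a 32-bit xid into signed int4 space so that values of
    -- 2^31 and above come out negative instead of raising an overflow error.
    CREATE OR REPLACE FUNCTION xid_to_int4(x xid) RETURNS int4 AS $$
      SELECT CASE WHEN v > 2147483647 THEN (v - 4294967296)::int4
                  ELSE v::int4
             END
        FROM (SELECT x::text::bigint AS v) AS s;
    $$ LANGUAGE sql IMMUTABLE;

    -- The expression index we'd want to build on top of it:
    CREATE INDEX changes_xmin_int4_idx ON changes (xid_to_int4(xmin));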
> If you never run updates and are never interested in tuples that are
> old enough to be frozen then perhaps you could mostly get away with
> it. But I really don't think it's a good idea.

> Much better would be to store a user-space column with something like
> a txid or a timestamp and use that directly. That way you have control
> over the behaviour of the column.
We already have a timestamp, but with long-running transactions some of the rows don't become visible until after we've replicated that time interval. The existing implementation queries for changes based on timestamp and runs with a 5 minute lag, but we see some transactions taking up to half an hour, which means we miss the data committed as part of those transactions.

When you suggest adding a txid column, are you suggesting something like an int8 column populated from a global, monotonically increasing sequence? That sounds like an elegant approach. The downside is that between an int8 column and its index we'd be talking approximately 16 bytes per row, which when multiplied by 500 million rows (I'm not sure exactly how many there are, but that's the ballpark) comes out at 8GB of additional disk usage. That's approximately 4 times the size of an int4 index on xmin. Disk consumption isn't critical, but it is a consideration.
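If I've understood the suggestion, the shape of it would be roughly this (a sketch with made-up names, using the txid functions that, if I remember right, went in with 8.3; note a DEFAULT only fires on INSERT, so rows that get UPDATEd would need a trigger to refresh the marker):

    -- Sketch: a user-space, monotonically increasing change marker.
    ALTER TABLE changes ADD COLUMN change_txid int8;
    ALTER TABLE changes ALTER COLUMN change_txid SET DEFAULT txid_current();
    CREATE INDEX changes_change_txid_idx ON changes (change_txid);

    -- Replication would then become a plain, index-friendly range scan, e.g.
    --   SELECT * FROM changes
    --    WHERE change_txid >= :prev_horizon
    --      AND change_txid <  txid_snapshot_xmin(txid_current_snapshot());
    -- where :prev_horizon is the upper bound used by the previous run.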
> Another option to consider would be including a boolean column
> "dumped" defaulted to false. Then you could have a partial index on
> the primary key or date "WHERE NOT dumped". Then when you dump you can
> "SELECT FOR UPDATE * WHERE NOT dumped" and then when you're done
> "UPDATE SET dumped = 't'". Alternately you could use "UPDATE SET
> dumped='t' WHERE NOT dumped RETURNING *" which is basically
> equivalent.
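(For reference, my reading of that suggestion, sketched against a hypothetical changes table with primary key id, would be:)

    -- Sketch of the "dumped" flag approach as described above.
    ALTER TABLE changes ADD COLUMN dumped boolean NOT NULL DEFAULT false;
    CREATE INDEX changes_undumped_idx ON changes (id) WHERE NOT dumped;

    -- Each replication run claims and marks rows in a single statement:
    UPDATE changes SET dumped = true WHERE NOT dumped RETURNING *;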
I have a few hesitations with this approach:
1. We can only run the replicator once.
2. We can only run a single replicator.
3. It requires write access to the db.

Point 1 is perhaps the biggest issue. It means that we only get one shot at reading the changes, and if something goes wrong we lose the results; it's nice to be able to re-generate when something goes wrong. We could live with point 2, although it makes it impossible to test new replication mechanisms without adding additional columns for each. Point 3 is also a major consideration: it makes everybody's life easier if we can avoid the replicator making updates to the db.
> That would create twice as much traffic in the table which would make
> vacuums much more important. But it would mean you could quickly access
> undumped records using the index and know that your process doesn't
> depend on following a strict clock schedule.
If we do use range queries then I don't think we have any clock dependencies because we'd remove the timestamp clause from the queries.

I hope I don't sound too negative. My gut also tells me that what we're doing is not the "right" solution and I've had fairly similar arguments with Matt already :-) But having spent some time playing with it I can't find any reason why it won't work, and from a performance point of view I suspect it will win ...

Brett


--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
