On Thu, 15 May 2008, Heikki Linnakangas wrote:
> Is it really safe to update the hint bits in place? If there is a
> power cut in the middle of writing a block, is there a guarantee from
> the disc that the block will never be garbled?
Don't know, to be honest. We've never seen any reports of corrupted data
that would suggest such a problem, but it doesn't seem impossible to me
that some exotic storage system might do that.
Hmm. That problem is what WAL full-page-writes is meant to handle, isn't
it? So basically, if you're telling people that WAL full-page-writes is
safer than partial WAL, because it avoids updating pages in-place, then
you shouldn't be updating pages in-place for the hint bits either. You
can't win!
In fact, if the tuple's creating transaction has aborted, then the tuple
can be vacuumed right there and then before it is even written.
Not if you have any indexes on the table. To vacuum, you'll have to scan all
indexes to remove pointers to the tuple.
Ah. Well, would that be so expensive? After all, someone has to do it
eventually, and these are index entries that have only just been added
anyway.
I can understand index updating being a bit messy in the middle of a
checkpoint though, as you would have to write the update to the WAL, which
you are checkpointing...
So, I don't know exactly how the WAL updates to indexes work, but my guess
is that it has been implemented as "write the blocks that we would change
to the WAL". The problem with this is that all the changes to the index
are done individually, so there's no easy way to "undo" one of them later
on when you find out that the transaction has been aborted during the
checkpoint.
An alternative would be to build a "list of changes" in the WAL without
actually changing the underlying index at all. When reading the index, you
would read the "list" first (which would be in memory, and in an
efficient-to-search structure), then read the original index and add the
two. Then when checkpointing, vet all the changes against known aborted
transactions before making all the changes to the index together. This is
likely to speed up index writes quite a bit, and also allow you to
effectively vacuum aborted tuples before they get written to the disc.
Matthew
--
Vacuums are nothings. We only mention them to let them know we know
they're there.