Matthew Wakeling wrote:
> On Thu, 15 May 2008, Heikki Linnakangas wrote:
>>> Is it really safe to update the hint bits in place? If there is a
>>> power cut in the middle of writing a block, is there a guarantee from
>>> the disc that the block will never be garbled?
>> Don't know, to be honest. We've never seen any reports of corrupted
>> data that would suggest such a problem, but it doesn't seem impossible
>> to me that some exotic storage system might do that.
> Hmm. That problem is what WAL full-page-writes is meant to handle, isn't
> it? So basically, if you're telling people that WAL full-page-writes is
> safer than partial WAL, because it avoids updating pages in-place, then
> you shouldn't be updating pages in-place for the hint bits either. You
> can't win!
Full-page writes protect against torn pages, that is, the case where one
half of an update hits the disk but the other half doesn't. In
particular, if the beginning of the page, where the WAL pointer
(XLogRecPtr) is stored, is flushed to disk but the actual changes
elsewhere in the page aren't, you're in trouble: WAL replay will look at
the WAL pointer, conclude that the page doesn't need to be replayed, and
skip it, even though the other half of the update is still missing.
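
The replay rule boils down to something like this (a simplified sketch,
not the actual code; apply_change() is a made-up stand-in for the real
redo routines):

typedef unsigned long long XLogRecPtr;  /* simplified; the real one is
                                         * a two-field struct compared
                                         * with the XLByteLE macro */

typedef struct
{
    XLogRecPtr  lsn;    /* WAL position of the last change to this page */
    char        data[8192 - sizeof(XLogRecPtr)];
} Page;

extern void apply_change(Page *page, const void *record);  /* hypothetical */

void
replay_record(Page *page, XLogRecPtr record_lsn, const void *record)
{
    /*
     * If the page's LSN is already past this record, the change made it
     * to disk before the crash, so skip it.  The torn-page hazard: if
     * the sector holding the LSN hit the disk but the rest of the page
     * didn't, we skip a change that is in fact missing.  That is exactly
     * what full-page writes guard against.
     */
    if (page->lsn >= record_lsn)
        return;

    apply_change(page, record);
    page->lsn = record_lsn;
}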
Hint bits are different. We're only updating a single bit, and from a
correctness point of view it doesn't matter whether the hint bit update
hits the disk or not. What would spell trouble, though, is if the disk
controller (or whatever else sits in the I/O path) garbled the whole
sector while doing the update, i.e. changed something other than the one
bit we flipped.
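
In other words, the whole update is no more than this (a simplified
sketch of the idea, not the real tqual.c code):

typedef unsigned int TransactionId;

#define HEAP_XMIN_COMMITTED  0x0100   /* same value as in htup.h */

typedef struct
{
    TransactionId   t_xmin;       /* inserting transaction */
    unsigned short  t_infomask;   /* hint bits live here */
    /* ... rest of the tuple header ... */
} HeapTupleHeaderData;

extern int TransactionIdDidCommit(TransactionId xid);  /* clog lookup */

void
set_commit_hint(HeapTupleHeaderData *tuple)
{
    /*
     * Cache the clog verdict in the tuple header so later readers can
     * skip the clog lookup.  No WAL is written for this: if the bit is
     * lost in a crash, the next reader simply asks clog again.
     */
    if (TransactionIdDidCommit(tuple->t_xmin))
        tuple->t_infomask |= HEAP_XMIN_COMMITTED;

    /* (The real code also marks the buffer dirty so the bit may be
     * written out opportunistically.) */
}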
>>> In fact, if the tuple's creating transaction has aborted, then the
>>> tuple can be vacuumed right there and then before it is even written.
>> Not if you have any indexes on the table. To vacuum, you'll have to
>> scan all indexes to remove pointers to the tuple.
> Ah. Well, would that be so expensive? After all, someone has to do it
> eventually, and these are index entries that have only just been added
> anyway.
Scanning all indexes? Depends on your table, of course, but yes, it
would be expensive in general.
> An alternative would be to build a "list of changes" in the WAL without
> actually changing the underlying index at all. When reading the index,
> you would read the "list" first (which would be in memory, in an
> efficient-to-search structure), then read the original index and merge
> the two. Then, when checkpointing, you would vet all the changes
> against known aborted transactions before applying them to the index
> together. This is likely to speed up index writes quite a bit, and also
> lets you effectively vacuum aborted tuples before they get written to
> the disc.
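If I understand the idea correctly, it amounts to something like this (a
purely hypothetical sketch; none of these structures or functions exist
in PostgreSQL):

typedef unsigned int TransactionId;

typedef struct PendingInsert
{
    TransactionId          xid;       /* inserting transaction */
    long                   key;       /* simplified index key */
    long                   heap_tid;  /* simplified heap pointer */
    struct PendingInsert  *next;
} PendingInsert;

static PendingInsert *pending = NULL;   /* would need locking, of course */

extern int  xid_did_abort(TransactionId xid);         /* clog lookup */
extern void index_insert(long key, long heap_tid);    /* the real index */

/*
 * At checkpoint: write surviving entries into the index, silently drop
 * the ones whose transaction aborted -- those are "vacuumed" before
 * they ever reach the disc.
 */
void
flush_pending(void)
{
    PendingInsert *p;

    for (p = pending; p != NULL; p = p->next)
    {
        if (!xid_did_abort(p->xid))
            index_insert(p->key, p->heap_tid);
    }
    pending = NULL;     /* entries leaked here for brevity */
}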
There's not much point optimizing something that only helps with aborted
transactions.
The general problem with any idea that involves keeping a list of
changes made in a transaction is that the list will grow large during
bulk loads, so you have to either spill it to disk or abandon the list
approach, which means it won't help with bulk loads.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com