Matthew Wakeling wrote:
> On Thu, 15 May 2008, Heikki Linnakangas wrote:
>>> Is it really safe to update the hint bits in place? If there is a
>>> power cut in the middle of writing a block, is there a guarantee from
>>> the disc that the block will never be garbled?
>> Don't know, to be honest. We've never seen any reports of corrupted
>> data that would suggest such a problem, but it doesn't seem impossible
>> to me that some exotic storage system might do that.
> Hmm. That problem is what WAL full-page-writes is meant to handle, isn't
> it? So basically, if you're telling people that WAL full-page-writes is
> safer than partial WAL, because it avoids updating pages in-place, then
> you shouldn't be updating pages in-place for the hint bits either. You
> can't win!
Full-page writes protect against torn pages, that is, the case where one
half of an update hits the disk but the other half doesn't. In
particular, if the beginning of the page, where the WAL pointer
(XLogRecPtr) is stored, is flushed to disk but the actual changes
elsewhere in the page aren't, you're in trouble: WAL replay will look at
the WAL pointer, conclude that the page doesn't need to be replayed, and
skip it, even though the other half of the update is still missing.
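
The replay rule boils down to something like this (a simplified sketch,
not the actual code; apply_change() is a made-up stand-in for the real
redo routines):

typedef unsigned long long XLogRecPtr;  /* simplified; the real one is
                                         * a two-field struct compared
                                         * with the XLByteLE macro */

typedef struct
{
    XLogRecPtr  lsn;    /* WAL position of the last change to this page */
    char        data[8192 - sizeof(XLogRecPtr)];
} Page;

extern void apply_change(Page *page, const void *record);  /* hypothetical */

void
replay_record(Page *page, XLogRecPtr record_lsn, const void *record)
{
    /*
     * If the page's LSN is already past this record, the change made it
     * to disk before the crash, so skip it.  The torn-page hazard: if
     * the sector holding the LSN hit the disk but the rest of the page
     * didn't, we skip a change that is in fact missing.  That is exactly
     * what full-page writes guard against.
     */
    if (page->lsn >= record_lsn)
        return;

    apply_change(page, record);
    page->lsn = record_lsn;
}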
Hint bits are different. We're only updating a single bit, and from a
correctness point of view it doesn't matter whether the hint bit update
hits the disk or not. What would spell trouble, though, is if the disk
controller (or whatever else sits in the I/O path) garbled the whole
sector while doing the update, i.e. changed something other than the one
bit we flipped.
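
In other words, the whole update is no more than this (a simplified
sketch of the idea, not the real tqual.c code):

typedef unsigned int TransactionId;

#define HEAP_XMIN_COMMITTED  0x0100   /* same value as in htup.h */

typedef struct
{
    TransactionId   t_xmin;       /* inserting transaction */
    unsigned short  t_infomask;   /* hint bits live here */
    /* ... rest of the tuple header ... */
} HeapTupleHeaderData;

extern int TransactionIdDidCommit(TransactionId xid);  /* clog lookup */

void
set_commit_hint(HeapTupleHeaderData *tuple)
{
    /*
     * Cache the clog verdict in the tuple header so later readers can
     * skip the clog lookup.  No WAL is written for this: if the bit is
     * lost in a crash, the next reader simply asks clog again.
     */
    if (TransactionIdDidCommit(tuple->t_xmin))
        tuple->t_infomask |= HEAP_XMIN_COMMITTED;

    /* (The real code also marks the buffer dirty so the bit may be
     * written out opportunistically.) */
}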
>>> In fact, if the tuple's creating transaction has aborted, then the
>>> tuple can be vacuumed right there and then before it is even written.
>> Not if you have any indexes on the table. To vacuum, you'll have to
>> scan all indexes to remove pointers to the tuple.
> Ah. Well, would that be so expensive? After all, someone has to do it
> eventually, and these are index entries that have only just been added
> anyway.
Scanning all indexes? Depends on your table, of course, but yes, it
would be expensive in general.
> An alternative would be to build a "list of changes" in the WAL without
> actually changing the underlying index at all. When reading the index,
> you would read the "list" first (which would be in memory, in an
> efficient-to-search structure), then read the original index and merge
> the two. Then, when checkpointing, you would vet all the changes
> against known aborted transactions before applying them to the index
> together. This is likely to speed up index writes quite a bit, and also
> lets you effectively vacuum aborted tuples before they get written to
> the disc.
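If I understand the idea correctly, it amounts to something like this (a
purely hypothetical sketch; none of these structures or functions exist
in PostgreSQL):

typedef unsigned int TransactionId;

typedef struct PendingInsert
{
    TransactionId          xid;       /* inserting transaction */
    long                   key;       /* simplified index key */
    long                   heap_tid;  /* simplified heap pointer */
    struct PendingInsert  *next;
} PendingInsert;

static PendingInsert *pending = NULL;   /* would need locking, of course */

extern int  xid_did_abort(TransactionId xid);         /* clog lookup */
extern void index_insert(long key, long heap_tid);    /* the real index */

/*
 * At checkpoint: write surviving entries into the index, silently drop
 * the ones whose transaction aborted -- those are "vacuumed" before
 * they ever reach the disc.
 */
void
flush_pending(void)
{
    PendingInsert *p;

    for (p = pending; p != NULL; p = p->next)
    {
        if (!xid_did_abort(p->xid))
            index_insert(p->key, p->heap_tid);
    }
    pending = NULL;     /* entries leaked here for brevity */
}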
There's not much point optimizing something that only helps with aborted
transactions.
The general problem with any idea that involves keeping a list of
changes made in a transaction is that the list will grow large during
bulk loads, so you have to either spill it to disk or abandon the list
approach, which means it won't help with bulk loads.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com