Re: I/O on select count(*)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Matthew Wakeling wrote:
On Thu, 15 May 2008, Heikki Linnakangas wrote:
> Is it really safe to update the hint bits in place? If there is a > power cut in the middle of writing a block, is there a guarantee from > the disc that the block will never be garbled?

Don't know, to be honest. We've never seen any reports of corrupted data that would suggest such a problem, but it doesn't seem impossible to me that some exotic storage system might do that.

Hmm. That problem is what WAL full-page-writes is meant to handle, isn't it? So basically, if you're telling people that WAL full-page-writes is safer than partial WAL, because it avoids updating pages in-place, then you shouldn't be updating pages in-place for the hint bits either. You can't win!

Full-page-writes protect from torn pages, that is, when one half of an update hits the disk but the other one doesn't. In particular, if the beginning of the page where the WAL pointer (XLogRecPtr) is flushed to disk, but the actual changes elsewhere in the page aren't, you're in trouble. WAL replay will look at the WAL pointer, and think that the page doesn't need to be replayed, while other half of the update is still missing.

Hint bits are different. We're only updating a single bit, and it doesn't matter from correctness point of view whether the hint bit update hits the disk or not. But what would spell trouble is if the disk controller/whatever garbles the whole sector, IOW changes something else than the changed bit, while doing the update.

In fact, if the tuple's creating transaction has aborted, then the tuple can be vacuumed right there and then before it is even written.

Not if you have any indexes on the table. To vacuum, you'll have to scan all indexes to remove pointers to the tuple.

Ah. Well, would that be so expensive? After all, someone has to do it eventually, and these are index entries that have only just been added anyway.

Scanning all indexes? Depends on your table of course, but yes it would be expensive in general.

An alternative would be to build a "list of changes" in the WAL without actually changing the underlying index at all. When reading the index, you would read the "list" first (which would be in memory, and in an efficient-to-search structure), then read the original index and add the two. Then when checkpointing, vet all the changes against known aborted transactions before making all the changes to the index together. This is likely to speed up index writes quite a bit, and also allow you to effectively vacuum aborted tuples before they get written to the disc.

There's not much point optimizing something that only helps with aborted transactions.

The general problem with any idea that involves keeping a list of changes made in a transaction is that that list will grow big during bulk loads, so you'll have to overflow to disk or abandon the list approach. Which means that it won't help with bulk loads.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux