Re: I/O on select count(*)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, 14 May 2008, Alvaro Herrera wrote:
Hint bits are used to mark tuples as created and/or deleted by
transactions that are know committed or aborted.  To determine the
visibility of a tuple without such bits set, you need to consult pg_clog
and possibly pg_subtrans, so it is an expensive check.

So, as I understand it, Postgres works like this:

1. You begin a transaction. Postgres writes an entry into pg_clog.
2. You write some tuples. Postgres writes them to the WAL, but doesn't
    bother fsyncing.
3. At some point, the bgwriter or a checkpoint may write the tuples to the
    database tables, and fsync the lot.
4. You commit the transaction. Postgres alters pg_clog again, writes that
    to the WAL, and fsyncs the WAL.
5. If the tuples hadn't already made it to the database tables, then a
    checkpoint or bgwriter will do it later on, and fsync the lot.
6. You read the tuples. Postgres reads them from the database table, looks
    in pg_clog, notices that the transaction has been committed, and
    writes the tuples to the database table again with the hint bits set.
    This write is not WAL protected, and is not fsynced.

This seems like a good architecture, with some cool characteristics, mainly that at no point does Postgres have to hold vast quantities of data in memory. I have two questions though:

Is it really safe to update the hint bits in place? If there is a power cut in the middle of writing a block, is there a guarantee from the disc that the block will never be garbled?

Is there a way to make a shortcut and have the hint bits written the first time the data is written to the table? One piece of obvious low-hanging fruit would be to enhance step five above, so that the bgwriter or checkpoint that writes the data to the database table checks the pg_clog and writes the correct hint bits. In fact, if the tuple's creating transaction has aborted, then the tuple can be vacuumed right there and then before it is even written. For OLTP, almost all the hint bits will be written first time, and also the set of transactions that will be looked up in the pg_clog will be small (the set of transactions that were active since the last checkpoint), so its cache coherency will be good.

However, this idea does not deal well with bulk data loads, where the data is checkpointed before transaction is committed or aborted.

Matthew

--
Now, you would have thought these coefficients would be integers, given that
we're working out integer results. Using a fraction would seem really
stupid. Well, I'm quite willing to be stupid here - in fact, I'm going to
use complex numbers.                    -- Computer Science Lecturer


[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux