Re: pg_xlog becomes extremely large during CREATE INDEX

Tom Lane <tgl@sss.pgh.pa.us> · Sat, 15 May 2004 12:44:27 -0400

Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> On Sat, May 15, 2004 at 12:23:18AM -0400, Tom Lane wrote:
>> ... It tries to write all the levels of the btree in
>> parallel, so it is spitting out level-zero pages at a great rate,
>> level-one pages at a lesser rate, etc.  For a large index there could
>> be many btree levels, and pages in the higher levels will be held locked
>> in the shared buffer arena for considerable periods.

> I'm not sure how btree from-scratch-building work, but could it be made
> so that it first build level 0 completely, scanning the heap; then build
> level 1 scanning the level 0, and so on?

I don't think that would be a win; it would require an extra read pass
over nearly all of the index.  The idea here is to reduce the amount of
I/O, not increase it.

> I also wonder why index creation would write XLog information;

Yeah, that was implicit in my comment about not using the buffer
manager, and it was why I was concerned about being able to fsync
the index file.  Without xlog for backup, you'd better be able to
force the index to disk before you commit.

It turns out that btree index build currently writes each page three
times: once to extend the file, once into WAL, and then finally to write
the completed index page.  It is difficult to do any better as long as
you're using the shared buffer manager.  (When extending a file, the
buffer manager wants to allocate disk space before it will give you a
buffer to write into.  This is the right thing for heap insertions and
extending an existing index, because we don't force buffers to disk
before commit.  So if it didn't work that way, you might commit your
transaction before finding out there's no disk space to hold the data.
But we could dispense with that worry for index build if we ensure that
we write and fsync all the pages before commit.)  So I was thinking
about keeping the same index build process, but working in private
storage instead of shared buffers, and writing direct to smgr.  This
would not only eliminate the locking conflict against checkpoints,
but reduce the amount of I/O involved quite a lot.

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)