On Tue, Jul 23, 2013 at 1:23 AM, Amit Langote <amitlangote09@xxxxxxxxx> wrote:
> On Tue, Jul 23, 2013 at 1:11 PM, Amit Langote <amitlangote09@xxxxxxxxx> wrote:
>> Hello,
>>
>> While studying the effect of maintenance_work_mem on the time taken
>> by CREATE INDEX, I observed that for values of maintenance_work_mem
>> below the value at which an internal sort is performed, the time
>> taken by CREATE INDEX increases as maintenance_work_mem increases.
>>
>> My guess is that for all those values an external sort is chosen at
>> some point, and the larger the value of maintenance_work_mem, the
>> later the switch to external sort is made, causing CREATE INDEX to
>> take longer. That is, a smaller value of maintenance_work_mem would
>> be preferable when an external sort is going to be performed anyway.
>> Does that make sense?
>>
>
> Upon further investigation, it turns out that the delay in switching
> to an external sort caused by a larger value of maintenance_work_mem
> is small compared to the total time of CREATE INDEX.

If you are using trace_sort to report that, it reports the switch as
happening as soon as it runs out of memory.

At that point, all we have been doing is reading tuples into memory. The
time it takes to do that will depend on maintenance_work_mem, because
that affects how many tuples fit in memory. But all the rest of the
tuples need to be read sooner or later anyway, so pushing more of them
to later doesn't improve things overall; it just shifts the timing around.

After it reports the switch, it still needs to heapify the existing
in-memory tuples before the tape sort proper can begin. This is where
the true lost opportunities start to arise, as the large heap starts
driving cache misses which would not happen at all under smaller
settings.

Once the existing tuples are heapified, it continues to use the heap,
popping tuples from it to write out to "tape" and pushing newly read
tuples onto it. This also suffers lost opportunities.

Once all the tuples are written out and it starts merging, the large
maintenance_work_mem is no longer a penalty, as the new heap is limited
by the number of tapes, which is almost always much smaller. In fact,
this stage will actually be faster, but not by enough to make up for
the earlier slowdown.

So it is not surprising that the time before the switch is reported is
a small part of the overall time difference.

Cheers,

Jeff
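
For reference, one way to watch this from psql is sketched below. The
table name t, the column x, and the row count are just placeholders,
and trace_sort is only available if the server was built with
TRACE_SORT defined (which is the default); timings on any real table
will of course differ:

    -- throwaway test table with a few million random integers
    CREATE TABLE t AS
        SELECT (random() * 1000000)::int AS x
        FROM generate_series(1, 5000000);

    SET trace_sort = on;             -- log when the sort switches to external mode
    SET client_min_messages = log;   -- show those LOG lines in this session
    \timing on

    SET maintenance_work_mem = '16MB';
    CREATE INDEX t_x_small ON t (x);
    DROP INDEX t_x_small;

    SET maintenance_work_mem = '256MB';
    CREATE INDEX t_x_large ON t (x);

Comparing the \timing output and the trace_sort LOG lines for the two
runs shows how much of the difference falls before versus after the
reported switch to external sort.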