On Tue, Jul 23, 2013 at 1:23 AM, Amit Langote <amitlangote09@xxxxxxxxx> wrote:
> On Tue, Jul 23, 2013 at 1:11 PM, Amit Langote <amitlangote09@xxxxxxxxx> wrote:
>> Hello,
>>
>> While studying the effect of maintenance_work_mem on the time taken
>> by CREATE INDEX, I observed that for values of maintenance_work_mem
>> below the value at which an internal sort is performed, the time
>> taken by CREATE INDEX increases as maintenance_work_mem increases.
>>
>> My guess is that for all those values an external sort is chosen at
>> some point, and the larger the value of maintenance_work_mem, the
>> later the switch to external sort is made, causing CREATE INDEX to
>> take longer. That is, a smaller value of maintenance_work_mem would
>> be preferable when an external sort is going to be performed anyway.
>> Does that make sense?
>>
>
> Upon further investigation, it turns out that the delay in switching
> to an external sort caused by a larger value of maintenance_work_mem
> is small compared to the total time of CREATE INDEX.

If you are using trace_sort to report that, it reports the switch as
happening as soon as it runs out of memory.

At that point, all we have been doing is reading tuples into memory. The
time it takes to do that will depend on maintenance_work_mem, because
that affects how many tuples fit in memory. But all the rest of the
tuples need to be read sooner or later anyway, so pushing more of them
to later doesn't improve things overall; it just shifts the timing around.

After it reports the switch, it still needs to heapify the existing
in-memory tuples before the tape sort proper can begin. This is where
the true lost opportunities start to arise, as the large heap starts
driving cache misses which would not happen at all under smaller
settings.

Once the existing tuples are heapified, it continues to use the heap,
popping tuples from it to write out to "tape" and pushing newly read
tuples onto it. This also suffers lost opportunities.

Once all the tuples are written out and it starts merging, the large
maintenance_work_mem is no longer a penalty, as the new heap is limited
by the number of tapes, which is almost always much smaller. In fact,
this stage will actually be faster, but not by enough to make up for
the earlier slowdown.

So it is not surprising that the time before the switch is reported is
a small part of the overall time difference.

Cheers,

Jeff
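
For reference, one way to watch this from psql is sketched below. The
table name t, the column x, and the row count are just placeholders,
and trace_sort is only available if the server was built with
TRACE_SORT defined (which is the default); timings on any real table
will of course differ:

    -- throwaway test table with a few million random integers
    CREATE TABLE t AS
        SELECT (random() * 1000000)::int AS x
        FROM generate_series(1, 5000000);

    SET trace_sort = on;             -- log when the sort switches to external mode
    SET client_min_messages = log;   -- show those LOG lines in this session
    \timing on

    SET maintenance_work_mem = '16MB';
    CREATE INDEX t_x_small ON t (x);
    DROP INDEX t_x_small;

    SET maintenance_work_mem = '256MB';
    CREATE INDEX t_x_large ON t (x);

Comparing the \timing output and the trace_sort LOG lines for the two
runs shows how much of the difference falls before versus after the
reported switch to external sort.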