On Tue, Nov 20, 2012 at 12:00 PM, Merlin Moncure <mmoncure@xxxxxxxxx> wrote:
> On Tue, Nov 20, 2012 at 12:16 PM, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:
>>
>> The freelist should never loop.  It is written as a loop, but I think
>> there is currently no code path which ends up with valid buffers being
>> on the freelist, so that loop will never, or at least rarely, execute
>> more than once.
>>
>>> Both of those operations are
>>> dependent on the number of buffers being managed and so it's
>>> reasonable to expect some workloads to increase contention with more
>>> buffers.
>>
>> The clock sweep can depend on the number of buffers being managed in a
>> worst-case sense, but I've never seen any evidence (nor analysis) that
>> this worst case can be obtained in reality on an ongoing basis.  By
>> constructing two pathological workloads which get switched between, I
>> can get the worst case to happen, but when it does happen the
>> consequences are mild compared to the amount of time needed to set up
>> the necessary transition.  In other words, the worst case can't be
>> triggered often enough to make a meaningful difference.
>
> Yeah, good points; but (getting off topic here): there have been
> several documented cases of lowering shared buffers improving
> performance under contention...the 'worst case' might be happening
> more than expected.

The ones that I am aware of (mostly Greg Smith's case studies) have
been for write-intensive workloads and are related to writes/fsyncs
getting gummed up.

Shaun Thomas reports one that is (I assume) not read intensive, but
his diagnosis is that this is a kernel bug where a larger
shared_buffers for no good reason causes the kernel to kill off its
page cache.  From the kernel's perspective, the freelist lock doesn't
look any different from any other lwlock, so I doubt that issue is
related to the freelist lock.

> In particular, what happens when a substantial
> percentage of the buffer pool is set with a non-zero usage count?

The current clock sweep algorithm is an extraordinary usagecount
decrementing machine.  From what I know, the only way to get much more
than half of the buffers to have a non-zero usage count is for the
clock sweep to rarely run (in which case it can hardly be the
bottleneck), or for most of the buffer cache to be pinned
simultaneously.

> This seems unlikely, but possible?  Take note:
>
>     if (buf->refcount == 0)
>     {
>         if (buf->usage_count > 0)
>         {
>             buf->usage_count--;
>             trycounter = NBuffers; /* emphasis */
>         }
>
> ISTM time spent here isn't bounded except that as more time is spent
> sweeping (more backends are thus waiting and not marking pages) the
> usage counts decrease faster until you hit steady state.

But that is a one-time thing.  Once you hit the steady state, how do
you get away from it again, such that a large amount of work is needed
again?

> Smaller
> buffer pool naturally would help in that scenario as your usage counts
> would drop faster.

They would drop at the same rate in absolute numbers, barring the
smaller buffer cache fitting entirely in the on-board CPU cache.  They
would drop faster in percentage terms, but they would also increase
faster in percentage terms once a candidate is found and a new page
read into it.

> It strikes me as cavalier to be resetting
> trycounter while sitting under the #1 known contention point for read
> only workloads.

The only use for the trycounter is to know when to ERROR out with "no
unpinned buffers available", so not resetting that seems entirely
wrong.

I would contest "the #1 known contention point" claim.  We know that
the freelist lock is a point of contention under certain conditions,
but we (or at least I) also know that it is the mere acquisition of
this lock, and not the work done while it is held, that is important.
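
To make the trycounter point above concrete, here is a minimal sketch
of the sweep loop under discussion.  It is a simplified model and not
the actual StrategyGetBuffer code: SimBuffer and sim_clock_sweep are
hypothetical names, and buffer header locking, the freelist, and
BufFreelistLock are all left out.  The sketch returns -1 where the
real code would ERROR out.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for a buffer header. */
typedef struct
{
    int refcount;     /* > 0 means the buffer is pinned */
    int usage_count;  /* decremented each time the sweep passes */
} SimBuffer;

/*
 * Toy clock sweep over nbuffers buffers.
 *
 * *hand is the clock hand (next buffer to examine) and *steps
 * accumulates the number of buffers examined across calls.
 * Returns the index of the victim buffer, or -1 where the real code
 * would raise "no unpinned buffers available".
 */
static int
sim_clock_sweep(SimBuffer *bufs, int nbuffers, int *hand, long *steps)
{
    int trycounter = nbuffers;

    assert(nbuffers > 0);

    for (;;)
    {
        int        idx = *hand;
        SimBuffer *buf = &bufs[idx];

        *hand = (*hand + 1) % nbuffers;
        (*steps)++;

        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
            {
                /* Unpinned but recently used: age it and keep going. */
                buf->usage_count--;
                trycounter = nbuffers;  /* progress made, restart count */
            }
            else
                return idx;             /* unpinned, usage count zero */
        }
        else if (--trycounter == 0)
            return -1;  /* nbuffers pinned buffers in a row, no progress */
    }
}
```

In this model, with 8 unpinned buffers all starting at usage count 5,
the first call examines 41 buffers (five full passes of decrements,
then one more step to find the victim) and the next call examines
only 1, which is the "one-time thing" described above.  With a single
unpinned buffer at usage count 2 sitting between two pinned ones, the
call succeeds after 8 examinations; if trycounter were not reset on
each decrement, it would reach zero on the fourth examination and
report no unpinned buffers even though one exists.
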
If I add a spurious
"LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
LWLockRelease(BufFreelistLock);" to each execution of
StrategyGetBuffer, then contention kicks in twice as fast.  But if I
instead hack the clock sweep to run twice as far (ignore the first
eligible buffer it finds, and go find another one), all under the
cover of a single BufFreelistLock acquisition, there is no meaningful
increase in contention.

This was all on a 4 socket x 2 core/socket Opteron machine which I no
longer have access to.  Using a more modern 8-core machine on a single
socket, I can't get it to collapse on BufFreelistLock at all,
presumably because the cache coherence mechanisms are so much faster.

> Shouldn't SBF() work on advisory basis and try to
> force a buffer after N failed usage count attempts?

I believe Simon tried that a couple of commitfests ago, and no one
could show that it made a difference.

Cheers,

Jeff

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general