From: Robert Haas [robertmhaas@xxxxxxxxx] Sent: Thursday, March 19, 2009 8:45 PM To: Scott Carey Cc: Jignesh K. Shah; Greg Smith; Kevin Grittner; pgsql-performance@xxxxxxxxxxxxxx Subject: Re: Proposal of tunable fix for scalability of 8.4 > > >On Thu, Mar 19, 2009 at 5:43 PM, Scott Carey <scott@xxxxxxxxxxxxxxxxx> wrote: > >> Well, unless I'm misunderstanding something, waking all waiters every > >> time could lead to arbitrarily long delays for writers on mostly > >> read-only workloads... and by arbitrarily along, we mean to say > >> "potentially just about forever". That doesn't sound safe for > >> production to me. > > > > The other discussion going on indicates that that condition already can > > happen, shared can always currently cut in line while other shared locks > > have the lock, though I don't understand all the details. > > No. If the first process waiting for an LWLock wants an exclusive > lock, we wake up that process, and only that process. If the first > process waiting for an LWLock wants a shared lock, we wake up that > process, and the processes which it follow it in the queue that also > want shared locks. But if we come to a process which holds an > exclusive lock, we stop. So if the wait queue looks like this > SSSXSSSXSSS, then the first three processes will be woken up, but the > remainder will not. The new wait queue will look like this: XSSSXSSS > - and the exclusive waiter at the head of the queue is guaranteed to > get the next turn. Your description (much of which I cut out) is exactly how I understood it until Simon Riggs' post which changed my view and understanding. Under that situation, waking all shared will leave all XXXXX at the front and hence alternate shared/exclusive/shared/exclusive as long as both types are contending. Simon's post changed my view. Below is some cut/paste from it: NOTE: things without a > in front here represent Simon until the ENDQUOTE: QUOTE ----------- On Wed, 2009-03-18 at 11:45 +0000, Matthew Wakeling wrote: > On Wed, 18 Mar 2009, Simon Riggs wrote: > > I agree with that, apart from the "granting no more" bit. > > > > The most useful behaviour is just to have two modes: > > * exclusive-lock held - all other x locks welcome, s locks queue > > * shared-lock held - all other s locks welcome, x locks queue > > The problem with making all other locks welcome is that there is a > possibility of starvation. Imagine a case where there is a constant stream > of shared locks - the exclusive locks may never actually get hold of the > lock under the "all other shared locks welcome" strategy. That's exactly what happens now. ---------- > [Scott Carey] (Further down in Simon's post, a quote from months ago: ) ---------- "Each time a Shared request is dequeued, we effectively re-enable queue jumping, so a Shared request arriving during that point will actually jump ahead of Shared requests that were unlucky enough to arrive while an Exclusive lock was held. Worse than that, the new incoming Shared requests exacerbate the starvation, so the more non-adjacent groups of Shared lock requests there are in the queue, the worse the starvation of the exclusive requestors becomes. We are effectively randomly starving some shared locks as well as exclusive locks in the current scheme, based upon the state of the lock when they make their request." ENDQUOTE ( Simon Riggs, cut/paste by me. post from his post Wednesday 3/18 5:10 AM pacific time). ------------------ I read that to mean that what is happening now is that in ADDITION to your explanation of how the queue works, while a batch of shared locks are executing, NEW shared locks execute immediately and don't even queue. That is, there is shared request queue jumping. The queue operates as your description but not everythig queues. It seems pretty conclusive if that is truthful -- that there is starvation possible in the current system. At this stage, it would seem that neither of us are experts on the current behavior, or that Simon is wrong, or that I completely misunderstood his comments above. > Now, of course, EVENTUALLY one of the X guys will probably beat out > all the S-lock waiters and he'll get to do his thing. But there's no > upper bound on how long this can take, and if the rate at which S-lock > waiters are joining the queue is much higher than the rate at which > X-lock waiters are joining the queue, it may be quite a long time. And the average expected time and distribution of those events can be statistically calculated and empirically measured. The fact that there is a chance at all is not as important as the magitude of the chance and the distribution of those probabilities. > Even if the overall system throughput is better with this change, the > fact that the guys who need the X-lock get seriously shafted is a > really serious problem. If 'serious shafting' is so, yes! We only disagree on the current possibility of this and the magnitude/likelihood of it. By Simon's comments above the starvation possiblility is already the case. I am merely using that discussion as evidence. It may be wrong, so in reality we agree overall but both don't have enough knowledge to go much beyond that. I think we can both agree that IF the current system is unfair, then the 'wake all' system is roughly as unfair, and perhaps even more fair and that testing evidence (averages and standar deviations too!) should guide us. If the current system is truly fair and cannot have starvation, then the 'wake all' setup would be a step backwards on that front. That is why my early comments on this were to wake only the shared or alternate. (I think an unfair simple 'wake all' lock is still useful for experimentation and testing and perhaps configuration --we may differ on that). > If I start a million transactions on my > system and they all complete in average of 1 second each, that sounds > pretty good - unless it's because 999,999 of them completed almost > instantaneously and the last one took a million seconds. Measuring standard deviation / variance is always important. Averages alone are surely not good enough. Whether this is average time to commit a transaction (low level) or the average cost of a query plan (higher level), consistency is highly valuable. Better to have slightly longer average times and very high consistency than the opposite. > > Also, the tests on the 'wake all' version clearly aren't starving anything > > in a load test with thousands of threads and very heavy lock contention, > > mostly for shared locks. > > Instead throughput increases and all wait times decrease. > On the average, yes... I agree we would need more than the average to be confident. Although I am not opposed to letting a user decide between the two -- gaining performance and sacrificing some consistency. Its a common real-world tradeoff. > > There are several other proposals to make starvation less possible (wake > > only shared and other proposals that alternate between shared and exclusive; > > waking only X sized chunks, etc -- its all just investigation into fixing > > what can be improved on -- solutions that are easily testable should not > > just be thrown out: the first ones were just the easiest to try). > > Alternating between shared and exclusive is safe. But a lot more > testing in a lot more situations would be needed to determine whether > it is better, I think. Waking chunks of a certain size I believe will > produce a more complicated version of the problem described above. > > ...Robert The alternating proposal is the most elegant and based on my experience should also perform well. The two list solution for this is simpler and can probably be done without locking on the list adding with atomics (compare and set/swap). Appending to a linked list can be done lock-free safely as can atomically swapping out lists. Predominantly lock-free is the way to go for heavily contended situations like this. The proposal that compacts the list by freeing all shared, and compacts the exclusive remainders probably requires more locking and contention due to more complex list manipulation. I agree that the chunk version is probably more complicated than needed. Our disagreement here revolves around two things I believe: What the current functionality actually is, and how useful the brute force simple lock is as a tool and as a config option. -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance