On 03/18/09 17:25, Robert Haas wrote:
> On Wed, Mar 18, 2009 at 1:43 PM, Scott Carey <scott@xxxxxxxxxxxxxxxxx> wrote:
>> It's worth ruling out given that even if the likelihood is small, the
>> fix is easy. However, I don't see the throughput drop from peak as
>> more concurrency is added that is the hallmark of this problem --
>> usually with a lot of context switching and a sudden increase in CPU
>> use per transaction.
>
> The problem is that the proposed "fix" bears a strong resemblance to
> attempting to improve your gas mileage by removing a few non-critical
> parts from your car, like, say, the bumpers, muffler, turn signals,
> windshield wipers, and emergency brake.
>
>> The fix I was referring to as easy was using a connection pooler --
>> as a reply to the previous post. Even if it's a low likelihood that
>> the connection pooler fixes this case, it's worth looking at.
>
> Oh, OK. There seem to be some smart people saying that's a pretty
> high-likelihood fix. I thought you were talking about the proposed
> locking change.
>
> While it's true that the car might be drivable in that condition (as
> long as nothing unexpected happens), you're going to have a hard time
> convincing the manufacturer to offer that as an options package.
>
>> The original poster's request is for a config parameter, for
>> experimentation and testing by the brave. My own request was for that
>> version of the lock to prevent possible starvation but improve
>> performance by unlocking all shared waiters at once, then doing all
>> exclusives one at a time next, etc.
>
> That doesn't prevent starvation in general, although it will for some
> workloads. Anyway, it seems rather pointless to add a config parameter
> that isn't at all safe, and adds overhead to a critical part of the
> system for people who don't use it. After all, if you find that it
> helps, what are you going to do? Turn it on in production? I just
> don't see how this is any good other than as a thought-experiment.
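Scott's proposed release policy -- wake every waiting shared acquirer in one batch, then service the queued exclusive acquirers one at a time -- can be sketched as a toy model. This is my own illustration, not the actual LWLock code or any submitted patch; the round structure and names are invented:

```python
def batched_grant_order(waves):
    """waves[i] lists the (name, mode) lock requests that arrive while
    round i is in progress; mode is 'S' (shared) or 'X' (exclusive).
    Returns the grant order: each tuple is a batch of shared holders
    granted together, each bare name is an exclusive grant."""
    order = []
    for wave in waves:
        # Wake every waiting shared acquirer at once; they can all hold
        # the lock concurrently, so they run as a single batch ...
        shared = tuple(name for name, mode in wave if mode == 'S')
        if shared:
            order.append(shared)
        # ... then grant the waiting exclusive acquirers one at a time.
        order.extend(name for name, mode in wave if mode == 'X')
    return order
```

In this toy model an exclusive waiter is delayed by at most one shared batch per round, which is why the scheme avoids starvation for some workloads; as Robert points out, that guarantee does not hold in general, for instance if late-arriving shared requests are allowed to join the batch currently being serviced.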
Actually, the patch I submitted shows no overhead from what I have seen, and I think it is useful: depending on the workload, it can be turned on even in production.

> At any rate, as I understand it, even after Jignesh eliminated the
> waits, he wasn't able to push his CPU utilization above 48%. Surely
> something's not right there. And he also said that when he added a
> knob to control the behavior, he got a performance improvement even
> when the knob was set to 0, which corresponds to the behavior we have
> already anyway. So I'm very suspicious that there's something wrong
> with either the system or the test. Until that's understood and fixed,
> I don't think that looking at the numbers is worth much.

I don't think anything is majorly wrong in my system. Sometimes it is PostgreSQL locks in play, and sometimes it can be OS/system-related locks in play (network, IO, file system, etc.). Right now in my patch, after I fix the waiting-on-ProcArrayLock problem, other PostgreSQL locks come into play: CLogControlLock, WALInsertLock, etc.

Right now, out of the box, we have no means of tweaking anything in production if you do land in that problem. With the patch there is a knob for tweaking the lock bottlenecks for the main workload for which the system is put in production. I still haven't seen any downsides with the patch other than highlighting other bottlenecks in the system. (For example, I haven't seen a run where the tpm on my workload decreases as you increase the knob.)

What I am suggesting is: run the patch, see if you find a workload where performance goes down, and check the lock-statistics output to see whether it is pushing the bottleneck elsewhere, most likely to WALInsertLock or CLogControlLock. If yes, then this patch gives you the right tweaking opportunity to reduce stress on ProcArrayLock for a workload while still not seriously stressing WALInsertLock or CLogControlLock.

Right now, the standard answer applies:
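The knob behavior described above -- 0 corresponding to the behavior we have already, a higher setting enabling the alternate wake-up policy -- can be illustrated with a small grant function. This is my own sketch, not Jignesh's actual patch or its tunable, and the FIFO model of the stock lock is deliberately simplified:

```python
def grant_order(waiters, knob=0):
    """waiters: (name, mode) pairs in queue order, mode 'S' or 'X'.
    knob=0: a simplified model of the stock FIFO behavior -- wake the
    waiter at the head of the queue, together with any consecutive
    shared waiters immediately behind it.
    knob=1: the alternate policy -- wake ALL shared waiters as one
    batch, then the exclusive waiters one at a time."""
    if knob == 0:
        order, i = [], 0
        while i < len(waiters):
            if waiters[i][1] == 'X':
                order.append(waiters[i][0])  # exclusive: wake one
                i += 1
            else:
                batch = []  # shared run at the head: wake together
                while i < len(waiters) and waiters[i][1] == 'S':
                    batch.append(waiters[i][0])
                    i += 1
                order.append(tuple(batch))
        return order
    # knob=1: all shared first, then exclusives in queue order.
    shared = tuple(name for name, mode in waiters if mode == 'S')
    return ([shared] if shared else []) + \
           [name for name, mode in waiters if mode == 'X']
```

With `knob=0` the function degenerates to plain queue order, which matches the report that setting the knob to 0 should reproduce today's behavior.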
"Nope, you are running the wrong workload for PostgreSQL; use a connection pooler or your own application logic." Or maybe: "You have too many users for PostgreSQL; use some proprietary database."

-Jignesh

>> I alluded to the three main ways of dealing with lock contention
>> elsewhere: avoiding locks, making finer-grained locks, and making
>> locks faster. All are worthy. Some are harder to do than others. Some
>> have been heavily tuned already. It's a case-by-case basis. And
>> regardless, the unfair lock is a good test tool.
>
> In view of the caveats above, I'll give that a firm maybe.
>
> ...Robert
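The "use a connection pooler" answer that keeps coming up in this thread amounts to capping the number of live backends and making extra clients wait for a free connection instead of each opening their own. A minimal sketch of the idea, assuming a generic `connect` factory (a real deployment would use pgbouncer or a driver-level pool, not this toy class):

```python
import queue


class ConnectionPool:
    """Toy fixed-size pool: at most `size` connections ever exist;
    callers beyond that block instead of opening a new backend."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())  # pre-open the whole pool

    def acquire(self, timeout=None):
        # Blocks (up to `timeout` seconds) until a connection is free;
        # raises queue.Empty if the pool stays exhausted.
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        self._idle.put(conn)  # hand the connection to the next waiter
```

The point of the cap is exactly the contention question being debated: with a bounded number of backends, far fewer processes ever queue on ProcArrayLock and friends, regardless of how many clients the application has.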