On 03/12/09 13:48, Scott Carey wrote:
> On 3/11/09 7:47 PM, "Tom Lane" <tgl@xxxxxxxxxxxxx> wrote:
>
> All I'm adding is that it makes some sense to me, based on my experience
> in CPU / RAM bound scalability tuning. It was expressed that the test
> itself didn't even make sense.

SSDs are precisely my motivation for doing RAM-based tests with PostgreSQL. While I am waiting for my SSDs to arrive, I have started emulating them by putting the whole database on a RAM disk, which in a sense is better than an SSD; if we can tune with RAM disks, then SSDs are covered as well.

What we have is a pool of 2000 users. Each user does a series of transactions on different rows, and we watch how far the database scales linearly before some bottleneck (system or database) kicks in and there is no further linear increase with active users. Many times there is a drop after reaching some number of active users. If all 2000 users scale linearly, another test with, say, 2500 users can be run. The whole point is to find the limit we can reach while there are typically still system resources left unexploited.

The test kit I am using is a lightweight OLTP-ish workload which each user runs against a known schema, with an emulated think time of 200 ms between the transactions it executes. In that sense it emulates a real user who clicks, waits to see the result, and then clicks again, causing another transaction (not exactly, but you get the point). Like all such workloads, it is generally used to find bottlenecks in a system before putting production load on it.

In my current environment I am running this kind of workload and seeing how many users I can add before the system has no CPU left to sustain linear growth in tpm. Generally, as many of you mentioned, you run into disk latency, network latency, CPU shortage, and so on, and that is exactly the work I am doing right now. I work around network latency by using a private network and tuning operating-system settings for efficiency there; I work around disk latency by putting the data on a RAM disk (and soon on SSDs). If, after all that, I still cannot consume all the CPU, it probably means I am hitting locks, and with the PostgreSQL DTrace probes I can see what is happening.
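To make the shape of one emulated user concrete, here is a minimal sketch in C using libpq (purely illustrative, not the actual test kit; the connection string, the "accounts" table, and the single-row UPDATE are made-up placeholders) of the transaction / 200 ms think-time loop described above:

/* toy_user.c -- illustrative sketch of one emulated user: run a small
 * transaction, think for 200 ms, repeat.  Not the real test kit; the
 * DSN, table and query below are hypothetical.
 * Build:  cc toy_user.c -o toy_user -lpq
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=testdb");    /* hypothetical DSN */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (int i = 0; i < 10000; i++)
    {
        /* one small OLTP-style transaction touching a single row */
        char      sql[128];
        PGresult *res;

        snprintf(sql, sizeof(sql),
                 "UPDATE accounts SET hits = hits + 1 WHERE id = %d",
                 rand() % 2000 + 1);

        res = PQexec(conn, "BEGIN");   PQclear(res);
        res = PQexec(conn, sql);       PQclear(res);
        res = PQexec(conn, "COMMIT");  PQclear(res);

        usleep(200 * 1000);     /* 200 ms think time between transactions */
    }

    PQfinish(conn);
    return 0;
}

The real kit runs a couple of thousand of these loops concurrently; the think time is what lets the active-user count climb well past the number of CPUs before the machine saturates.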
At low user counts (100 users), the lock profile from a single user's point of view looks like this:

# dtrace -q -s 84_lwlock.d 1764

             Lock Id         Mode       State                Count
       ProcArrayLock       Shared     Waiting                    1
     CLogControlLock       Shared    Acquired                    2
       ProcArrayLock    Exclusive     Waiting                    3
       ProcArrayLock    Exclusive    Acquired                   24
          XidGenLock    Exclusive    Acquired                   24
    FirstLockMgrLock       Shared    Acquired                   25
     CLogControlLock    Exclusive    Acquired                   26
 FirstBufMappingLock       Shared    Acquired                   55
       WALInsertLock    Exclusive    Acquired                   75
       ProcArrayLock       Shared    Acquired                  178
      SInvalReadLock       Shared    Acquired                  378

             Lock Id         Mode       State   Combined Time (ns)
      SInvalReadLock                 Acquired                29849
       ProcArrayLock       Shared     Waiting                92261
       ProcArrayLock                 Acquired               951470
    FirstLockMgrLock    Exclusive    Acquired              1069064
     CLogControlLock    Exclusive    Acquired              1295551
       ProcArrayLock    Exclusive     Waiting              1758033
 FirstBufMappingLock    Exclusive    Acquired              2078507
          XidGenLock    Exclusive    Acquired              3460800
       WALInsertLock    Exclusive    Acquired             12205466
      SInvalReadLock    Exclusive    Acquired             42684236
       ProcArrayLock    Exclusive    Acquired             57397139

As the user count grows beyond 1000, the same sample user's profile changes to the following:

# dtrace -q -s 84_lwlock.d 1764

             Lock Id         Mode       State                Count
     CLogControlLock    Exclusive     Waiting                    1
       WALInsertLock    Exclusive     Waiting                    1
       ProcArrayLock    Exclusive    Acquired                    7
          XidGenLock    Exclusive    Acquired                    7
       ProcArrayLock    Exclusive     Waiting                   10
     CLogControlLock       Shared    Acquired                   13
       WALInsertLock    Exclusive    Acquired                   23
     CLogControlLock    Exclusive    Acquired                   30
       ProcArrayLock       Shared    Acquired                   50
    FirstLockMgrLock       Shared    Acquired                  104
      SInvalReadLock       Shared    Acquired                  105
 FirstBufMappingLock       Shared    Acquired                  106

             Lock Id         Mode       State   Combined Time (ns)
       WALInsertLock    Exclusive     Waiting                73990
     CLogControlLock    Exclusive     Waiting               383066
          XidGenLock    Exclusive    Acquired               408301
     CLogControlLock    Exclusive    Acquired              1871642
       ProcArrayLock                 Acquired              2825372
       WALInsertLock    Exclusive    Acquired              3144580
    FirstLockMgrLock    Exclusive    Acquired              3799818
 FirstBufMappingLock    Exclusive    Acquired              4083473
      SInvalReadLock    Exclusive    Acquired             20611120
       ProcArrayLock    Exclusive    Acquired             37920098
       ProcArrayLock    Exclusive     Waiting           3783942020

That is similar to what I saw last year, and it is the reason I am playing with lwlock.c: to see how LWLockRelease() can be modified to do different kinds of wake-ups, and what impact that has on the top waiting time, which is basically wasted time from the perspective of the application, the operating system, and the CPU. All I am saying is that with some tuning flexibility we can reduce that wasted time and probably spend it in the acquired state doing useful work instead. I don't think I have misconfigured the system; I am just showing that there are ways to cut down some inefficiencies here, and giving test points. I am also showing where it does seem to help performance. It may not help in every case, but I have given you one test where it performs better than what we have today.

And, for the third time: the test users have think time built into them, which is what generally lets you run more users than there are CPUs on the system, and that is exactly what we want to exploit. Otherwise, if every user did its work with no think time at all, we would need 6+ billion CPUs to handle all possible users. Typically, as an administrator (system and database), I can only tweak and control the latencies within my domain, that is, network, disk, CPUs and so on. Those are what I am tuning to arrive at a *configured* environment, and now I am trying to reduce lock contention and waits in PostgreSQL so that we end up with an optimized setup.
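To illustrate the kind of knob being experimented with, here is a small standalone model in C (this is not a patch to lwlock.c and not PostgreSQL code; the queue walk only mirrors the spirit of LWLockRelease()'s "wake the head exclusive waiter, or all consecutive shared waiters" behaviour, and the cap on shared wake-ups is the hypothetical tunable under discussion):

/* wakeup_model.c -- standalone toy model of waking waiters from an
 * LWLock-style wait queue, with an optional cap on how many shared
 * waiters are released at once.  Purely illustrative.
 * Build:  cc wakeup_model.c -o wakeup_model
 */
#include <stdio.h>
#include <stdbool.h>

typedef struct Waiter
{
    int            pid;        /* stand-in for a waiting backend */
    bool           exclusive;  /* waiting for exclusive or shared mode? */
    struct Waiter *next;
} Waiter;

/*
 * Wake waiters at the head of the queue.  With cap <= 0 this mimics the
 * stock idea: an exclusive waiter at the head is woken alone, otherwise
 * every consecutive shared waiter at the head is woken.  With a positive
 * cap, at most 'cap' shared waiters are woken and the rest stay queued.
 * Returns the new head of the queue.
 */
static Waiter *
release_waiters(Waiter *head, int cap)
{
    int woken = 0;

    if (head == NULL)
        return NULL;

    if (head->exclusive)
    {
        printf("wake exclusive waiter %d\n", head->pid);
        return head->next;
    }

    while (head != NULL && !head->exclusive && (cap <= 0 || woken < cap))
    {
        printf("wake shared waiter %d\n", head->pid);
        head = head->next;
        woken++;
    }
    return head;
}

int
main(void)
{
    /* queue: four shared waiters followed by one exclusive waiter */
    Waiter w5 = {105, true,  NULL};
    Waiter w4 = {104, false, &w5};
    Waiter w3 = {103, false, &w4};
    Waiter w2 = {102, false, &w3};
    Waiter w1 = {101, false, &w2};

    puts("uncapped:");
    release_waiters(&w1, 0);    /* wakes 101, 102, 103, 104 */

    puts("capped at 2 shared wake-ups:");
    release_waiters(&w1, 2);    /* wakes 101 and 102 only */

    return 0;
}

The interesting question, which only the throughput runs can answer, is whether waking fewer waiters per release keeps the "Exclusive Waiting" time on ProcArrayLock from blowing up the way it does in the second profile above.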
I am trying another run in which I limit the number of woken-up waiters to a pre-configured value, to see how various settings pan out in terms of throughput on this server.

Regards,
Jignesh