Re: User concurrency thresholding: where do I look?


 



I will look for runs with longer samples.

You're right that the script could be mislabeling lock names. In the meantime I dug into the lock whose counts seem to increase over time and took stack profiles of where those acquisitions come from; here are some numbers.
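
In case it helps, here is a minimal sketch of the kind of DTrace aggregation that can produce per-stack tallies like the ones below. It is not the exact script used for these numbers; it assumes that arg0 of LWLockAcquire is the numeric LockID, that 12 is the lock of interest, and that summed elapsed time is the quantity being reported -- all of which should be checked against the source of the version being profiled.

        /* run as: dtrace -p <backend-pid> -s lwlock_stacks.d (hypothetical file name) */
        pid$target::LWLockAcquire:entry
        /arg0 == 12/                       /* assumed: arg0 = LWLockId, 12 = suspect lock */
        {
                self->ts = timestamp;
        }

        pid$target::LWLockAcquire:return
        /self->ts/
        {
                /* charge the time spent acquiring the lock to the calling stack */
                @acquire[ustack()] = sum(timestamp - self->ts);
                self->ts = 0;
        }

        tick-10s
        {
                printa(@acquire);
                exit(0);
        }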


For 600-850 users, that potentially mislabeled CheckPointStartLock (LockID==12) is acquired from various call sites; while the system is still doing great, the top sources are:



             postgres`LWLockAcquire+0x1c8
             postgres`SimpleLruReadPage_ReadOnly+0xc
             postgres`TransactionIdGetStatus+0x14
             postgres`TransactionLogFetch+0x58
             postgres`TransactionIdDidCommit+0x4
             postgres`HeapTupleSatisfiesSnapshot+0x234
             postgres`heap_release_fetch+0x1a8
             postgres`index_getnext+0xf4
             postgres`IndexNext+0x7c
             postgres`ExecScan+0x8c
             postgres`ExecProcNode+0xb4
             postgres`ExecutePlan+0xdc
             postgres`ExecutorRun+0xb0
             postgres`PortalRunSelect+0x9c
             postgres`PortalRun+0x244
             postgres`exec_execute_message+0x3a0
             postgres`PostgresMain+0x1300
             postgres`BackendRun+0x278
             postgres`ServerLoop+0x63c
             postgres`PostmasterMain+0xc40
         8202100

             postgres`LWLockAcquire+0x1c8
             postgres`TransactionIdSetStatus+0x1c
             postgres`RecordTransactionCommit+0x2a8
             postgres`CommitTransaction+0xc8
             postgres`CommitTransactionCommand+0x90
             postgres`finish_xact_command+0x60
             postgres`exec_execute_message+0x3d8
             postgres`PostgresMain+0x1300
             postgres`BackendRun+0x278
             postgres`ServerLoop+0x63c
             postgres`PostmasterMain+0xc40
             postgres`main+0x394
             postgres`_start+0x108
        30822100


However, at 900 users, where the big drop in throughput occurs, the profile shows a different set of top "consumers" of time:



             postgres`LWLockAcquire+0x1c8
             postgres`TransactionIdSetStatus+0x1c
             postgres`RecordTransactionCommit+0x2a8
             postgres`CommitTransaction+0xc8
             postgres`CommitTransactionCommand+0x90
             postgres`finish_xact_command+0x60
             postgres`exec_execute_message+0x3d8
             postgres`PostgresMain+0x1300
             postgres`BackendRun+0x278
             postgres`ServerLoop+0x63c
             postgres`PostmasterMain+0xc40
             postgres`main+0x394
             postgres`_start+0x108
       406601300

             postgres`LWLockAcquire+0x1c8
             postgres`SimpleLruReadPage+0x1ac
             postgres`TransactionIdGetStatus+0x14
             postgres`TransactionLogFetch+0x58
             postgres`TransactionIdDidCommit+0x4
             postgres`HeapTupleSatisfiesUpdate+0x360
             postgres`heap_lock_tuple+0x27c
             postgres`ExecutePlan+0x33c
             postgres`ExecutorRun+0xb0
             postgres`PortalRunSelect+0x9c
             postgres`PortalRun+0x244
             postgres`exec_execute_message+0x3a0
             postgres`PostgresMain+0x1300
             postgres`BackendRun+0x278
             postgres`ServerLoop+0x63c
             postgres`PostmasterMain+0xc40
             postgres`main+0x394
             postgres`_start+0x108
       444523100

             postgres`LWLockAcquire+0x1c8
             postgres`SimpleLruReadPage+0x1ac
             postgres`TransactionIdGetStatus+0x14
             postgres`TransactionLogFetch+0x58
             postgres`TransactionIdDidCommit+0x4
             postgres`HeapTupleSatisfiesSnapshot+0x234
             postgres`heap_release_fetch+0x1a8
             postgres`index_getnext+0xf4
             postgres`IndexNext+0x7c
             postgres`ExecScan+0x8c
             postgres`ExecProcNode+0xb4
             postgres`ExecutePlan+0xdc
             postgres`ExecutorRun+0xb0
             postgres`PortalRunSelect+0x9c
             postgres`PortalRun+0x244
             postgres`exec_execute_message+0x3a0
             postgres`PostgresMain+0x1300
             postgres`BackendRun+0x278
             postgres`ServerLoop+0x63c
             postgres`PostmasterMain+0xc40
      1661300000



Maybe you all will understand more of what is going on here than I do. To me it looks like the index scan path (IndexNext) has a problem at a high number of users, but I could be wrong.
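
One hedged way to cross-check the lock labeling Tom questions below, while also sampling over the longer window he suggests, would be a per-mode tally for the suspect LockID; the mode values assume the 8.2-era LWLockMode enum (LW_EXCLUSIVE == 0, LW_SHARED == 1) and should be verified against lwlock.h:

        pid$target::LWLockAcquire:entry
        /arg0 == 12/
        {
                /* assumed: arg1 = LWLockMode, 0 = LW_EXCLUSIVE, 1 = LW_SHARED */
                @bymode[arg1 == 0 ? "exclusive" : "shared"] = count();
        }

        tick-60s
        {
                /* sample for a full minute, per the advice below */
                printa(@bymode);
                exit(0);
        }

Over a full minute the exclusive/shared split should be much less noisy, and it should make it clearer whether backends really are taking this lock exclusively at commit time, as the TransactionIdSetStatus stacks suggest.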

-Jignesh




Tom Lane wrote:
> "Jignesh K. Shah" <J.K.Shah@xxxxxxx> writes:
>> The count is only for a 10-second snapshot.. Plus remember there are
>> about 1000 users running so the connection being profiled only gets
>> 0.01 of the period on CPU.. And the count is for that CONNECTION only.
>
> OK, that explains the low absolute levels of the counts, but if the
> counts are for a regular backend then there's still a lot of bogosity
> here.  Backends won't be taking the CheckpointLock at all, nor do they
> take CheckpointStartLock in exclusive mode.  The bgwriter would do that
> but it'd not be taking most of these other locks.  So I think the script
> is mislabeling the locks somehow.
>
> Also, elementary statistics should tell you that a sample taken as above
> is going to have enormous amounts of noise.  You should be sampling over
> a much longer period, say on the order of a minute of runtime, to have
> numbers that are trustworthy.
>
> 			regards, tom lane

