Merlin,
Yeah -- you're right, this is definitely a spinlock issue. Next steps:

*) in mostly read workloads, we have a couple of known frequent offenders, in particular the 'BufFreelistLock'. One way we can influence that guy is to try significantly lowering/raising shared buffers. So this is one thing to try.
The server has 32GB of RAM, with 3.2GB dedicated to shared buffers. I've increased it to 13.2GB and the stall still happened. (I have a way of controlling the number of queries hitting PostgreSQL by shutting down our own memcache-based app cache, so to test for the stall I temporarily shut down a few memcached servers.)
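For the record, the change itself was just the shared_buffers line in postgresql.conf plus a restart -- a minimal sketch, assuming postgresql.conf is edited directly (shared_buffers only takes effect after a restart):

    # postgresql.conf
    shared_buffers = 13200MB    # was 3200MB

    -- verify after the restart, from psql:
    SHOW shared_buffers;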
*) failing that, the LWLOCK_STATS macro can be compiled in to give us some information about the particular lock(s) we're blocking on. Hopefully it's an lwlock -- that will make diagnosing the problem easier.
I've enabled that macro and am seeing lwlock messages flying by in the log (see below), even when no high-sys-cpu stall is observed at the moment. Should I be looking for something in particular?
PID 17293 lwlock 5906: shacq 1 exacq 0 blk 0
PID 17293 lwlock 5932: shacq 1 exacq 0 blk 0
PID 17293 lwlock 5934: shacq 1 exacq 0 blk 0
PID 17293 lwlock 10854: shacq 1 exacq 0 blk 0
PID 17293 lwlock 10856: shacq 4 exacq 0 blk 0
PID 17293 lwlock 10858: shacq 15 exacq 0 blk 0
PID 17293 lwlock 16442: shacq 4 exacq 0 blk 0
PID 17293 lwlock 16596: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20458: shacq 3 exacq 0 blk 0
PID 17293 lwlock 20460: shacq 10 exacq 0 blk 0
PID 17293 lwlock 20464: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20466: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20480: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20482: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20484: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20508: shacq 12 exacq 0 blk 0
PID 17293 lwlock 20510: shacq 6 exacq 0 blk 0
PID 17293 lwlock 20938: shacq 11 exacq 0 blk 0
PID 17293 lwlock 20940: shacq 22 exacq 0 blk 0
PID 17293 lwlock 20942: shacq 28 exacq 0 blk 0
PID 17293 lwlock 20944: shacq 56 exacq 0 blk 0
PID 17293 lwlock 20946: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20948: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20950: shacq 11 exacq 0 blk 0
PID 17293 lwlock 20952: shacq 3 exacq 0 blk 0
PID 17293 lwlock 20954: shacq 7 exacq 0 blk 0
PID 17293 lwlock 20956: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20958: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20960: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20962: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20964: shacq 2 exacq 0 blk 0
PID 17293 lwlock 46624: shacq 24 exacq 0 blk 0
PID 17293 lwlock 68126: shacq 6 exacq 0 blk 0
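In case it helps to digest the output, here is a small throwaway script -- just a sketch, assuming the LWLOCK_STATS lines have been collected into a file (the name lwlock_stats.log below is made up) -- that sums the counters per lock id and sorts by the blk column, since blk counts how often an acquisition had to block:

    import re
    import sys
    from collections import defaultdict

    # Aggregate "PID <pid> lwlock <id>: shacq <n> exacq <n> blk <n>" lines per lock id.
    pattern = re.compile(r'PID \d+ lwlock (\d+): shacq (\d+) exacq (\d+) blk (\d+)')
    totals = defaultdict(lambda: [0, 0, 0])   # lock id -> [shacq, exacq, blk]

    with open(sys.argv[1]) as f:              # e.g. lwlock_stats.log
        for line in f:
            m = pattern.search(line)
            if m:
                lock = int(m.group(1))
                for i in range(3):
                    totals[lock][i] += int(m.group(2 + i))

    # Show the 20 locks with the most blocked acquisitions first.
    for lock, (shacq, exacq, blk) in sorted(totals.items(),
                                            key=lambda kv: kv[1][2],
                                            reverse=True)[:20]:
        print("lwlock {:6d}: shacq {:8d} exacq {:8d} blk {:8d}".format(
            lock, shacq, exacq, blk))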
*) if we're not blocking on an lwlock, it's possibly a buffer pin related issue? I've seen this before, for example on an index scan that is dependent on a seq scan. This long thread: "http://postgresql.1045698.n5.nabble.com/9-2beta1-parallel-queries-ReleasePredicateLocks-CheckForSerializableConflictIn-in-the-oprofile-td5709812i100.html" has a lot of information about that case and deserves a review.

*) we can consider experimenting with futexes (http://archives.postgresql.org/pgsql-hackers/2012-06/msg01588.php) to see if things improve. This is dangerous and could crash your server/eat your data, so fair warning.

merlin
--
vlad