On Mon, Nov 19, 2012 at 10:50 AM, Vlad <marchenko@xxxxxxxxx> wrote: > I just did a little experiment: extracted top four queries that were > executed the longest during stall times and launched pgbench test with 240 > clients. Yet I wasn't able to put the server into a stall with that. Also > load average was hitting 120+, it was all user cpu, single digit % system. > The system remained pretty responsive (on a keypress), in contrary to when > high-sys-cpu stall happens, showing similar LA numbers. > > This makes me think that there is probably a different (than ones I tried) > query or condition that is responsible for creating high-sys-cpu. Also, from > my limited knowledge of postgresql and kernel internals, this experiment > puts under question idea that linux scheduler might be the corner stone.... > > Here is a longer log, first column is PID. High-sys-cpu stall happened > around 10:42:37 > https://dl.dropbox.com/u/109778/postgresql-2012-11-19_103425-stripped.log > > I'll try to put pgbouncer in place and repeat, also though this will take > longer time to setup. yeah. interesting -- contention was much higher this time and that changes things. strange how it was missed earlier. you're getting bounced around a lot in lwlock especially (unfortunately we don't know which one). I'm going to hazard another guess: maybe the trigger here is when the number of contending backends exceeds some critical number (probably based on the number of cores) you see a quick cpu spike (causing more backends to lock and pile up) as cache line bouncing sets in. That spike doesn't last long, because the spinlocks quickly accumulate delay counts then punt to the scheduler which is unable to cope. The exact reason why this is happening to you in exactly this way (I've never seen it) is unclear. Also the line between symptom and cause is difficult to draw. unfortunately, in your case spinlock re-scheduling isn't helping. log entries like this one: 18764 [2012-11-19 10:43:50.124 CST] LOG: JJ spin delay from file sinvaladt.c line 512 delay 212, pointer 0x7f514959a394 at character 29 are suggesting major problems. you're dangerously close to a stuck spinlock which is lights out for the database. merlin -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general