On 10/27/2011 02:27 PM, Scott Mead wrote:
On Thu, Oct 27, 2011 at 2:39 PM, Brian Fehrle <brianf@xxxxxxxxxxxxxxxxxxx> wrote:
Hi all, need some help/clues on tracking down a performance
issue.
PostgreSQL version: 8.3.11
I've got a system that has 32 cores and 128 gigs of ram. We
have connection pooling set up, with about 100 - 200
persistent connections open to the database. Our applications
then use these connections to query the database constantly,
but when a connection isn't currently executing a query, it's
<IDLE>. On average, at any given time, there are 3 - 6
connections that are actually executing a query, while the
rest are <IDLE>.
Remember, when you read pg_stat_activity, it is showing you
query activity from that exact moment in time. Just because it
looks like only 3-6 connections are executing doesn't mean that
200 aren't actually executing statements that finish in under
0.1 ms. With such a beefy box, I would see if you can examine
any stats from your connection pooler to find out how many
connections are actually getting used.
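Something like this, run every few seconds from a watch/cron loop,
would at least show how long the non-idle statements have actually
been running between your polls (just a sketch against 8.3's
pg_stat_activity):

    -- non-idle backends and how long their current statements have been running
    SELECT procpid, usename,
           now() - query_start AS runtime,
           waiting,
           current_query
      FROM pg_stat_activity
     WHERE current_query <> '<IDLE>'
     ORDER BY query_start;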
Correct, we're getting a few hundred transactions per second, but
under normal operation polling pg_stat_activity shows an average of
3 - 6 queries running at that moment, and those queries run for an
average of 5 - 7 seconds. So my belief is that something happens to
the system where either a) we get a ton more queries than normal
from the application (I'm currently hunting down data to support
this), or b) the overall speed of the system slows down so much that
all queries take long enough for polling pg_stat_activity to
actually catch them.
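To help distinguish (a) from (b), I'm planning to log a count of busy
vs. idle backends every few seconds, roughly like this, so I have a
history to line up against the slowdown windows (just a sketch; 8.3
marks idle sessions in current_query):

    -- count busy vs. idle backends at this instant
    SELECT now() AS sample_time,
           sum(CASE WHEN current_query NOT LIKE '<IDLE>%' THEN 1 ELSE 0 END) AS active,
           sum(CASE WHEN current_query LIKE '<IDLE>%' THEN 1 ELSE 0 END) AS idle
      FROM pg_stat_activity;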
About once a day, queries that normally take just a few
seconds slow way down, and start to pile up, to the point
where instead of just having 3-6 queries running at any given
time, we get 100 - 200. The whole system comes to a crawl, and
looking at top, the CPU usage is 99%.
Looking at top, I see no SWAP usage, very little IOWait, and
there are a large number of postmaster processes at 100% cpu
usage (makes sense, at this point there are 150 or so queries
currently executing on the database).
Tasks: 713 total, 44 running, 668 sleeping, 0 stopped, 1 zombie
Cpu(s): 4.4%us, 92.0%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.2%st
Mem: 134217728k total, 131229972k used, 2987756k free, 462444k buffers
Swap: 8388600k total, 296k used, 8388304k free, 119029580k cached
In the past, we noticed that autovacuum was hitting some large
tables at the same time this happened, so we turned autovacuum
off to see if that was the issue, and it still happened
without any vacuums running.
That was my next question :)
We also ruled out checkpoints being the cause.
How exactly did you rule this out? Just because a
checkpoint is over doesn't mean that it hasn't had a negative
effect on the OS cache. If you're stuck going to disk, that
could be hurting you (that being said, you do point to a low
I/O wait above, so you're probably correct in ruling this
out).
Checkpoint settings were at the install defaults: 5 minute timeout,
0.5 completion target, and 30s warning. Looking at the logs, we were
getting a checkpoint every 5 minutes on the dot.
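For reference, the relevant postgresql.conf entries were essentially
these (checkpoint logging turned on so checkpoints show up in the log;
values as I believe we had them):

    checkpoint_timeout = 5min            # default
    checkpoint_completion_target = 0.5   # default
    checkpoint_warning = 30s             # default
    log_checkpoints = on                 # logs each checkpoint start/finish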
I looked at the data in pg_stat_bgwriter and noticed that buffers
written by checkpoints were nearly 4X those written by the background
writer. So I made some changes to shift more of the writing onto the
background writer, including increasing checkpoint_timeout to 30
minutes and reducing bgwriter_delay from 200ms to 50ms.
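The comparison came from a query roughly like this (the ratio column
is just for eyeballing the 4X figure):

    -- buffers written at checkpoint time vs. by the background writer (8.3 columns)
    SELECT checkpoints_timed, checkpoints_req,
           buffers_checkpoint, buffers_clean, buffers_backend,
           round(buffers_checkpoint::numeric / nullif(buffers_clean, 0), 1)
             AS checkpoint_to_bgwriter_ratio
      FROM pg_stat_bgwriter;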
Checkpoints now happen 30 minutes apart on the dot, and there was no
checkpoint in progress the last time this major slowdown occurred.
I'm currently digging through some statistics I've been
gathering to see whether traffic increased or stayed the same
when the slowdown occurred. I'm also digging through the logs
from the PostgreSQL cluster (I increased verbosity yesterday),
looking for any clues. Any suggestions on where to look to see
what could be causing a slowdown like this would be greatly
appreciated.
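For reference, the kind of logging settings I mean by "increased
verbosity" are along these lines (values are illustrative, not a copy
of our config):

    log_min_duration_statement = 1000   # log any statement taking over 1 second
    log_lock_waits = on                 # log waits longer than deadlock_timeout
    log_checkpoints = on
    log_temp_files = 0                  # log every temp file created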
Are you capturing table-level stats from
pg_stat_user_[tables | indexes]? Just because a server
doesn't look busy doesn't mean that you're not doing 1000
index scans per second returning 1000 tuples each time.
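Something along these lines, sampled periodically and diffed, would
show whether per-table scan volume jumps when the slowdown hits (just
a sketch against 8.3's pg_stat_user_tables):

    -- top tables by tuples read since the stats were last reset
    SELECT relname,
           seq_scan, seq_tup_read,
           idx_scan, idx_tup_fetch,
           n_tup_ins + n_tup_upd + n_tup_del AS writes
      FROM pg_stat_user_tables
     ORDER BY seq_tup_read + idx_tup_fetch DESC
     LIMIT 20;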
I am not grabbing any of those at the moment, I'll look into those.
- Brian F