Re: CPU load spikes when CentOS tries to reclaim 'cached' memory

Jeff Janes <jeff.janes@xxxxxxxxx> · Thu, 5 Jun 2014 08:58:29 -0700

On Wed, Jun 4, 2014 at 5:27 PM, vlasmarias <vlasmarias@xxxxxxxxxxx> wrote:

For the past few days, we've been seeing unexpected extremely high CPU spikes in our system. We observed the following: the 'free' memory would go down to lower than 300 MB; at that point, 'cached' slowly starts to go down, and then CPU starts to go way up.

It's almost as if the OS was not releasing 'cached' memory fast enough for Postgres. Is that analysis correct? Is there a way to fix this?

This sounds like a kernel problem, most likely either the NUMA zone reclaim issue (vm.zone_reclaim_mode) or the transparent huge pages issue.

I don't know the exact details off the top of my head, but both have been discussed at length on this list and on pgsql-hackers.
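For what it's worth, here is a minimal sketch for checking both settings on a Linux box. The sysctl file /proc/sys/vm/zone_reclaim_mode is standard, but the transparent huge pages path varies by distribution (RHEL/CentOS 6 kernels use a redhat_transparent_hugepage directory), so the candidate paths are assumptions about your system and the helper names are purely illustrative.

```python
from pathlib import Path

# Candidate locations; which ones exist depends on the kernel build.
ZONE_RECLAIM = Path("/proc/sys/vm/zone_reclaim_mode")
THP_CANDIDATES = [
    Path("/sys/kernel/mm/transparent_hugepage/enabled"),
    Path("/sys/kernel/mm/redhat_transparent_hugepage/enabled"),
]


def parse_zone_reclaim(text):
    """Return True if zone reclaim is on (any non-zero bitmask)."""
    return int(text.strip()) != 0


def parse_thp(text):
    """Return the active THP mode: the bracketed word in a string
    such as 'always madvise [never]', or None if none is marked."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def report():
    """Print current settings, skipping files that are not present."""
    if ZONE_RECLAIM.exists():
        enabled = parse_zone_reclaim(ZONE_RECLAIM.read_text())
        print("zone_reclaim_mode enabled:", enabled)
    for path in THP_CANDIDATES:
        if path.exists():
            print(path, "->", parse_thp(path.read_text()))
```

Calling report() on the database host prints whatever it finds. If zone reclaim is enabled or THP is set to "always", those are the settings the earlier list threads discuss.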

Here's the session:

 04:58:37 load average: 2.37, free: 532, cached: 22852
 04:58:57 load average: 1.91, free: 451, cached: 22859
 04:59:17 load average: 1.82, free: 469, cached: 22866
 04:59:57 load average: 1.57, free: 387, cached: 22884

What tool is that?  I'm not familiar with this output format.

 max_connections              | 500                

While this is probably fundamentally a kernel problem, you are not doing yourself any favors by allowing 500 connections to a machine with 24 cores.  High numbers of connections can trigger poor kernel behavior.
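To give a rough sense of scale, a commonly cited starting point for the number of active connections (a rule of thumb from the PostgreSQL wiki, not something measured on your system) is about twice the core count plus the number of effective spindles. The function name and the default spindle count of 1 below are assumptions for illustration only.

```python
def suggested_pool_size(cores, effective_spindles=1):
    """Rule-of-thumb starting point; tune it by benchmarking."""
    return cores * 2 + effective_spindles


cores = 24             # from the machine described in this thread
max_connections = 500  # from the posted configuration

pool = suggested_pool_size(cores)
print("suggested starting pool size:", pool)
print("configured connections per core:", round(max_connections / cores, 1))
```

If the application really needs hundreds of client connections, a pooler such as pgbouncer in front of the database is one way to keep the number of active backends near that range.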

Cheers,

Jeff