On Thu, Jun 5, 2014 at 2:47 PM, Deron <fecastle@xxxxxxxxx> wrote:
> We saw very similar issues with a CentOS server with 40 cores (32
> virtualized) when moving from a physical server to a virtual server (I think
> it had 128GB RAM). Never had the problem on a physical server. We checked
> the same things as noted here, but never found a bug. We really thought it
> had something to do with NUMA zone reclaim, but could never prove that.
> In our case it was all kernel time in the guest, all CPUs at 100%.
> Sometimes it would last for a few seconds or minutes. Sometimes we would go
> days without a problem, and then it would completely tank.
>
> If you figure out what is going on, I would like to know (especially if it
> is virtualized).

There is a class of problems in virtualized environments that comes from
over-aggressive reclaiming of memory from the guest by the host. When the
guest then tries to access that 'unpinned' memory, it manifests as
high-latency memory reads and shows up as high user time. That may or may
not be the case here. What we'd need from the OP to get a better diagnosis
is:

*) top/sar output showing whether the load average is due to high user, sys,
   or iowait
*) whether or not the server is virtualized, as noted above
*) a 'perf' snapshot captured during a slowdown, particularly if we are
   seeing high user-space load. For example, we could be looking at high
   spinlock activity (which seems unlikely given how the problem is
   described, but is something to rule out for sure).

merlin
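
P.S. If sar isn't set up on the box, the user/sys/iowait split asked for in the first item can be approximated by sampling the aggregate "cpu" line of /proc/stat twice and comparing. The following is only a rough sketch, assuming a Linux guest with the usual /proc/stat layout; the function names and the 5-second default interval are made up for illustration, and top or sar remain the better tools if available.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat (values are in jiffies).
# The trailing guest fields are skipped on purpose: guest time is already
# counted inside user time.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line of /proc/stat into a dict of counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_breakdown(before, after):
    """Return the percentage of CPU time per category between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample_breakdown(interval=5.0, path="/proc/stat"):
    """Hypothetical helper: sample /proc/stat twice, `interval` seconds apart."""
    with open(path) as f:
        first = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.readline())
    return cpu_breakdown(first, second)
```

Calling sample_breakdown() during a slowdown and looking at whether "user", "system", or "iowait" dominates answers the first question; a high "steal" figure would also hint at contention on the host.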