ps lockups, cgroup memory reclaim

Mark Hills <mark@xxxxxxxx> · Tue, 17 Sep 2013 16:50:42 +0100 (BST)

I'm investigating intermittent kernel lockups in an HPC environment, with 
the Red Hat kernel.

The symptoms are seen as lockups of multiple ps commands, with one 
consuming full CPU:

  # ps aux | grep ps
  root     19557 68.9  0.0 108100   908 ?        D    Sep16 1045:37 ps --ppid 1 -o args=
  root     19871  0.0  0.0 108100   908 ?        D    Sep16   0:00 ps --ppid 1 -o args=

SIGKILL on the busy one causes the other ps processes to run to completion 
(TERM has no effect).

In this case I was able to run my own ps to see the process list, but not 
always.
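
For anyone wanting to spot the stuck processes without relying on ps itself, 
here is a minimal sketch that reads only /proc/<pid>/stat and reports tasks 
in state D. It rests on the assumption that reading stat does not fault in 
the target's pages the way reading cmdline does; I have not verified that on 
the affected kernel. The function names and the proc_root parameter are my 
own, for illustration only.

```python
import os


def parse_stat_state(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so locate the LAST ')' instead of splitting on spaces.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every task whose state field is 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the file was malformed
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Calling uninterruptible_tasks() on an affected node should list the ps 
processes shown above, provided the assumption about stat holds.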

perf shows the locality of the spinning, roughly:

  proc_pid_cmdline
  get_user_pages
  handle_mm_fault
  mem_cgroup_try_charge_swapin
  mem_cgroup_reclaim

There are two entry points; the code paths taken are better shown by the 
attached profile of CPU time.
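
Since the spinning ends in mem_cgroup_reclaim, it may be useful to record 
which memory cgroup the process being inspected belongs to, and how close 
that group is to its limit. The sketch below parses /proc/<pid>/cgroup 
(lines of the form hierarchy-ID:controllers:path) and reads the cgroup v1 
counters memory.usage_in_bytes, memory.limit_in_bytes and memory.failcnt. 
The mount point /cgroup/memory is an assumption and must be adjusted to the 
local setup; the function names are hypothetical.

```python
import os


def memory_cgroup_path(cgroup_text):
    """Return the memory-controller path from /proc/<pid>/cgroup contents."""
    for line in cgroup_text.splitlines():
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        if "memory" in parts[1].split(","):
            return parts[2]
    return None  # no memory controller attached to this task


def read_memcg_counters(cgroup_path, mount_root="/cgroup/memory"):
    """Read usage, limit and failcnt for a cgroup; None where unreadable.

    mount_root is an ASSUMED location of the memory hierarchy.
    """
    base = os.path.join(mount_root, cgroup_path.lstrip("/"))
    counters = {}
    for name in ("memory.usage_in_bytes",
                 "memory.limit_in_bytes",
                 "memory.failcnt"):
        try:
            with open(os.path.join(base, name)) as f:
                counters[name] = int(f.read().strip())
        except (OSError, ValueError):
            counters[name] = None
    return counters
```

A group whose usage sits at its limit with a climbing failcnt would be 
consistent with the reclaim loop seen in the profile, though that alone 
would not prove it is the cause.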

We've had this behaviour since switching to Scientific Linux 6 (based on 
RHEL6, like CentOS) at kernel 2.6.32-279.9.1.el6.x86_64.

The example above is kernel 2.6.32-358.el6.x86_64.

I haven't been able to produce a reproducible case with which to test the 
mainline kernel; our large-scale automated use of ps is effectively acting 
as a fuzz test, and switching those machines to another kernel is 
unfortunately not an option.

Does this issue sound familiar? I'd appreciate any advice or information, 
or pointers to the mainline where such cases have been investigated.

I could not find anything using Google, but this problem does not have a 
distinctive keyword or error message to search for.

Many thanks

-- 
Mark

Attachment: ps-cgroup-reclaim.pdf (Adobe PDF document)