I'm investigating intermittent kernel lockups in an HPC environment, with the Red Hat kernel.

The symptoms are seen as lockups of multiple ps commands, with one consuming full CPU:

  # ps aux | grep ps
  root 19557 68.9  0.0 108100   908 ?  D  Sep16 1045:37 ps --ppid 1 -o args=
  root 19871  0.0  0.0 108100   908 ?  D  Sep16    0:00 ps --ppid 1 -o args=

SIGKILL on the busy one causes the other ps processes to run to completion (TERM has no effect).

In this case I was able to run my own ps to see the process list, but that is not always possible.

perf shows the locality of the spinning, roughly:

  proc_pid_cmdline
  get_user_pages
  handle_mm_fault
  mem_cgroup_try_charge_swapin
  mem_cgroup_reclaim

There are two entry points; the code paths taken are better shown by the attached profile of CPU time.

We've had this behaviour since switching to Scientific Linux 6 (based on RHEL6, like CentOS) at kernel 2.6.32-279.9.1.el6.x86_64. The example above is kernel 2.6.32-358.el6.x86_64.

I haven't been able to get a reproducible case with which to test the mainline kernel. Our large-scale automated use of ps is effectively working as a fuzz test, and switching kernels across that environment is unfortunately not an option.

Does this issue sound familiar? I'd appreciate any advice or information, or pointers to the mainline where such cases have been investigated. I could not find anything using Google, but this problem does not have a keyword or error message to search for.

Many thanks

--
Mark
Attachment: ps-cgroup-reclaim.pdf
Description: Adobe PDF document