Hi,

I've observed this issue previously on an old 3.10 branch but wrote it off because I could not reproduce it in any meaningful way. Currently I am seeing it on a 3.10 branch where the well-known KVM-related and RCU-related issues are more or less all patched.

Way to reach the problematic state:

- run a hypervisor for a very long time: it took a year and a half for the issue to appear on the old branch mentioned above, but on the newer kernel, probably due to higher load, it took roughly half a year,
- suddenly a single VM takes a lock and becomes unresponsive while all of its threads show the Running state. In this state the VM can neither be killed via SIGKILL nor frozen via the freezer cgroup. The only obvious symptom is that it no longer consumes any CPU cycles (no counter in the sched info advances), and of course it can no longer be debugged.

As a result, it is practically impossible to say at a glance where the lock sits, because there are no distinct threads that are at least sleeping and could therefore be ruled out. It looks like I may have hit a pure scheduler issue, so if nothing in the attached recursive stack/status dump rings a bell, I'll CC the scheduler folks. Timer/RCU configs are attached for convenience.

Thanks for looking into this!

stack: http://xdel.ru/downloads/vm-sched-hang/stack.txt
status: http://xdel.ru/downloads/vm-sched-hang/status.txt
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# RCU Subsystem
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_USER_QS=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_NONE is not set
# CONFIG_RCU_NOCB_CPU_ZERO is not set
CONFIG_RCU_NOCB_CPU_ALL=y

# RCU Debugging
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=21
# CONFIG_RCU_CPU_STALL_INFO is not set
# CONFIG_RCU_TRACE is not set
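
For anyone who wants to gather the same kind of per-thread data, here is a rough sketch of how a recursive stack/status dump could be collected and how the "Running but not consuming CPU" symptom could be checked. It is not the script that produced the dumps linked above. It assumes the sched info corresponds to /proc/<pid>/task/<tid>/schedstat, whose first field is the time spent on the CPU in nanoseconds. The function names (snapshot_threads, stalled_threads) and the proc_root parameter are made up for illustration. Reading the per-thread stack file normally requires root and a kernel built with stack trace support.

```python
import os


def read_text(path):
    """Return the file contents, or an empty string if it cannot be read."""
    try:
        with open(path, "r") as f:
            return f.read()
    except OSError:
        return ""


def snapshot_threads(pid, proc_root="/proc"):
    """Map tid -> state, on-CPU nanoseconds and kernel stack for one process.

    proc_root is a hypothetical parameter so the function can be pointed
    at a copy of the proc tree; on a live system leave it at /proc.
    """
    task_dir = os.path.join(proc_root, str(pid), "task")
    tids = [name for name in os.listdir(task_dir) if name.isdigit()]
    result = {}
    for tid in sorted(tids, key=int):
        base = os.path.join(task_dir, tid)
        state = ""
        for line in read_text(os.path.join(base, "status")).splitlines():
            if line.startswith("State:"):
                state = line.split(":", 1)[1].strip()
                break
        # schedstat: on-CPU ns, runqueue wait ns, number of timeslices
        fields = read_text(os.path.join(base, "schedstat")).split()
        run_ns = int(fields[0]) if fields else None
        result[int(tid)] = {
            "state": state,
            "run_ns": run_ns,
            "stack": read_text(os.path.join(base, "stack")),
        }
    return result


def stalled_threads(before, after):
    """Return tids that report state R but gained no on-CPU time."""
    stalled = []
    for tid, cur in after.items():
        prev = before.get(tid)
        if prev is None or prev["run_ns"] is None or cur["run_ns"] is None:
            continue
        if cur["state"].startswith("R") and cur["run_ns"] == prev["run_ns"]:
            stalled.append(tid)
    return stalled
```

The idea is to call snapshot_threads() twice for the VM's PID, some seconds apart, and pass both results to stalled_threads(). A thread that is briefly runnable but not yet scheduled would also show up, so a longer interval between the two snapshots makes the result more meaningful.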