thread stalls on broadwell systems


 



Hi,

We are encountering occasional thread stalls on Broadwell systems that are running CentOS 7 and one or more VMs. In the benign cases, our application threads stall for seconds, degrading the performance of our software. In the fatal cases, the system takes an MCE caused by what Intel refers to as a 3-strike timeout, which means the processor failed to retire an instruction in a timely fashion.

We wrote a debug utility that forces a real-time thread to be scheduled on each CPU every 100ms. On multiple occasions the utility detected that the real-time threads were not scheduled for seconds. On two of those occasions we also obtained kernel dumps, and the dumps reveal that a CPU was running in guest mode and had stopped handling IPIs for multiple seconds, blocking other runnable threads on its runqueue. Our software also periodically pauses the VMs, and we observed that the pause occasionally takes multiple seconds. This indicates that we cannot interrupt the CPU where the vCPU thread is running in a timely fashion, which is further evidence of a CPU stall.
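
For anyone who wants to try to reproduce this, the idea behind the debug utility can be sketched roughly as below. This is not our actual tool, only a minimal illustration: the names find_stalls, watch_cpu and main, the 1-second reporting threshold, and the SCHED_FIFO priority of 1 are all assumptions made for the example. It assumes Linux and needs root (or CAP_SYS_NICE) for the real-time policy to take effect; without that it silently falls back to normal scheduling. A production version would more likely be written in C, since the Python interpreter lock can add jitter of its own.

```python
import os
import threading
import time

PERIOD_S = 0.100   # wake-up interval described in the post (100 ms)
STALL_S = 1.0      # assumed reporting threshold; the post only says "seconds"


def find_stalls(wakeups, period=PERIOD_S, threshold=STALL_S):
    """Return (gap_start, lateness) for every gap between consecutive
    wake-ups that overran the expected period by more than threshold."""
    stalls = []
    for prev, cur in zip(wakeups, wakeups[1:]):
        late = (cur - prev) - period
        if late > threshold:
            stalls.append((prev, late))
    return stalls


def watch_cpu(cpu, duration_s, results):
    """Pin the calling thread to one CPU, request SCHED_FIFO, and record
    the monotonic time of each periodic wake-up."""
    try:
        # On Linux, pid 0 applies these calls to the calling thread.
        os.sched_setaffinity(0, {cpu})
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(1))
    except (AttributeError, OSError):
        pass  # not Linux, or not privileged: keep default scheduling
    wakeups = [time.monotonic()]
    deadline = wakeups[0] + duration_s
    while wakeups[-1] < deadline:
        time.sleep(PERIOD_S)
        wakeups.append(time.monotonic())
    results[cpu] = find_stalls(wakeups)


def main(duration_s=10.0):
    """Run one watcher thread per usable CPU and print any stalls found."""
    if hasattr(os, "sched_getaffinity"):
        cpus = sorted(os.sched_getaffinity(0))
    else:
        cpus = [0]
    results = {}
    threads = [threading.Thread(target=watch_cpu, args=(c, duration_s, results))
               for c in cpus]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for cpu in cpus:
        for start, late in results.get(cpu, []):
            print(f"cpu {cpu}: wake-up at t={start:.3f}s was {late:.3f}s late")
    return results
```

Calling main() as root watches every CPU for 10 seconds and prints one line per late wake-up. A thread that is pinned to a CPU stuck in guest mode would show up as a gap of several seconds on that CPU only.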

Our software has been installed on hundreds of systems, both Haswell and Broadwell, and this behavior is only observed on Broadwell. Has anyone seen anything like this? Any suggestions for debugging this problem further would be very much appreciated.

Lei




