On 03/12/2018 05:13 AM, Jiachen Yang via arch-general wrote:
> On 2018-03-12 11:19, David Rosenstrauch wrote:
>> My server's been exhibiting some very strange behavior lately. Every
>> couple of days I run into a situation where one core (core #0) on the
>> quad-core CPU starts continuously using around 34% of CPU, but I'm not
>> able to see (using htop) any process that's responsible for all that
>> CPU usage.
> Can you check whether you have enabled the "Detailed CPU time" option
> in htop's setup (F2 -> Display options -> "Detailed CPU time")?
> From my experience and understanding, htop's CPU meter accounts for
> IO-wait/IRQ time by default but does not show it separately unless you
> enable the "Detailed CPU time" option. That waiting time is also not
> attributed to any particular process or kernel thread. Enabling the
> option will reveal more detailed CPU usage information.
> High IO-wait or IRQ time can itself be an indication of misbehaving
> hardware, but at least you can be fairly sure it is not something more
> "dangerous" like malware or an attack.
Thanks for the suggestion. So this issue happened again tonight, and I
switched to "Detailed CPU time" to try to research it further.
According to htop, the CPU usage is from "irq" (orange color). I guess
that would also explain why I'm not seeing any process responsible.
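htop's CPU meters are fed by the per-CPU counters in /proc/stat, so the
same breakdown can be cross-checked there independently of htop's
display settings. A minimal sketch in Python, assuming the standard
proc(5) field order (user, nice, system, idle, iowait, irq, softirq):

    #!/usr/bin/env python3
    # Minimal sketch: sample the per-CPU counters in /proc/stat twice and
    # print how much of the interval each core spent in iowait, irq and
    # softirq -- the same fields htop's "Detailed CPU time" breaks out.
    import time

    def cpu_times():
        times = {}
        with open("/proc/stat") as f:
            for line in f:
                # per-CPU lines look like "cpu0 ...", skip the aggregate "cpu" line
                if line.startswith("cpu") and line[3].isdigit():
                    fields = line.split()
                    times[fields[0]] = [int(v) for v in fields[1:]]
        return times

    before = cpu_times()
    time.sleep(5)
    after = cpu_times()

    for cpu in sorted(after):
        delta = [n - o for n, o in zip(after[cpu], before[cpu])]
        total = sum(delta) or 1
        iowait, irq, softirq = delta[4], delta[5], delta[6]
        print("%s: iowait %4.1f%%  irq %4.1f%%  softirq %4.1f%%"
              % (cpu, 100.0 * iowait / total,
                 100.0 * irq / total, 100.0 * softirq / total))

Run for a few seconds while the problem is occurring, this should show
roughly the same irq share on the affected core that htop reports.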
It might also be related that I'm seeing these messages in dmesg:
[ 871.317377] perf: interrupt took too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 1732.773491] perf: interrupt took too long (3140 > 3132), lowering kernel.perf_event_max_sample_rate to 63000
[ 3375.392292] perf: interrupt took too long (3950 > 3925), lowering kernel.perf_event_max_sample_rate to 50000
So if this issue is IRQ-based, I guess that means some piece of hardware
is faulty or failing. Any idea how I might go about pinning down which
one? Would there be info about this in the kernel log, or something I
can look at in /proc?
Thanks,
DR
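Regarding the /proc question: /proc/interrupts keeps a per-CPU count for
every interrupt source, so diffing it over a short interval usually
shows which device is raising the interrupts. A rough sketch in Python,
assuming the usual /proc/interrupts layout and looking only at the CPU0
column since that is the busy core:

    #!/usr/bin/env python3
    # Rough sketch: sample /proc/interrupts twice and list the interrupt
    # sources whose CPU0 counter climbed the most in between.  The device
    # name at the end of each line usually identifies the hardware
    # responsible.
    import time

    def cpu0_counts():
        counts = {}
        with open("/proc/interrupts") as f:
            f.readline()                      # skip the "CPU0 CPU1 ..." header
            for line in f:
                parts = line.split()
                if len(parts) < 2 or not parts[1].isdigit():
                    continue                  # skip lines without a per-CPU count
                counts[parts[0].rstrip(":")] = (int(parts[1]), line.rstrip())
        return counts

    before = cpu0_counts()
    time.sleep(10)
    after = cpu0_counts()

    deltas = sorted(
        ((count - before.get(name, (0, ""))[0], text)
         for name, (count, text) in after.items()),
        reverse=True)

    print("interrupts on CPU0 during the last 10 seconds:")
    for delta, text in deltas[:10]:
        print("%10d  %s" % (delta, text))

If none of the hardware-interrupt counters stand out, the per-CPU
counters in /proc/softirqs are worth checking the same way.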