On Wed, Sep 5, 2018 at 8:34 AM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
>
> On 09/05/2018 02:01 AM, Greg Kroah-Hartman wrote:
> >> ---
> >> [ 9990.754641] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/5:1:155]
> >> [ 9990.762601] RIP: 0010:smp_call_function_many+0x208/0x270
> >> [ 9990.762601] Code: e8 0d d1 77 00 3b 05 cb f0 24 01 0f 83 86 fe ff ff 48 63 d0 49 8b 0c 24 48 03 0c d5 00 f7 11 a7 8b 51 18 83 e2 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c7 0f b6 4d d0 4c 89 f2 4c 89 ee 44 89

It's stuck in this loop:

  loop:
        pause
        mov    0x18(%rcx),%edx
        and    $0x1,%edx
        jne    loop

which is csd_lock_wait().

Judging by the offset in smp_call_function_many(), it's the final one
(there's two: the other one is part of "csd_lock()"). But that's just
a guess.

Anyway, it means that we're waiting for another CPU to finish
processing an IPI - either a previous one we sent asynchronously (if
it's the earlier csd_lock() case) or the TLB IPI we just sent and
we're waiting for completion of.

> Not tested, but I see it in v4.17.19 and in v4.18.6-rc2. Turns out it is
> related to heavy load, not to suspend/resume. At this point I suspect that
> it may be an AMD/Ryzen specific problem - it looks like it disappears if I
> add "kernel.randomize_va_space = 0" to /etc/sysctl.conf. No idea if it is a
> CPU bug or some AMD specific code problem. I'll try to analyze it further.

Ouch. Some IPI sending/receiving problem would be very very painful to
debug if it's hw related.

                Linus
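
For reference, the spin loop in the disassembly above corresponds roughly to the following user-space sketch. This is not the kernel source: the struct and the `*_sketch` names are hypothetical stand-ins, C11 atomics replace the kernel's own barrier primitives, and only the lock bit (bit 0, the one tested by "and $0x1,%edx") is modelled.

```c
#include <assert.h>
#include <stdatomic.h>

/* Bit 0 of the flags word: the bit tested by "and $0x1,%edx". */
#define CSD_FLAG_LOCK 0x01u

/* Hypothetical stand-in for the per-call data; only the flags word is modelled. */
struct csd_sketch {
	atomic_uint flags;
};

/* Spin-wait hint; on x86 this emits the "pause" (f3 90) seen in the dump. */
static inline void cpu_relax_sketch(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#endif
}

/* Spin until the other side clears the lock bit: the loop the CPU is stuck in. */
static void csd_lock_wait_sketch(struct csd_sketch *csd)
{
	while (atomic_load_explicit(&csd->flags, memory_order_acquire) & CSD_FLAG_LOCK)
		cpu_relax_sketch();
}

/* Sender side: wait for any previous use to finish, then mark the data busy. */
static void csd_lock_sketch(struct csd_sketch *csd)
{
	csd_lock_wait_sketch(csd);
	atomic_fetch_or_explicit(&csd->flags, CSD_FLAG_LOCK, memory_order_relaxed);
}

/* Receiver side: done with the IPI work, release the data to the sender. */
static void csd_unlock_sketch(struct csd_sketch *csd)
{
	assert(atomic_load(&csd->flags) & CSD_FLAG_LOCK);
	atomic_fetch_and_explicit(&csd->flags, ~CSD_FLAG_LOCK, memory_order_release);
}
```

If the receiving CPU never gets to the unlock step (because the IPI was lost or never handled), the wait loop never exits, which is what the soft lockup watchdog is reporting here.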