On Sun, Oct 13, 2019 at 4:19 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Sat, 2019-10-12 at 11:20 -0700, Robert LeBlanc wrote:
> > $ uname -a
> > Linux sun-gpu225 4.4.0-142-generic #168~14.04.1-Ubuntu SMP Sat Jan 19
> > 11:26:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
> >
>
> That's pretty old. I'm not sure how aggressively Canonical backports
> ceph patches.

Just trying to understand whether this may be fixed in a newer version,
but we also have to balance the NVidia drivers as well.

> > This was the best stack trace we could get. /proc was not helpful:
> > root@sun-gpu225:/proc/77292# cat stack
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
>
> A stack trace like the above generally means that the task is running in
> userland. The earlier stack trace you sent might just indicate that it
> was in the process of spinning on a lock when you grabbed the trace, but
> isn't actually stuck in the kernel.

I tried catting it multiple times, but it was always the same.

> > We did not get messages of hung tasks from the kernel. This container
> > was running for 9 days when the jobs should have completed in a matter
> > of hours. They were not able to stop the container, but it still was
> > using CPU. So it smells like uninterruptible sleep, but still using
> > CPU, which based on the trace looks like it's stuck in a spinlock.
> >
>
> That could be anything then, including userland bugs. What state was the
> process in (maybe grab /proc/<pid>/status if this happens again)?

We still have this box up.
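For anyone following along, a quick way to sample the stack repeatedly looks roughly like this (illustrative sketch only; `PID=$$` below is just a stand-in for the stuck task's pid, 77292 in this case, and reading another task's stack needs root):

```shell
# Sample /proc/<pid>/stack a few times. A task spinning in userland shows
# only 0xffffffffffffffff; one actually stuck in the kernel shows real frames.
PID=$$                      # substitute the stuck task's pid here
for i in 1 2 3; do
    cat "/proc/$PID/stack" 2>/dev/null || echo "(stack unreadable; needs root)"
    sleep 1
done
```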
Here is the output of status:

root@sun-gpu225:/proc/77292# cat status
Name:   offline_percept
State:  R (running)
Tgid:   77292
Ngid:   77986
Pid:    77292
PPid:   168913
TracerPid:      20719
Uid:    1000    1000    1000    1000
Gid:    1000    1000    1000    1000
FDSize: 256
Groups: 27 999
NStgid: 77292   2830
NSpid:  77292   2830
NSpgid: 169001  8
NSsid:  168913  1
VmPeak: 1094897144 kB
VmSize: 1094639324 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:   3512696 kB
VmRSS:   3121848 kB
VmData: 19331276 kB
VmStk:       144 kB
VmExe:       184 kB
VmLib:   1060628 kB
VmPTE:      8992 kB
VmPMD:        88 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
Threads:        1
SigQ:   3/3090620
SigPnd: 0000000000040100
ShdPnd: 0000000000000001
SigBlk: 0000000000001000
SigIgn: 0000000001001000
SigCgt: 00000001800044e8
CapInh: 00000000a80425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
Seccomp:        0
Speculation_Store_Bypass:       thread vulnerable
Cpus_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,ffffffff
Cpus_allowed_list:      0-31
Mems_allowed:   00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        6499
nonvoluntary_ctxt_switches:     28044102

> > Do you want me to get something more specific? Just tell me how.
>
> If you really think tasks are getting hung in the kernel, then you can
> crash the box and get a vmcore if you have kdump set up. With that we
> can analyze it and determine what it's doing.
>
> If you suspect ceph is involved then you might want to turn up dynamic
> debugging in the kernel and see what it's doing.

I looked in /sys/kernel/debug/ceph/, but wasn't sure how to turn up the
debugging in a way that would be beneficial. We don't have a crash
kernel loaded, so that won't be an option in this case.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
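[Editor's note, a sketch of the dynamic-debug step Jeff suggests: on kernels built with CONFIG_DYNAMIC_DEBUG, the verbosity of the kernel ceph client is toggled through debugfs rather than through the per-client directories under /sys/kernel/debug/ceph/. Run as root; the extra messages go to dmesg/syslog. Whether the 4.4-era Ubuntu kernel here has dynamic debug enabled is an assumption.]

```shell
# Enable pr_debug output for the ceph and libceph modules (needs root,
# debugfs mounted, and CONFIG_DYNAMIC_DEBUG).
echo 'module ceph +p'    > /sys/kernel/debug/dynamic_debug/control
echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control

# ...reproduce the hang and capture dmesg, then turn it back off:
echo 'module ceph -p'    > /sys/kernel/debug/dynamic_debug/control
echo 'module libceph -p' > /sys/kernel/debug/dynamic_debug/control
```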