On Sat, 2019-10-12 at 11:20 -0700, Robert LeBlanc wrote:
> On Fri, Oct 11, 2019 at 5:47 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > What kernel version is this? Do you happen to have a more readable
> > stack trace? Did this come from a hung task warning in the kernel?
>
> $ uname -a
> Linux sun-gpu225 4.4.0-142-generic #168~14.04.1-Ubuntu SMP Sat Jan 19
> 11:26:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

That's pretty old. I'm not sure how aggressively Canonical backports
ceph patches.

> This was the best stack trace we could get. /proc was not helpful:
>
> root@sun-gpu225:/proc/77292# cat stack
> [<ffffffffffffffff>] 0xffffffffffffffff

A stack trace like the above generally means that the task is running
in userland. The earlier stack trace you sent might just indicate that
the task was spinning on a lock at the moment you grabbed the trace,
not that it was actually stuck in the kernel.

> We did not get messages about hung tasks from the kernel. This
> container had been running for 9 days when the jobs should have
> completed in a matter of hours. They were not able to stop the
> container, but it was still using CPU. So it smells like
> uninterruptible sleep, but it's still using CPU, which based on the
> trace looks like it's stuck in a spinlock.

That could be anything then, including userland bugs. What state was
the process in? (Maybe grab /proc/<pid>/status if this happens again.)

> Do you want me to get something more specific? Just tell me how.

If you really think tasks are getting hung in the kernel, then you can
crash the box and get a vmcore if you have kdump set up. With that we
can analyze it and determine what it's doing.

If you suspect ceph is involved, then you might want to turn up dynamic
debugging in the kernel and see what it's doing.
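To be specific about what to grab there: the "State:" line in
/proc/<pid>/status is the interesting bit ("R" is running, "S" is
interruptible sleep, "D" is uninterruptible sleep). Using the pid from
your trace above just as an example:

# grep -E '^(Name|State)' /proc/77292/status

It's also worth catting /proc/<pid>/stack several times in a row. If
the trace changes between reads, the task isn't actually stuck in the
kernel, it's just busy.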
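For the vmcore route: assuming kdump is set up (on Ubuntu I believe
that's the linux-crashdump package, which also adds crashkernel= to the
kernel command line), you can panic the box by hand via sysrq the next
time it wedges, and the vmcore should land under /var/crash. Note that
this does take the machine down:

# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger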
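For the ceph dynamic debugging: assuming your kernel was built with
CONFIG_DYNAMIC_DEBUG and debugfs is mounted, you can turn on the debug
messages in the cephfs client and the messenger with something like:

# echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control
# echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control

That's extremely chatty, so flip the '+p' to '-p' to turn it back off
once you've captured the problem. The output goes to the kernel's ring
buffer (dmesg).
-- 
Jeff Layton <jlayton@xxxxxxxxxx>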