Re: OSD in uninterruptible sleep

Gregory Farnum <greg@xxxxxxxxxxx> · Fri, 21 Nov 2014 13:19:09 -0600



On Fri, Nov 21, 2014 at 4:56 AM, Jon Kåre Hellan
<jon.kare.hellan@xxxxxxxxxx> wrote:
> We are testing a Giant cluster - on virtual machines for now. We have seen
> the same problem two nights in a row: One of the OSDs gets stuck in
> uninterruptible sleep. The only way to get rid of it is apparently to
> reboot - kill -9, -11 and -15 have all been tried.
>
> The monitor apparently believes it is gone, because every 30 minutes we see
> in the log:
>   lock_fsid failed to lock /var/lib/ceph/osd/ceph-1/fsid, is another ceph-osd still running? (11) Resource temporarily unavailable
> We interpret this as an attempt to start a new instance.
>
> There is a pastebin of the osd log from the night before last in:
> http://pastebin.com/Y42GvGjr
> Pastebin of syslog from last evening: http://pastebin.com/7riNWRsy
> The pid of the stuck OSD is 4222. syslog has call traces of pids 4405, 4406,
> 4435, 4436, which have been blocked for > 120 s.
>
> What can we do to get to the bottom of this?

So, the OSD log you pasted includes a backtrace of an assert failure
from the internal heartbeating, indicating that some threads went off
and never came back (these are probably the threads making the
syscalls that syslog is reporting on). The assert fired, so the OSD
should be gone by now: it triggers an unfriendly coredump and
termination. The only thing I can think of is that the system calls
responsible for dumping the core have *also* failed in a way we
haven't seen before, so the process never actually terminated.
In any case, it's definitely related to your disks being too slow and
the OS not handling it appropriately; I'd look at why the kernel is
getting stuck. The specific backtraces there aren't familiar to me,
but maybe somebody else has seen them.
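
One possible way to start looking at where the kernel is stuck is to
list every thread in uninterruptible sleep (state D) and see what it is
waiting on. The following is only a minimal sketch, not something taken
from the thread: it assumes a Linux /proc filesystem, the helper names
(parse_stat, read_text, blocked_tasks) are made up for illustration, and
reading the per-thread "stack" file usually needs root and a kernel
built with stack tracing, so that field may come back empty.

```python
import os


def parse_stat(text):
    """Split the contents of a /proc/<pid>/stat file into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so use the first '(' and the last ')' as delimiters.
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar])
    comm = text[lpar + 1:rpar]
    state = text[rpar + 1:].split()[0]
    return pid, comm, state


def read_text(path):
    """Return the stripped contents of path, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def blocked_tasks(proc="/proc"):
    """Return (pid, tid, comm, wchan, stack) for every thread in state D."""
    found = []
    for pid in filter(str.isdigit, os.listdir(proc)):
        task_dir = os.path.join(proc, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            base = os.path.join(task_dir, tid)
            stat = read_text(os.path.join(base, "stat"))
            if stat is None:
                continue
            _, comm, state = parse_stat(stat)
            if state == "D":
                found.append((
                    int(pid),
                    int(tid),
                    comm,
                    read_text(os.path.join(base, "wchan")),  # kernel wait symbol
                    read_text(os.path.join(base, "stack")),  # kernel stack, if readable
                ))
    return found


if __name__ == "__main__":
    for pid, tid, comm, wchan, stack in blocked_tasks():
        print(f"pid={pid} tid={tid} comm={comm} wchan={wchan}")
        if stack:
            print(stack)
```

Run as root on the OSD host while the process is stuck. Writing "w" to
/proc/sysrq-trigger (where sysrq is enabled) asks the kernel to log the
blocked tasks itself, which may give similar information in syslog.
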
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com