On Fri, Nov 21, 2014 at 4:56 AM, Jon Kåre Hellan <jon.kare.hellan@xxxxxxxxxx> wrote:
> We are testing a Giant cluster - on virtual machines for now. We have seen
> the same problem two nights in a row: One of the OSDs gets stuck in
> uninterruptible sleep. The only way to get rid of it is apparently to
> reboot - kill -9, -11 and -15 have all been tried.
>
> The monitor apparently believes it is gone, because every 30 minutes we see
> in the log:
>   lock_fsid failed to lock /var/lib/ceph/osd/ceph-1/fsid, is another
>   ceph-osd still running? (11) Resource temporarily unavailable
> We interpret this as an attempt to start a new instance.
>
> There is a pastebin of the osd log from the night before last in:
> http://pastebin.com/Y42GvGjr
> Pastebin of syslog from last evening: http://pastebin.com/7riNWRsy
> The pid of the stuck OSD is 4222. syslog has call traces of pids 4405, 4406,
> 4435, 4436, which have been blocked for > 120 s.
>
> What can we do to get to the bottom of this?

So, the OSD log you pasted includes a backtrace of an assert failure from the internal heartbeating, indicating that some threads went off and never came back (these are probably the threads making the syscalls that syslog is reporting on). It asserted and the OSD should be gone now since it triggers an unfriendly coredump and termination. The only thing I can think of is that maybe the system calls responsible for dumping the core out have *also* failed in a way we haven't seen before and so nothing's terminated.

In any case, it's definitely related to your disks being too slow and the OS not handling it appropriately; I'd look at why the kernel is getting stuck. The specific backtraces there aren't familiar to me, but maybe somebody else has seen them.
-Greg
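
As a starting point for looking at where the kernel is stuck, here is a minimal sketch (not something from the thread or from Ceph itself) that lists every thread currently in uninterruptible sleep. It assumes a Linux /proc layout: state "D" in /proc/<pid>/task/<tid>/stat marks uninterruptible sleep, and the wchan file next to it names the kernel function the thread is waiting in when the kernel exposes it. The function names parse_stat and blocked_tasks are made up for this example.

```python
import glob
import os


def parse_stat(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Return sorted (tid, comm, wchan) tuples for every thread in state D."""
    pattern = os.path.join(proc_root, "[0-9]*", "task", "[0-9]*", "stat")
    found = []
    for stat_path in glob.glob(pattern):
        try:
            with open(stat_path) as f:
                tid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # thread exited mid-scan, or the line was unparsable
        if state != "D":
            continue
        wchan = "?"  # placeholder when wchan is missing or unreadable
        try:
            with open(os.path.join(os.path.dirname(stat_path), "wchan")) as f:
                wchan = f.read().strip() or "?"
        except OSError:
            pass
        found.append((tid, comm, wchan))
    return sorted(found)


if __name__ == "__main__":
    for tid, comm, wchan in blocked_tasks():
        print(f"{tid}\t{comm}\t{wchan}")
```

Running it a few times while the OSD is hung shows whether the same thread IDs stay in state D and, where wchan is available, which kernel function they are waiting in. That can then be compared with the call traces in syslog.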