We are testing a Ceph Giant cluster - on virtual machines for now. We have seen the same problem two nights in a row: one of the OSDs gets stuck in uninterruptible sleep. The only way to get rid of it is apparently to reboot the VM; kill -9, -11 and -15 have all been tried without effect.
The monitor apparently believes it is gone, because every 30 minutes we see this in the log:

lock_fsid failed to lock /var/lib/ceph/osd/ceph-1/fsid, is another ceph-osd still running? (11) Resource temporarily unavailable

We interpret this as an attempt to start a new instance of the OSD.
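If I read that message right, lock_fsid takes an advisory lock on the fsid file and the new instance gets EAGAIN because the stuck process still holds it. A quick way to confirm from another shell would be something like the following (a sketch; it assumes the lock really is a POSIX fcntl lock rather than flock, and uses the osd.1 path from the message above):

import errno, fcntl

# Try to take, non-blocking, the same kind of lock a new ceph-osd
# instance would take on the fsid file. EAGAIN/EACCES means the stuck
# pid 4222 still holds it; success means the lock has been released.
with open("/var/lib/ceph/osd/ceph-1/fsid", "r+") as f:
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("fsid file is not locked any more")
        fcntl.lockf(f, fcntl.LOCK_UN)
    except IOError as e:
        if e.errno in (errno.EAGAIN, errno.EACCES):
            print("fsid file is still locked by the stuck osd")
        else:
            raise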
There is a pastebin of the OSD log from the night before last at http://pastebin.com/Y42GvGjr and a pastebin of the syslog from last evening at http://pastebin.com/7riNWRsy.
The PID of the stuck OSD is 4222. The syslog has call traces for PIDs 4405, 4406, 4435 and 4436, which had been blocked for more than 120 s.
What can we do to get to the bottom of this?
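For what it's worth, this is roughly what we can run the next time it happens, to capture the kernel side of the stuck process (a sketch, run as root on the OSD node; pid 4222 is the stuck ceph-osd, and /proc/<pid>/stack assumes the kernel exposes stack traces):

#!/usr/bin/env python
# Sketch: dump the scheduler state, wait channel and kernel stack of the
# stuck ceph-osd (pid 4222 above). Run as root on the OSD node.
PID = 4222

def read_proc(name):
    path = "/proc/%d/%s" % (PID, name)
    try:
        with open(path) as f:
            return f.read().strip()
    except IOError as e:
        return "could not read %s: %s" % (path, e)

# Name: and State: - State shows 'D (disk sleep)' for uninterruptible sleep
for line in read_proc("status").splitlines():
    if line.startswith(("Name:", "State:")):
        print(line)

# Kernel function the task is currently sleeping in
print("wchan: " + read_proc("wchan"))

# Full kernel stack of the task (main thread only; see note below)
print(read_proc("stack"))

Since ceph-osd is heavily threaded, the interesting thread is probably one of the tids under /proc/4222/task/, so the same files there may be worth dumping as well. echo w > /proc/sysrq-trigger also writes every task in uninterruptible sleep to the kernel log, which should catch anything the hung-task messages in syslog missed.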
Context: This is a test cluster to evaluate Ceph. There are 3 monitor VMs, 3 OSD VMs each running 2 OSDs, 1 MDS VM and 1 radosgw VM. The VMs run Debian Wheezy under Hyper-V, and the OSD storage is XFS on virtual disks. The test load was a Linux kernel compilation with the tree in CephFS - silly, I know, but we needed a test load. We do not intend to use CephFS in production, and we would obviously use physical OSD nodes if we decided to deploy Ceph in production.
Jon
Jon Kåre Hellan, UNINETT AS, Trondheim, Norway