We are testing a Ceph Giant cluster - on virtual machines for now. We have seen the same problem two nights in a row: one of the OSDs gets stuck in uninterruptible sleep. The only way to get rid of it is apparently to reboot the VM; kill -9, -11 and -15 have all been tried without effect.
The monitor apparently believes it is gone, because every 30 minutes we see this in the log:

lock_fsid failed to lock /var/lib/ceph/osd/ceph-1/fsid, is another ceph-osd still running? (11) Resource temporarily unavailable

We interpret this as an attempt to start a new instance of the OSD.
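If I read that message right, lock_fsid takes an advisory lock on the fsid file and the new instance gets EAGAIN because the stuck process still holds it. A quick way to confirm from another shell would be something like the following (a sketch; it assumes the lock really is a POSIX fcntl lock rather than flock, and uses the osd.1 path from the message above):

import errno, fcntl

# Try to take, non-blocking, the same kind of lock a new ceph-osd
# instance would take on the fsid file. EAGAIN/EACCES means the stuck
# pid 4222 still holds it; success means the lock has been released.
with open("/var/lib/ceph/osd/ceph-1/fsid", "r+") as f:
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("fsid file is not locked any more")
        fcntl.lockf(f, fcntl.LOCK_UN)
    except IOError as e:
        if e.errno in (errno.EAGAIN, errno.EACCES):
            print("fsid file is still locked by the stuck osd")
        else:
            raise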
There is a pastebin of the OSD log from the night before last at http://pastebin.com/Y42GvGjr and a pastebin of the syslog from last evening at http://pastebin.com/7riNWRsy.
The PID of the stuck OSD is 4222. The syslog has call traces for PIDs 4405, 4406, 4435 and 4436, which had been blocked for more than 120 s.
What can we do to get to the bottom of this?
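For what it's worth, this is roughly what we can run the next time it happens, to capture the kernel side of the stuck process (a sketch, run as root on the OSD node; pid 4222 is the stuck ceph-osd, and /proc/<pid>/stack assumes the kernel exposes stack traces):

#!/usr/bin/env python
# Sketch: dump the scheduler state, wait channel and kernel stack of the
# stuck ceph-osd (pid 4222 above). Run as root on the OSD node.
PID = 4222

def read_proc(name):
    path = "/proc/%d/%s" % (PID, name)
    try:
        with open(path) as f:
            return f.read().strip()
    except IOError as e:
        return "could not read %s: %s" % (path, e)

# Name: and State: - State shows 'D (disk sleep)' for uninterruptible sleep
for line in read_proc("status").splitlines():
    if line.startswith(("Name:", "State:")):
        print(line)

# Kernel function the task is currently sleeping in
print("wchan: " + read_proc("wchan"))

# Full kernel stack of the task (main thread only; see note below)
print(read_proc("stack"))

Since ceph-osd is heavily threaded, the interesting thread is probably one of the tids under /proc/4222/task/, so the same files there may be worth dumping as well. echo w > /proc/sysrq-trigger also writes every task in uninterruptible sleep to the kernel log, which should catch anything the hung-task messages in syslog missed.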
Context: This is a test cluster to evaluate Ceph. There are 3 monitor VMs, 3 OSD VMs each running 2 OSDs, 1 MDS VM and 1 radosgw VM. The VMs run Debian Wheezy under Hyper-V, and the OSD storage is XFS on virtual disks. The test load was a Linux kernel compilation with the tree in CephFS - silly, I know, but we needed a test load. We do not intend to use CephFS in production, and we would obviously use physical OSD nodes if we decided to deploy Ceph in production.
Jon
Jon Kåre Hellan, UNINETT AS, Trondheim, Norway