OSD in uninterruptible sleep

We are testing a Giant cluster, on virtual machines for now. We have seen the same problem two nights in a row: one of the OSDs gets stuck in uninterruptible sleep. The only way to get rid of it is apparently to reboot; kill -9, -11 and -15 have all been tried.
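
For reference, here is a minimal sketch of how processes in uninterruptible sleep (state D) could be listed by reading /proc/<pid>/stat on Linux. The function names parse_stat and d_state_processes are made up for this illustration, and the script only reports the state; it does not explain why a process is blocked.

```python
import os


def parse_stat(line):
    """Parse the start of a /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so we split on the first '(' and the last ')'.
    """
    lpar = line.index("(")
    rpar = line.rindex(")")
    pid = int(line[:lpar])
    comm = line[lpar + 1:rpar]
    state = line[rpar + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc="/proc"):
    """Return a list of (pid, comm) for processes currently in state D."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited while scanning, or unreadable entry
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in sorted(d_state_processes()):
        print(pid, comm)
```

A process that stays in this list across repeated runs is blocked inside the kernel, which would be consistent with signals such as SIGKILL having no visible effect.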

The monitor apparently believes it is gone, because every 30 minutes we see in the log:
  lock_fsid failed to lock /var/lib/ceph/osd/ceph-1/fsid, is another ceph-osd still running? (11) Resource temporarily unavailable
We interpret this as an attempt to start a new instance.
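
To illustrate that reading, here is a minimal sketch of how a second attempt to take an exclusive, non-blocking lock on a file fails with errno 11 (EAGAIN on Linux, "Resource temporarily unavailable") while the first holder is still alive. It uses flock() purely for demonstration; this is an assumption and not necessarily the locking call that ceph-osd itself uses. The helper try_lock and the temporary file are hypothetical.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking lock on path.

    Returns (fd, None) on success, or (None, errno) if the lock is
    already held through another open file description.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        os.close(fd)
        return None, e.errno
    return fd, None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "fsid")  # stand-in for the real fsid file
        first, _ = try_lock(path)       # plays the role of the stuck OSD
        second, err = try_lock(path)    # plays the role of the restart attempt
        print("second attempt:", second, err, os.strerror(err))
        os.close(first)
```

As long as the first holder cannot exit, every later attempt would fail the same way, which would match the message repeating every 30 minutes.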

There is a pastebin of the osd log from the night before last in: http://pastebin.com/Y42GvGjr
Pastebin of syslog from last evening: http://pastebin.com/7riNWRsy
The pid of the stuck OSD is 4222. The syslog has call traces for pids 4405, 4406, 4435 and 4436, which have been blocked for > 120 s.

What can we do to get to the bottom of this?

Context: This is a test cluster to evaluate Ceph. There are 3 monitor VMs, 3 OSD VMs each running 2 OSDs, 1 MDS VM and 1 radosgw VM. The VMs are running Debian Wheezy under Hyper-V. OSD storage is xfs on virtual disks. The test load was a Linux kernel compilation with the tree in cephfs. Silly, I know, but we needed a test load. We do not intend to use cephfs in production. Obviously, we would use physical OSD nodes if we were to decide to deploy Ceph in production.

Jon
Jon Kåre Hellan, UNINETT AS, Trondheim, Norway
