vm fs corrupt after pgs stuck

James Harper <james.harper@xxxxxxxxxxxxxxxx> · Thu, 2 Jan 2014 21:14:02 +0000

I just had to restore an MS Exchange database after a Ceph hiccup (no actual data lost; Exchange is very good like that with its no-loss restore!). The order of events went something like:

. Loss of connection from an OSD server to the cluster network (public network was okay)
. PGs reported stuck
. Stopped the OSD on the bad server
. Resolved the network problem
. Restarted the OSD on the bad server
. Noticed that the VM running Exchange had hung
. Rebooted the VM, which ran chkdsk automatically
. Exchange refused to mount the main mailbox store

I'm not using RBD caching or anything, so for NTFS to lose files like that means something fairly nasty happened. My best guess is that the loss of connectivity, and of function while Ceph was figuring out what was going on, meant that Windows I/O was frozen and started timing out, but I still can't see how that could result in corruption.
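
To make that guess concrete, here is a rough sketch of the timing argument only; it is not a description of what Ceph or Windows actually did. The 60-second guest disk timeout and the stall lengths are assumptions chosen for illustration, and `io_outcome` is a hypothetical helper, not part of any Ceph or Windows API.

```python
# Toy model of the guess above: a guest write blocks while the cluster
# sorts itself out, and the guest gives up once its disk timeout expires.
# All numbers are illustrative assumptions, not measured values.

GUEST_DISK_TIMEOUT_S = 60  # assumed guest-side disk I/O timeout


def io_outcome(blocked_for_s, timeout_s=GUEST_DISK_TIMEOUT_S):
    """Return 'completed' if the blocked I/O finishes before the guest
    timeout expires, otherwise 'timed out'."""
    return "completed" if blocked_for_s < timeout_s else "timed out"


if __name__ == "__main__":
    # Hypothetical stall lengths in seconds.
    for blocked in (5, 30, 120, 600):
        print(f"I/O blocked for {blocked:>3}s -> {io_outcome(blocked)}")
```

Even in this model a timed-out write is just an error returned to the guest, which is why I still can't see where the corruption came from.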

Any suggestions on how I could avoid this situation in the future would be greatly appreciated!

Thanks

James
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com