Re: vm fs corrupt after pgs stuck

James Harper <james.harper@xxxxxxxxxxxxxxxx> · Thu, 2 Jan 2014 21:40:35 +0000

> 
> I just had to restore an MS Exchange database after a Ceph hiccup (no actual
> data lost - Exchange is very good like that with its no-loss restore!). The order
> of events went something like:
> 
> . Loss of connection from an OSD to the cluster network (public network was okay)
> . PGs reported stuck
> . Stopped the OSD on the bad server
> . Resolved the network problem
> . Restarted the OSD on the bad server
> . Noticed that the VM running Exchange had hung
> . Rebooted, and the VM did a chkdsk automatically
> . Exchange refused to mount the main mailbox store
> 
> I'm not using RBD caching or anything, so for NTFS to lose files like that,
> something fairly nasty must have happened. My best guess is that the loss of
> connectivity and function while Ceph was figuring out what was going on
> meant that Windows I/O was frozen and started timing out, but I still can't see
> how that could result in corruption.
> 
> Any suggestions on how I could avoid this situation in the future would be
> greatly appreciated!
> 

Forgot to mention: this has also happened once before, when the OOM killer targeted ceph-osd.
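
In case it helps anyone trying to catch the "pgs reported stuck" stage before a guest starts timing out, here is a rough sketch of pulling the stuck PGs out of saved `ceph health detail` text. The line shape it matches ("pg <id> is stuck <state> ...") is an assumption and may differ between Ceph versions, the sample text is made up for illustration, and `stuck_pgs` is just a hypothetical helper name, not anything shipped with Ceph.

```python
import re

# Assumed line shape from `ceph health detail` (may vary by Ceph version), e.g.
#   pg 2.1f is stuck unclean for 302.5, current state active+degraded, last acting [3,1]
STUCK_RE = re.compile(r"^pg (\S+) is stuck (\w+)")


def stuck_pgs(health_text):
    """Return {pgid: stuck_state} for every 'pg ... is stuck ...' line."""
    found = {}
    for line in health_text.splitlines():
        m = STUCK_RE.match(line.strip())
        if m:
            found[m.group(1)] = m.group(2)
    return found


# Made-up sample text, for illustration only (not real cluster output).
SAMPLE = """\
HEALTH_WARN 2 pgs stuck unclean
pg 2.1f is stuck unclean for 302.5, current state active+degraded, last acting [3,1]
pg 0.4 is stuck inactive for 310.2, current state peering, last acting [3]
"""

if __name__ == "__main__":
    for pgid, state in stuck_pgs(SAMPLE).items():
        print(pgid, state)
```

Feeding it the captured output on a timer and alerting when the result is non-empty would at least flag the problem while the VM is still only hung, though it does nothing to explain the corruption itself.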

James
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com