Re: Virtual machines using RBD remount read-only on OSD slow requests


 



Hi,
try mounting your filesystems with the errors=continue option.


From the mount(8) man page:

errors={continue|remount-ro|panic}
    Define the behaviour when an error is encountered. (Either ignore
    errors and just mark the filesystem erroneous and continue, or
    remount the filesystem read-only, or panic and halt the system.)
    The default is set in the filesystem superblock, and can be changed
    using tune2fs(8).
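For example, the option can be set per mount in /etc/fstab, or made the
superblock default with tune2fs. (Just a sketch; the device and mount
point are placeholders, adjust them to your guests.)

    # /etc/fstab entry for the guest's root filesystem
    /dev/vda1  /  ext4  defaults,errors=continue  0  1

    # or change the default stored in the superblock
    tune2fs -e continue /dev/vda1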




----- Original Message ----- 

From: "Paulo Almeida" <palmeida@xxxxxxxxxxxxxxxxx> 
To: ceph-users@xxxxxxxxxxxxxx 
Sent: Monday, 24 November 2014 17:06:40 
Subject: Virtual machines using RBD remount read-only on OSD slow requests 

Hi, 

I have a Ceph cluster with 4 disk servers, 14 OSDs and a replica size of 
3. A number of KVM virtual machines use RBD as their only storage 
device. Whenever some OSDs (always on a single server) have slow 
requests, caused, I believe, by flaky hardware or, on one occasion, by an 
S.M.A.R.T. command that crashed the system disk of one of the disk 
servers, most virtual machines remount their disks read-only and need to 
be rebooted. 

One of the virtual machines still has Debian 6 installed, and it never 
crashes. It also has an ext3 filesystem, unlike some other machines, 
which have ext4. ext3 does crash on systems with Debian 7, but those 
have different mount flags, such as "barrier" and "data=ordered". 
I suspect (but haven't tested) that using "nobarrier" might solve the 
problem, but that doesn't seem like an ideal solution. 
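(For reference, the active mount flags and the error behaviour stored in 
the superblock can be compared between guests like this; /dev/vda1 is 
just an example device name.) 

    # active mount options for the root filesystem
    grep ' / ' /proc/mounts

    # error behaviour recorded in the ext3/ext4 superblock
    tune2fs -l /dev/vda1 | grep -i 'errors behavior'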

Most of those machines have Debian 7 or Ubuntu 12.04, but two of them 
have Ubuntu 14.04 (and thus a more recent kernel) and they also remount 
read-only. 

I searched the mailing list and found a couple of relevant messages. One 
person seemed to have the same problem[1], but someone else replied that 
it didn't happen in his case ("I've had multiple VMs hang for hours at a 
time when I broke a Ceph cluster and after fixing it the VMs would start 
working again"). The other message[2] is not very informative. 

Are other people experiencing this problem? Is there a file system or 
kernel version that is recommended for KVM guests that would prevent it? 
Or does this problem indicate that something else is wrong and should be 
fixed? I did configure all machines to use "cache=writeback", but never 
investigated whether that makes a difference or even whether it is 
actually working. 
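(In case it matters, a QEMU RBD drive with writeback caching is 
specified roughly like this; the pool/image names and config path are 
just placeholders for my setup.) 

    qemu-system-x86_64 ... \
      -drive file=rbd:rbd/vm-disk:conf=/etc/ceph/ceph.conf,format=raw,if=virtio,cache=writeback

    # with libvirt-managed guests, the effective cache mode shows up in:
    virsh dumpxml <domain> | grep -i cache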

Thanks, 
Paulo Almeida 
Instituto Gulbenkian de Ciência, Oeiras, Portugal 


[1] http://thread.gmane.org/gmane.comp.file-systems.ceph.user/8011 
[2] http://thread.gmane.org/gmane.comp.file-systems.ceph.user/1742 

_______________________________________________ 
ceph-users mailing list 
ceph-users@xxxxxxxxxxxxxx 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 




