Hi Alexandre,

Thanks for your suggestion. I also considered using errors=continue, in
line with the nobarrier idea, but I was afraid it might lead to silent
corruption on errors not caused by slow requests on OSDs. I was hoping
for a solution that would specifically tolerate slowness of the block
device while still reacting to other errors, but I may have to accept
the lesser evil or keep investigating.

Cheers,
Paulo

On Mon, 2014-11-24 at 20:14 +0100, Alexandre DERUMIER wrote:
> Hi,
> try to mount your filesystems with the errors=continue option.
>
> From the mount(8) man page:
>
> errors={continue|remount-ro|panic}
>     Define the behaviour when an error is encountered. (Either ignore
>     errors and just mark the filesystem erroneous and continue, or
>     remount the filesystem read-only, or panic and halt the system.)
>     The default is set in the filesystem superblock, and can be
>     changed using tune2fs(8).
>
> ----- Original Message -----
>
> From: "Paulo Almeida" <palmeida@xxxxxxxxxxxxxxxxx>
> To: ceph-users@xxxxxxxxxxxxxx
> Sent: Monday, 24 November 2014 17:06:40
> Subject: Virtual machines using RBD remount read-only on OSD slow requests
>
> Hi,
>
> I have a Ceph cluster with 4 disk servers, 14 OSDs and a replica size
> of 3. A number of KVM virtual machines are using RBD as their only
> storage device. Whenever some OSDs (always on a single server) have
> slow requests, caused, I believe, by flaky hardware or, on one
> occasion, by a S.M.A.R.T. command that crashed the system disk of one
> of the disk servers, most virtual machines remount their disk
> read-only and need to be rebooted.
>
> One of the virtual machines still has Debian 6 installed, and it never
> crashes. It also has an ext3 filesystem, contrary to some other
> machines, which have ext4. ext3 does crash on systems with Debian 7,
> but those have different mount flags, such as "barrier" and
> "data=ordered".
> I suspect (but haven't tested) that using "nobarrier" may solve the
> problem, but that doesn't seem to be an ideal solution.
>
> Most of those machines have Debian 7 or Ubuntu 12.04, but two of them
> have Ubuntu 14.04 (and thus a more recent kernel) and they also
> remount read-only.
>
> I searched the mailing list and found a couple of relevant messages.
> One person seemed to have the same problem[1], but someone else
> replied that it didn't happen in his case ("I've had multiple VMs hang
> for hours at a time when I broke a Ceph cluster and after fixing it
> the VMs would start working again"). The other message[2] is not very
> informative.
>
> Are other people experiencing this problem? Is there a file system or
> kernel version that is recommended for KVM guests that would prevent
> it? Or does this problem indicate that something else is wrong and
> should be fixed? I did configure all machines to use
> "cache=writeback", but never investigated whether that makes a
> difference or even whether it is actually working.
>
> Thanks,
> Paulo Almeida
> Instituto Gulbenkian de Ciência, Oeiras, Portugal
>
> [1] http://thread.gmane.org/gmane.comp.file-systems.ceph.user/8011
> [2] http://thread.gmane.org/gmane.comp.file-systems.ceph.user/1742

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
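For anyone wanting to try the suggestion above: per mount(8) and
tune2fs(8), the error behaviour can be set either per mount (e.g. in the
guest's /etc/fstab) or as the default stored in the superblock. A
minimal sketch, assuming an ext4 root filesystem on /dev/vda1 (the
device name and mount point are placeholders, not from the thread):

```
# /etc/fstab (inside the guest): on filesystem errors, mark the
# filesystem erroneous and continue, instead of remounting read-only.
# Device and mount point are examples only.
/dev/vda1  /  ext4  defaults,errors=continue  0  1

# Alternatively, change the default kept in the superblock so it
# applies regardless of mount options:
#   tune2fs -e continue /dev/vda1
```

Note that, as Paulo points out, errors=continue trades the read-only
safety net for availability, so it can mask genuine corruption rather
than just tolerating slow OSD requests.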