Hi Alexandre,

Thanks for your suggestion. I also considered using errors=continue, in
line with the nobarrier idea, but I was afraid it might lead to silent
corruption on errors not caused by slow requests on OSDs. I was hoping
for a solution that would specifically tolerate slowness of the block
device while still reacting to other errors, but I may have to accept
the lesser evil or keep investigating.

Cheers,
Paulo

On Mon, 2014-11-24 at 20:14 +0100, Alexandre DERUMIER wrote:
> Hi,
> try to mount your filesystems with the errors=continue option.
>
> From the mount(8) man page:
>
> errors={continue|remount-ro|panic}
>     Define the behaviour when an error is encountered. (Either ignore
>     errors and just mark the filesystem erroneous and continue, or
>     remount the filesystem read-only, or panic and halt the system.)
>     The default is set in the filesystem superblock, and can be
>     changed using tune2fs(8).
>
> ----- Original Message -----
>
> From: "Paulo Almeida" <palmeida@xxxxxxxxxxxxxxxxx>
> To: ceph-users@xxxxxxxxxxxxxx
> Sent: Monday, 24 November 2014 17:06:40
> Subject: Virtual machines using RBD remount read-only on OSD slow requests
>
> Hi,
>
> I have a Ceph cluster with 4 disk servers, 14 OSDs and a replica size
> of 3. A number of KVM virtual machines are using RBD as their only
> storage device. Whenever some OSDs (always on a single server) have
> slow requests, caused, I believe, by flaky hardware or, on one
> occasion, by a S.M.A.R.T. command that crashed the system disk of one
> of the disk servers, most virtual machines remount their disk
> read-only and need to be rebooted.
>
> One of the virtual machines still has Debian 6 installed, and it never
> crashes. It also has an ext3 filesystem, contrary to some other
> machines, which have ext4. ext3 does crash on systems with Debian 7,
> but those have different mount flags, such as "barrier" and
> "data=ordered".
> I suspect (but haven't tested) that using "nobarrier" may solve the
> problem, but that doesn't seem to be an ideal solution.
>
> Most of those machines have Debian 7 or Ubuntu 12.04, but two of them
> have Ubuntu 14.04 (and thus a more recent kernel) and they also
> remount read-only.
>
> I searched the mailing list and found a couple of relevant messages.
> One person seemed to have the same problem[1], but someone else
> replied that it didn't happen in his case ("I've had multiple VMs hang
> for hours at a time when I broke a Ceph cluster and after fixing it
> the VMs would start working again"). The other message[2] is not very
> informative.
>
> Are other people experiencing this problem? Is there a file system or
> kernel version that is recommended for KVM guests that would prevent
> it? Or does this problem indicate that something else is wrong and
> should be fixed? I did configure all machines to use
> "cache=writeback", but never investigated whether that makes a
> difference or even whether it is actually working.
>
> Thanks,
> Paulo Almeida
> Instituto Gulbenkian de Ciência, Oeiras, Portugal
>
> [1] http://thread.gmane.org/gmane.comp.file-systems.ceph.user/8011
> [2] http://thread.gmane.org/gmane.comp.file-systems.ceph.user/1742

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
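For anyone wanting to try the suggestion above: per mount(8) and
tune2fs(8), the error behaviour can be set either per mount (e.g. in the
guest's /etc/fstab) or as the default stored in the superblock. A
minimal sketch, assuming an ext4 root filesystem on /dev/vda1 (the
device name and mount point are placeholders, not from the thread):

```
# /etc/fstab (inside the guest): on filesystem errors, mark the
# filesystem erroneous and continue, instead of remounting read-only.
# Device and mount point are examples only.
/dev/vda1  /  ext4  defaults,errors=continue  0  1

# Alternatively, change the default kept in the superblock so it
# applies regardless of mount options:
#   tune2fs -e continue /dev/vda1
```

Note that, as Paulo points out, errors=continue trades the read-only
safety net for availability, so it can mask genuine corruption rather
than just tolerating slow OSD requests.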