>>Can this timeout be increased in some way? I've searched around and found
>>the /sys/block/sdx/device/timeout knob, which in my case is set to 30s.

Yes, sure:

    echo 60 > /sys/block/sdx/device/timeout

for 60 s, for example.

----- Original Message -----
From: "Krzysztof Nowicki" <krzysztof.a.nowicki@xxxxxxxxx>
To: "Andrey Korolyov" <andrey@xxxxxxx>, "aderumier" <aderumier@xxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Friday, 13 February 2015 08:18:26
Subject: Re: OSD slow requests causing disk aborts in KVM

On Thu, Feb 12, 2015 at 16:23:38, Andrey Korolyov <andrey@xxxxxxx> wrote:

On Fri, Feb 6, 2015 at 12:16 PM, Krzysztof Nowicki
<krzysztof.a.nowicki@xxxxxxxxx> wrote:
> Hi all,
>
> I'm running a small Ceph cluster with 4 OSD nodes, which serves as a
> storage backend for a set of KVM virtual machines. The VMs use RBD for
> disk storage. On the VM side I'm using virtio-scsi instead of virtio-blk
> in order to gain DISCARD support.
>
> Each OSD node runs on a separate machine, using a 3TB WD Black drive plus
> a Samsung SSD for the journal. The machines used for the OSD nodes are not
> equal in spec: three of them are small servers, while one is a desktop PC.
> The last node is the one causing trouble. During high load caused by
> remapping after one of the other nodes went down, I experienced some slow
> requests. To my surprise, these slow requests caused aborts from the block
> device on the VM side, which ended up corrupting files.
>
> What I wonder is whether such behaviour (aborts) is normal when slow
> requests pile up. I always thought that these requests would be delayed
> but eventually handled. Are there any tunables that would help me avoid
> such situations? I would really like to avoid VM outages caused by such
> corruption issues.
>
> I can attach some logs if needed.
>
> Best regards
> Chris

Hi, this is an inevitable payoff for using the SCSI backend on storage that
can get slow enough. There were some Argonaut/Bobtail-era discussions on the
Ceph mailing list; those threads may be interesting reading for you. AFAIR
the SCSI disk will abort after about 70s of not receiving an ack for a
pending operation.

Can this timeout be increased in some way? I've searched around and found
the /sys/block/sdx/device/timeout knob, which in my case is set to 30s.

As for the versions, I'm running all Ceph nodes on Gentoo with Ceph version
0.80.5. The VM guest in question is running Ubuntu 12.04 LTS with kernel
3.13. The guest filesystem is BTRFS. I'm thinking that the corruption may be
some BTRFS bug.
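
For reference, a rough sketch (not from the thread) of how the suggestion
above could be applied to every disk in the guest at once, assuming sd*
device naming and a 60 s value:

    #!/bin/sh
    # Sketch: raise the SCSI command timeout to 60 s on all sd* block devices.
    # The sysfs value resets on reboot, so re-run this from a startup script.
    for t in /sys/block/sd*/device/timeout; do
        echo 60 > "$t"
    done

Checking afterwards with cat /sys/block/sd*/device/timeout shows whether the
new value took effect; the setting is per-device and not persistent across
reboots.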