On Fri, Feb 6, 2015 at 12:16 PM, Krzysztof Nowicki <krzysztof.a.nowicki@xxxxxxxxx> wrote:
> Hi all,
>
> I'm running a small Ceph cluster with 4 OSD nodes, which serves as a storage
> backend for a set of KVM virtual machines. The VMs use RBD for disk storage.
> On the VM side I'm using virtio-scsi instead of virtio-blk in order to gain
> DISCARD support.
>
> Each OSD node is running on a separate machine, using a 3TB WD Black drive +
> a Samsung SSD for the journal. The machines used for OSD nodes are not equal
> in spec. Three of them are small servers, while one is a desktop PC. The last
> node is the one causing trouble. During high load caused by remapping after
> one of the other nodes went down, I experienced some slow requests. To my
> surprise, however, these slow requests caused aborts from the block device
> on the VM side, which ended up corrupting files.
>
> What I wonder is whether such behaviour (aborts) is normal when slow requests
> pile up. I always thought that these requests would be delayed but eventually
> they'd be handled. Are there any tunables that would help me avoid such
> situations? I would really like to avoid VM outages caused by such
> corruption issues.
>
> I can attach some logs if needed.
>
> Best regards
> Chris

Hi, this is the inevitable payoff for using a SCSI backend on storage that can
get slow enough. There were some argonaut/bobtail-era discussions about this on
the ceph ML; those threads may be interesting reading for you. AFAIR the SCSI
disk will abort after 70s of not receiving an ack for a pending operation.
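
If you just want more headroom on the guest side, you can raise the per-device
SCSI command timer from inside the VM. A minimal sketch, assuming a Linux guest
where virtio-scsi exposes the timer at /sys/block/<dev>/device/timeout, and the
300s value is just an arbitrary number you'd pick above your worst-case
recovery window:

    #!/usr/bin/env python3
    # Raise the SCSI command timeout on all sd* devices so transient
    # Ceph slow requests don't trigger aborts. Run as root in the VM.
    # NOTE: 300 is a hypothetical value; choose something above your
    # cluster's worst-case recovery time.
    import glob

    NEW_TIMEOUT_S = 300

    for path in glob.glob("/sys/block/sd*/device/timeout"):
        with open(path) as f:
            old = f.read().strip()
        with open(path, "w") as f:
            f.write(str(NEW_TIMEOUT_S))
        print(f"{path}: {old} -> {NEW_TIMEOUT_S}")

Note this only lasts until reboot; a udev rule writing the same sysfs attribute
is the usual way to make it persistent. It won't make the slow requests go
away, of course, it just stops the guest from aborting while they drain.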