OSD slow requests causing disk aborts in KVM

Krzysztof Nowicki <krzysztof.a.nowicki@xxxxxxxxx> · Fri, 06 Feb 2015 09:16:30 +0000

Hi all,
I'm running a small Ceph cluster with 4 OSD nodes, which serves as a storage backend for a set of KVM virtual machines. The VMs use RBD for disk storage. On the VM side I'm using virtio-scsi instead of virtio-blk in order to gain DISCARD support.

Each OSD node runs on a separate machine, using a 3TB WD Black drive plus a Samsung SSD for the journal. The machines used for the OSD nodes are not equal in spec: three of them are small servers, while one is a desktop PC. The desktop PC is the one causing trouble. During high load caused by remapping after one of the other nodes went down, I experienced some slow requests. To my surprise, however, these slow requests caused aborts from the block device on the VM side, which ended up corrupting files.

What I wonder is whether such behaviour (aborts) is normal when slow requests pile up. I always thought that these requests would be delayed but would eventually be handled. Are there any tunables that would help me avoid such situations? I would really like to avoid VM outages caused by such corruption issues.

I can attach some logs if needed.

Best regards
Chris