On Fri, Feb 6, 2015 at 12:16 PM, Krzysztof Nowicki <krzysztof.a.nowicki@xxxxxxxxx> wrote:
> Hi all,
>
> I'm running a small Ceph cluster with 4 OSD nodes, which serves as a storage
> backend for a set of KVM virtual machines. The VMs use RBD for disk storage.
> On the VM side I'm using virtio-scsi instead of virtio-blk in order to gain
> DISCARD support.
>
> Each OSD node is running on a separate machine, using a 3TB WD Black drive +
> a Samsung SSD for the journal. The machines used for OSD nodes are not equal
> in spec. Three of them are small servers, while one is a desktop PC. The last
> node is the one causing trouble. During high load caused by remapping after
> one of the other nodes went down, I experienced some slow requests. To my
> surprise, however, these slow requests caused aborts from the block device
> on the VM side, which ended up corrupting files.
>
> What I wonder is whether such behaviour (aborts) is normal when slow requests
> pile up. I always thought that these requests would be delayed but eventually
> they'd be handled. Are there any tunables that would help me avoid such
> situations? I would really like to avoid VM outages caused by such
> corruption issues.
>
> I can attach some logs if needed.
>
> Best regards
> Chris

Hi, this is the inevitable payoff for using a SCSI backend on storage that can
get slow enough. There were some argonaut/bobtail-era discussions about this on
the ceph ML; those threads may be interesting reading for you. AFAIR the SCSI
disk will abort after 70s of not receiving an ack for a pending operation.
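
If you just want more headroom on the guest side, you can raise the per-device
SCSI command timer from inside the VM. A minimal sketch, assuming a Linux guest
where virtio-scsi exposes the timer at /sys/block/<dev>/device/timeout, and the
300s value is just an arbitrary number you'd pick above your worst-case
recovery window:

    #!/usr/bin/env python3
    # Raise the SCSI command timeout on all sd* devices so transient
    # Ceph slow requests don't trigger aborts. Run as root in the VM.
    # NOTE: 300 is a hypothetical value; choose something above your
    # cluster's worst-case recovery time.
    import glob

    NEW_TIMEOUT_S = 300

    for path in glob.glob("/sys/block/sd*/device/timeout"):
        with open(path) as f:
            old = f.read().strip()
        with open(path, "w") as f:
            f.write(str(NEW_TIMEOUT_S))
        print(f"{path}: {old} -> {NEW_TIMEOUT_S}")

Note this only lasts until reboot; a udev rule writing the same sysfs attribute
is the usual way to make it persistent. It won't make the slow requests go
away, of course, it just stops the guest from aborting while they drain.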