Re: OSD slow requests causing disk aborts in KVM

On Thu, Feb 12, 2015 at 16:23:38, Andrey Korolyov <andrey@xxxxxxx> wrote:

On Fri, Feb 6, 2015 at 12:16 PM, Krzysztof Nowicki
<krzysztof.a.nowicki@xxxxxxxxx> wrote:
> Hi all,
>
> I'm running a small Ceph cluster with 4 OSD nodes, which serves as a storage
> backend for a set of KVM virtual machines. The VMs use RBD for disk storage.
> On the VM side I'm using virtio-scsi instead of virtio-blk in order to gain
> DISCARD support.
>
> Each OSD node is running on a separate machine, using 3TB WD Black drive +
> Samsung SSD for journal. The machines used for OSD nodes are not equal in
> spec. Three of them are small servers, while one is a desktop PC. The last
> node is the one causing trouble. During high loads caused by remapping due
> to one of the other nodes going down I've experienced some slow requests. To
> my surprise however these slow requests caused aborts from the block device
> on the VM side, which ended up corrupting files.
>
> What I wonder is whether such behaviour (aborts) is normal when slow requests
> pile up. I always thought that these requests would be delayed but eventually
> they'd be handled. Are there any tunables that would help me avoid such
> situations? I would really like to avoid VM outages caused by such
> corruption issues.
>
> I can attach some logs if needed.
>
> Best regards
> Chris

Hi, this is the unavoidable cost of using a SCSI backend on storage
that is capable of sufficiently slow operations. There were some
argonaut/bobtail-era discussions on the ceph ml; those may be
interesting reading for you. AFAIR the SCSI disk would abort after 70s of
not receiving an ack for a pending operation.

Can this timeout be increased in some way? I've searched around and found the /sys/block/sdx/device/timeout knob, which in my case is set to 30s.

As for the versions, I'm running all Ceph nodes on Gentoo with Ceph version 0.80.5. The VM guest in question is running Ubuntu 12.04 LTS with kernel 3.13. The guest filesystem is BTRFS.

I'm thinking that the corruption may be some BTRFS bug.
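
For reference, here is a rough sketch of how the /sys/block/sdx/device/timeout knob mentioned above could be read and changed from inside the guest. The helper names and the `sysfs_root` parameter are made up for illustration, and the value of 180 seconds in the usage comment is an arbitrary example, not a recommendation from this thread. Writing to sysfs needs root, and the value does not survive a reboot.

```python
from pathlib import Path


def timeout_path(device: str, sysfs_root: str = "/sys") -> Path:
    """Build the path to the SCSI command timeout file for a block device.

    `device` is a kernel name such as "sda". `sysfs_root` is a hypothetical
    parameter added so the helpers can be pointed at a test directory.
    """
    return Path(sysfs_root) / "block" / device / "device" / "timeout"


def get_scsi_timeout(device: str, sysfs_root: str = "/sys") -> int:
    """Return the current command timeout of the device, in seconds."""
    return int(timeout_path(device, sysfs_root).read_text().strip())


def set_scsi_timeout(device: str, seconds: int, sysfs_root: str = "/sys") -> None:
    """Write a new command timeout (in seconds) for the device.

    Needs root when run against the real /sys. The setting is lost on reboot.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    timeout_path(device, sysfs_root).write_text(f"{seconds}\n")


# Example usage inside the guest (as root), assuming the disk is "sda":
#   print(get_scsi_timeout("sda"))
#   set_scsi_timeout("sda", 180)  # 180 is an arbitrary illustrative value
```

Whether a longer timeout is enough to ride out a remapping period depends on how long the slow requests actually take, so this only addresses the guest side of the problem.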
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
