I can't guarantee it's the same as my issue, but from that description it sounds the same.

Jewel 10.2.4 and 10.2.5 tested; hypervisors are Proxmox qemu-kvm, using librbd; 3 ceph nodes with mon+osd on each.

-faster journals, more disks, bcache, rbd_cache, fewer VMs on ceph, iops and bandwidth limits on the client side, jumbo frames, etc. all improve/smooth out performance and mitigate the hangs, but don't prevent them
-hangs are usually associated with blocked requests (I set the complaint time to 5s to see them)
-hangs are very easily caused by rbd snapshot + rbd export-diff to do incremental backups (one snap persistent, plus one more during backup)
-when a qemu VM's io hangs, I have to kill -9 the qemu process for it to stop. Some broken VMs don't appear to be hung until I try to live migrate them (live migrating all VMs helped test solutions)

Finally I have a workaround... disable the exclusive-lock, object-map, and fast-diff rbd features (and restart clients via live migrate). (object-map and fast-diff appear to have no effect on diff or export-diff... so I don't miss them.) I'll file a bug at some point (after I move all VMs back and see if it is still stable). One other user on IRC said this solved the same problem (also using rbd snapshots). And strangely, the VMs don't seem to hang if I put those features back until a few days later (making testing much less easy... but now I'm very sure removing them prevents the issue).

I hope this works for you (and maybe gets some attention from devs too), so you don't waste months like I did.

On 03/27/17 19:31, Hall, Eric wrote:
> In an OpenStack (mitaka) cloud, backed by a ceph cluster (10.2.6 jewel), using libvirt/qemu (1.3.1/2.5) hypervisors on Ubuntu 14.04.5 compute and ceph hosts, we occasionally see hung processes (usually during boot, but otherwise as well), with errors reported in the instance logs as shown below. Configuration is vanilla, based on openstack/ceph docs.
>
> Neither the compute hosts nor the ceph hosts appear to be overloaded in terms of memory or network bandwidth, none of the 67 osds are over 80% full, nor do any of them appear to be overwhelmed in terms of IO. Compute hosts and ceph cluster are connected via a relatively quiet 1Gb network, with an IBoE net between the ceph nodes. Neither network appears overloaded.
>
> I don’t see any related (to my eye) errors in client or server logs, even with 20/20 logging from various components (rbd, rados, client, objectcacher, etc.) I’ve increased the qemu file descriptor limit (currently 64k... overkill for sure.)
>
> It “feels” like a performance problem, but I can’t find any capacity issues or constraining bottlenecks.
>
> Any suggestions or insights into this situation are appreciated. Thank you for your time,
> --
> Eric
>
>
> [Fri Mar 24 20:30:40 2017] INFO: task jbd2/vda1-8:226 blocked for more than 120 seconds.
> [Fri Mar 24 20:30:40 2017] Not tainted 3.13.0-52-generic #85-Ubuntu
> [Fri Mar 24 20:30:40 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Fri Mar 24 20:30:40 2017] jbd2/vda1-8 D ffff88043fd13180 0 226 2 0x00000000
> [Fri Mar 24 20:30:40 2017] ffff88003728bbd8 0000000000000046 ffff880426900000 ffff88003728bfd8
> [Fri Mar 24 20:30:40 2017] 0000000000013180 0000000000013180 ffff880426900000 ffff88043fd13a18
> [Fri Mar 24 20:30:40 2017] ffff88043ffb9478 0000000000000002 ffffffff811ef7c0 ffff88003728bc50
> [Fri Mar 24 20:30:40 2017] Call Trace:
> [Fri Mar 24 20:30:40 2017] [<ffffffff811ef7c0>] ? generic_block_bmap+0x50/0x50
> [Fri Mar 24 20:30:40 2017] [<ffffffff81726d2d>] io_schedule+0x9d/0x140
> [Fri Mar 24 20:30:40 2017] [<ffffffff811ef7ce>] sleep_on_buffer+0xe/0x20
> [Fri Mar 24 20:30:40 2017] [<ffffffff817271b2>] __wait_on_bit+0x62/0x90
> [Fri Mar 24 20:30:40 2017] [<ffffffff811ef7c0>] ? generic_block_bmap+0x50/0x50
> [Fri Mar 24 20:30:40 2017] [<ffffffff81727257>] out_of_line_wait_on_bit+0x77/0x90
> [Fri Mar 24 20:30:40 2017] [<ffffffff810ab180>] ? autoremove_wake_function+0x40/0x40
> [Fri Mar 24 20:30:40 2017] [<ffffffff811f0afa>] __wait_on_buffer+0x2a/0x30
> [Fri Mar 24 20:30:40 2017] [<ffffffff8128bb4d>] jbd2_journal_commit_transaction+0x185d/0x1ab0
> [Fri Mar 24 20:30:40 2017] [<ffffffff810755df>] ? try_to_del_timer_sync+0x4f/0x70
> [Fri Mar 24 20:30:40 2017] [<ffffffff8128fe7d>] kjournald2+0xbd/0x250
> [Fri Mar 24 20:30:40 2017] [<ffffffff810ab140>] ? prepare_to_wait_event+0x100/0x100
> [Fri Mar 24 20:30:40 2017] [<ffffffff8128fdc0>] ? commit_timeout+0x10/0x10
> [Fri Mar 24 20:30:40 2017] [<ffffffff8108b5d2>] kthread+0xd2/0xf0
> [Fri Mar 24 20:30:40 2017] [<ffffffff8108b500>] ? kthread_create_on_node+0x1c0/0x1c0
> [Fri Mar 24 20:30:40 2017] [<ffffffff8173304c>] ret_from_fork+0x7c/0xb0
> [Fri Mar 24 20:30:40 2017] [<ffffffff8108b500>] ? kthread_create_on_node+0x1c0/0x1c0
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
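P.S. In case it saves anyone some typing, here is a sketch of the workaround and the incremental-backup cycle described in the reply, as rbd commands. The pool name, image name, and snapshot names are made up for illustration, and the script only prints the commands (dry-run) since they need a live cluster; swap the `run` helper for real execution on your own setup.

```shell
#!/bin/sh
# Dry-run sketch of the rbd feature workaround and snapshot backup cycle.
# POOL/IMG/snapshot names are examples, not from the original thread.
POOL=rbd
IMG=vm-disk-1

# Prints each command instead of executing it; replace the echo with
# "$@" to actually run against a cluster.
run() { echo "+ $*"; }

# The workaround: disable the features implicated in the hangs.
# Dependent features (fast-diff needs object-map, which needs
# exclusive-lock) are listed first so they are disabled first.
run rbd feature disable $POOL/$IMG fast-diff object-map exclusive-lock

# The incremental backup cycle that triggered the hangs: one persistent
# snap ("base") plus one more ("tmp") taken during each backup.
run rbd snap create $POOL/$IMG@backup-tmp
run rbd export-diff --from-snap backup-base \
    $POOL/$IMG@backup-tmp /backups/$IMG.diff
run rbd snap rm $POOL/$IMG@backup-tmp

# To surface blocked requests sooner (default complaint time is 30s),
# the reply mentions lowering it to 5s, e.g. in ceph.conf on the OSDs:
#   [osd]
#   osd op complaint time = 5
```

After disabling the features, the clients still need restarting (live migration works) for the change to take effect on open images.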