Re: libvirt rbd issue

Rafael Lopez <rafael.lopez@xxxxxxxxxx> · Fri, 4 Sep 2015 17:19:11 +1000

We don't have thousands, but these RBDs are in a pool backed by ~600 OSDs.
I can see the fd count go well past 10k, closer to 15k, when I use a decent number of RBDs (e.g. 16 or 32), and it seems to increase further the bigger the file I write. Procs are almost 30k when writing a 50GB file across that number of OSDs.

The change in qemu.conf worked for me, using RHEL 7.1 with systemd.

On 3 September 2015 at 19:46, Jan Schermer <jan@xxxxxxxxxxx> wrote:
You're like the 5th person here (including me) that was hit by this.

Could I get some input from someone using Ceph with RBD and thousands of OSDs? How high did you have to go?

I only have ~200 OSDs and I had to bump the limit up to 10000 for VMs that have multiple volumes attached. This doesn't seem right. I understand this is the effect of striping a volume across multiple PGs, but shouldn't this be more limited or somehow garbage collected?

And to get deeper - I suppose there will be one connection from QEMU to OSD for each NCQ queue? Or how does this work? blk-mq will likely be different again... Or is it decoupled from the virtio side of things by RBD cache if that's enabled? 

Anyway, out of the box, at least on OpenStack installations:
1) Anyone having more than a few OSDs should really bump this up by default.
2) librbd should handle this situation gracefully by recycling connections, instead of hanging.
3) At least we should get a warning somewhere (in the libvirt/qemu log). I don't think anything is logged when the issue hits.

Should I make tickets for this?

Jan
On 03 Sep 2015, at 02:57, Rafael Lopez <rafael.lopez@xxxxxxxxxx> wrote:

Hi Jan,
Thanks for the advice, hit the nail on the head.

I checked the limits and watched the number of fds; as it reached the soft limit (1024), that's when the transfer came to a grinding halt and the VM started locking up.

After your reply I also did some more googling and found another old thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-December/026187.html

I increased the max_files in qemu.conf and restarted libvirtd and the VM (as per Dan's solution in thread above), and now it seems to be happy copying any size files to the rbd. Confirmed the fd count is going past the previous soft limit of 1024 also.
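
For reference, the qemu.conf change described above would look roughly like the fragment below. This is only a sketch: the file path is the usual libvirt default and the value is illustrative, and neither is taken from this thread.

```
# /etc/libvirt/qemu.conf  (usual default path; an assumption here)
# Illustrative value only; pick one that suits the number of OSDs and volumes.
max_files = 32768
```

libvirtd and the VM then need restarting for the new limit to apply; on a systemd host that would typically be `systemctl restart libvirtd`.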

Thanks again!!
Raf

On 2 September 2015 at 18:44, Jan Schermer <jan@xxxxxxxxxxx> wrote:
1) Take a look at the number of file descriptors the QEMU process is using; I think you are over the limits.

pid=<PID of the QEMU process>

cat /proc/$pid/limits

echo /proc/$pid/fd/* | wc -w

2) Jumbo frames may be the cause; are they enabled on the rest of the network? In any case, get rid of NetworkManager ASAP and set the MTU manually, though it looks like your NIC might not support jumbo frames.
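
The file descriptor check in step 1 can be wrapped into a few small shell functions. This is a minimal sketch assuming a Linux /proc filesystem; the function names (fd_count, fd_soft_limit, fd_report) are made up for illustration.

```shell
#!/bin/sh
# Count the open file descriptors of a process by listing /proc/<pid>/fd.
fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Print the soft "Max open files" limit from /proc/<pid>/limits.
# The line looks like: "Max open files  1024  4096  files",
# so the soft limit is the fourth whitespace-separated field.
fd_soft_limit() {
    awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

# Print a one-line summary; pass the PID as the first argument.
fd_report() {
    pid=$1
    used=$(fd_count "$pid")
    soft=$(fd_soft_limit "$pid")
    echo "pid $pid: $used of $soft file descriptors in use"
}
```

Calling `fd_report "$pid"` with the PID of the QEMU process while a large transfer is running should show whether the count is approaching the soft limit. Reading another user's process usually requires root.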

Jan

> On 02 Sep 2015, at 01:44, Rafael Lopez <rafael.lopez@xxxxxxxxxx> wrote:
>
> Hi ceph-users,
>
> Hoping to get some help with a tricky problem. I have a rhel7.1 VM guest (host machine also rhel7.1) with root disk presented from ceph 0.94.2-0 (rbd) using libvirt.
>
> The VM also has a second rbd for storage presented from the same ceph cluster, also using libvirt.
>
> The VM boots fine, no apparent issues with the OS root rbd. I am able to mount the storage disk in the VM, and create a file system. I can even transfer small files to it. But when I try to transfer moderately sized files, e.g. greater than 1GB, it seems to slow to a grinding halt and eventually it locks up the whole system, and generates the kernel messages below.
>
> I have googled some *similar* issues around, but haven't come across any solid advice or fix. So far I have tried modifying the libvirt disk cache settings, the latest mainline kernel (4.2+), and different file systems (ext4, xfs, zfs); all produce similar results. I suspect it may be network related, as when I was using the mainline kernel I was transferring some files to the storage disk and this message came up, and the transfer seemed to stop at the same time:
>
> Sep  1 15:31:22 nas1-rds NetworkManager[724]: <error> [1441085482.078646] [platform/nm-linux-platform.c:2133] sysctl_set(): sysctl: failed to set '/proc/sys/net/ipv6/conf/eth0/mtu' to '9000': (22) Invalid argument
>
> I think maybe the key info for troubleshooting is that it seems to be OK for files under 1GB.
>
> Any ideas would be appreciated.
>
> Cheers,
> Raf
>
> Sep  1 16:04:15 nas1-rds kernel: INFO: task kworker/u8:1:60 blocked for more than 120 seconds.
> Sep  1 16:04:15 nas1-rds kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep  1 16:04:15 nas1-rds kernel: kworker/u8:1    D ffff88023fd93680     0    60      2 0x00000000
> Sep  1 16:04:15 nas1-rds kernel: Workqueue: writeback bdi_writeback_workfn (flush-252:80)
> Sep  1 16:04:15 nas1-rds kernel: ffff880230c136b0 0000000000000046 ffff8802313c4440 ffff880230c13fd8
> Sep  1 16:04:15 nas1-rds kernel: ffff880230c13fd8 ffff880230c13fd8 ffff8802313c4440 ffff88023fd93f48
> Sep  1 16:04:15 nas1-rds kernel: ffff880230c137b0 ffff880230fbcb08 ffffe8ffffd80ec0 ffff88022e827590
> Sep  1 16:04:15 nas1-rds kernel: Call Trace:
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff8160955d>] io_schedule+0x9d/0x130
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff812b8d5f>] bt_get+0x10f/0x1a0
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff81098230>] ? wake_up_bit+0x30/0x30
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff812b90ef>] blk_mq_get_tag+0xbf/0xf0
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff812b4f3b>] __blk_mq_alloc_request+0x1b/0x1f0
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff812b68a1>] blk_mq_map_request+0x181/0x1e0
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff812b7a1a>] blk_sq_make_request+0x9a/0x380
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff812aa28f>] ? generic_make_request_checks+0x24f/0x380
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff812aa4a2>] generic_make_request+0xe2/0x130
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff812aa561>] submit_bio+0x71/0x150
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffffa01ddc55>] ext4_io_submit+0x25/0x50 [ext4]
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffffa01dde09>] ext4_bio_write_page+0x159/0x2e0 [ext4]
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffffa01d4f6d>] mpage_submit_page+0x5d/0x80 [ext4]
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffffa01d5232>] mpage_map_and_submit_buffers+0x172/0x2a0 [ext4]
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffffa01da313>] ext4_writepages+0x733/0xd60 [ext4]
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff81162b6e>] do_writepages+0x1e/0x40
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff811efe10>] __writeback_single_inode+0x40/0x220
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff811f0b0e>] writeback_sb_inodes+0x25e/0x420
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff811f0d6f>] __writeback_inodes_wb+0x9f/0xd0
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff811f15b3>] wb_writeback+0x263/0x2f0
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff811f2aec>] bdi_writeback_workfn+0x1cc/0x460
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff8108f0ab>] process_one_work+0x17b/0x470
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff8108fe8b>] worker_thread+0x11b/0x400
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff8108fd70>] ? rescuer_thread+0x400/0x400
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff8109726f>] kthread+0xcf/0xe0
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff81613cfc>] ret_from_fork+0x7c/0xb0
> Sep  1 16:04:15 nas1-rds kernel: [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Rafael Lopez
Data Storage Administrator
Servers & Storage (eSolutions)
+61 3 990 59118