Seems like we are hitting the qemu file descriptor limits. Thanks for pointing that out.

Thanks,
Jeyaganesh.

On 7/15/15, 2:12 AM, "Jan Schermer" <jan@xxxxxxxxxxx> wrote:

>We are getting the same log message as you, but not too often, and from what I gathered it is normal to see that.
>Not sure how often you are seeing those.
>
>How many disks are connected to that VM? Take a look in /proc/pid/fd and count the file descriptors, then compare that to /proc/pid/limits. My first guess is that the problem is here. Basically qemu needs to make a connection to every OSD hosting the PGs the RBD image is on, and that could be thousands (and since you have close to 500 osds I guess you are hitting that limit much sooner than most folks).
>
>Jan
>
>> On 15 Jul 2015, at 10:23, Jeya Ganesh Babu Jegatheesan <jjeya@xxxxxxxxxxx> wrote:
>>
>> On 7/15/15, 12:17 AM, "Jan Schermer" <jan@xxxxxxxxxxx> wrote:
>>
>>> Do you have a comparison of the same workload on different storage than Ceph?
>>
>> I don't have direct comparison data with different storage. I did run a similar workload with a single VM booted off a local disk and didn't see any issue, but as that was a single local disk, the performance was not comparable with Ceph.
>>
>>> I am asking because those messages indicate slow requests - basically some operation took 120s (which would look insanely high to a storage administrator, but is sort of expected in a cloud environment). And even on regular direct-attached storage some OPs can take that long, but those issues are masked in drivers (they don't necessarily manifest as "task blocked" or "soft lockups" but just as 100% iowait for periods of time - which is considered normal under load).
>>>
>>> In other words - are you seeing a real problem with your workload or just the messages?
>>
>> The issue is that access to the volumes gets stuck and never recovers. The jbd2 kernel thread locks up. The system recovers only after a reboot.
>>
>>> If you don't have any slow ops then you either have the warning threshold set too high, or those blocked operations consist of more than just one OP - it adds up.
>>>
>>> It could also be caused by a network problem (like misconfigured offloads on network cards causing retransmissions/reorders and such), or if, for example, you run out of file descriptors on either the client or the server side, it manifests as some requests getting stuck _without_ any slow ops on the ceph cluster side.
>>
>> Yes, we are checking the network as well. Btw, I see the following messages in the osd logs (around 5-6 messages per day per osd). Could this point to a network issue? Would it cause a write to get stuck?
>>
>> ceph-osd.296.log:2015-07-14 19:17:04.907107 7ffd1fc43700 0 -- 10.163.45.3:6893/2046 submit_message osd_op_reply(562032 rbd_data.2f982f3c214f59de.000000000000003d [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1253376~4096] v54601'407287 uv407287 ack = 0) v6 remote, 10.163.43.1:0/1076109, failed lossy con, dropping message 0x11cd5600
>>
>> 2015-07-14 17:35:24.209722 7ffc67c4d700 0 -- 10.163.45.3:6893/2046 >> 10.163.42.14:0/1004886 pipe(0x18521c80 sd=255 :6893 s=0 pgs=0 cs=0 l=1 c=0xd86a680).accept replacing existing (lossy) channel (new one lossy=1)
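
To make the /proc check Jan suggests above concrete, here is a minimal sketch in Python. It assumes a Linux host and that the qemu-kvm PID of the affected VM is passed on the command line; the 90% warning threshold is only illustrative, not anything qemu or Ceph enforces.

#!/usr/bin/env python
# Count the open file descriptors of a process (e.g. the qemu-kvm process
# backing the stuck VM) and compare the count to its "Max open files" limit.
import os
import sys

def open_files_limit(pid):
    # /proc/<pid>/limits contains lines like:
    # "Max open files            1024                 4096                 files"
    with open('/proc/%d/limits' % pid) as f:
        for line in f:
            if line.startswith('Max open files'):
                fields = line.split()
                soft, hard = fields[3], fields[4]
                return (None if soft == 'unlimited' else int(soft),
                        None if hard == 'unlimited' else int(hard))
    return None, None

def fd_count(pid):
    # Every open descriptor shows up as an entry in /proc/<pid>/fd.
    return len(os.listdir('/proc/%d/fd' % pid))

if __name__ == '__main__':
    pid = int(sys.argv[1])                      # qemu-kvm PID of the affected VM
    used = fd_count(pid)
    soft, hard = open_files_limit(pid)
    print('pid %d: %d fds open (soft limit %s, hard limit %s)'
          % (pid, used, soft, hard))
    if soft is not None and used >= 0.9 * soft:
        print('fd usage is within 10% of the soft limit')

If the count is at or near the soft limit, raising the file-descriptor limit for the qemu processes (however your libvirt/distribution setup configures that) is the obvious next step, which matches the conclusion at the top of this message.
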
>>
>>>
>>> And an obligatory question - you say your OSDs don't use much CPU, but how are the disks? Aren't some of them 100% utilized when this happens?
>>
>> The disks as well are not fully utilized; the iops and bandwidth are quite low. The load as such is evenly distributed across the disks.
>>
>>> Jan
>>>
>>>> On 15 Jul 2015, at 02:23, Jeya Ganesh Babu Jegatheesan <jjeya@xxxxxxxxxxx> wrote:
>>>>
>>>> On 7/14/15, 4:56 PM, "ceph-users on behalf of Wido den Hollander" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of wido@xxxxxxxx> wrote:
>>>>
>>>>> On 07/15/2015 01:17 AM, Jeya Ganesh Babu Jegatheesan wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We have an OpenStack + Ceph cluster based on the Giant release. We use ceph for the VM volumes, including the boot volumes. Under load, we see write access to the volumes get stuck from within the VM. The same would work after a VM reboot. The issue is seen with and without rbd cache. Let me know if this is a known issue and how to debug further. The ceph cluster itself seems to be clean. We have currently disabled scrub and deep scrub. 'ceph -s' output is below.
>>>>>>
>>>>> Are you seeing slow requests in the system?
>>>>
>>>> I don't see slow requests in the cluster.
>>>>
>>>>> Are any of the disks under the OSDs 100% busy or close to it?
>>>>
>>>> Most of the OSDs use 20% of a core. There is no OSD process busy at 100%.
>>>>
>>>>> Btw, the amount of PGs is rather high. You are at 88, while the formula recommends:
>>>>>
>>>>> num_osd * 100 / 3 = 14k (cluster total)
>>>>
>>>> We used 30 * num_osd per pool. We do have 4 pools; I believe that's why the PG count seems to be high.
>>>>
>>>>> Wido
>>>>>
>>>>>> cluster eaaeaa55-a8e7-4531-a5eb-03d73028b59d
>>>>>> health HEALTH_WARN noscrub,nodeep-scrub flag(s) set
>>>>>> monmap e71: 9 mons at {gngsvc009a=10.163.43.1:6789/0,gngsvc009b=10.163.43.2:6789/0,gngsvc010a=10.163.43.5:6789/0,gngsvc010b=10.163.43.6:6789/0,gngsvc011a=10.163.43.9:6789/0,gngsvc011b=10.163.43.10:6789/0,gngsvc011c=10.163.43.11:6789/0,gngsvm010d=10.163.43.8:6789/0,gngsvm011d=10.163.43.12:6789/0}, election epoch 22246, quorum 0,1,2,3,4,5,6,7,8 gngsvc009a,gngsvc009b,gngsvc010a,gngsvc010b,gngsvm010d,gngsvc011a,gngsvc011b,gngsvc011c,gngsvm011d
>>>>>> osdmap e54600: 425 osds: 425 up, 425 in
>>>>>> flags noscrub,nodeep-scrub
>>>>>> pgmap v13257438: 37620 pgs, 4 pools, 134 TB data, 35289 kobjects
>>>>>> 402 TB used, 941 TB / 1344 TB avail
>>>>>> 37620 active+clean
>>>>>> client io 94059 kB/s rd, 313 MB/s wr, 4623 op/s
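
For reference, a back-of-the-envelope check of the PG sizing discussed above, using the figures from the status output (425 OSDs, 4 pools, 37,620 PGs). A pool size of 3 is assumed here; it is implied by Wido's num_osd * 100 / 3 formula and by the roughly 3x ratio of raw usage (402 TB) to data (134 TB), but it is not stated outright in the thread.

# Rough PG arithmetic based on the figures above; the replica count of 3 is an
# assumption (implied by the num_osd * 100 / 3 formula, not stated outright).
osds      = 425
pgs_total = 37620
replicas  = 3

# Wido's guideline: num_osd * 100 / replicas PGs for the whole cluster,
# i.e. roughly 100 PG copies landing on each OSD.
recommended_total = osds * 100 // replicas
print('recommended cluster total: ~%d PGs' % recommended_total)          # ~14166

# Presumably where "you are at 88" comes from: PGs per OSD before replication.
print('PGs per OSD: %.1f' % (pgs_total / float(osds)))                   # ~88.5

# Counting all replicas, each OSD carries far more than the ~100 PG copies
# the guideline targets.
print('PG copies per OSD: %.0f' % (pgs_total * replicas / float(osds)))  # ~266

So each OSD is carrying close to three times the PG copies the guideline aims for, which is consistent with Wido's remark that the PG count is on the high side.
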
>>>>>>
>>>>>> The traces we see in the VM's kernel are as below.
>>>>>>
>>>>>> [ 1080.552901] INFO: task jbd2/vdb-8:813 blocked for more than 120 seconds.
>>>>>> [ 1080.553027] Tainted: GF O 3.13.0-34-generic #60~precise1-Ubuntu
>>>>>> [ 1080.553157] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>> [ 1080.553295] jbd2/vdb-8 D ffff88003687e3e0 0 813 2 0x00000000
>>>>>> [ 1080.553298] ffff880444fadb48 0000000000000002 ffff880455114440 ffff880444fadfd8
>>>>>> [ 1080.553302] 0000000000014440 0000000000014440 ffff88044a9317f0 ffff88044b7917f0
>>>>>> [ 1080.553303] ffff880444fadb48 ffff880455114cd8 ffff88044b7917f0 ffffffff811fc670
>>>>>> [ 1080.553307] Call Trace:
>>>>>> [ 1080.553309] [<ffffffff811fc670>] ? __wait_on_buffer+0x30/0x30
>>>>>> [ 1080.553311] [<ffffffff8175b8b9>] schedule+0x29/0x70
>>>>>> [ 1080.553313] [<ffffffff8175b98f>] io_schedule+0x8f/0xd0
>>>>>> [ 1080.553315] [<ffffffff811fc67e>] sleep_on_buffer+0xe/0x20
>>>>>> [ 1080.553316] [<ffffffff8175c052>] __wait_on_bit+0x62/0x90
>>>>>> [ 1080.553318] [<ffffffff811fc670>] ? __wait_on_buffer+0x30/0x30
>>>>>> [ 1080.553320] [<ffffffff8175c0fc>] out_of_line_wait_on_bit+0x7c/0x90
>>>>>> [ 1080.553322] [<ffffffff810aff70>] ? wake_atomic_t_function+0x40/0x40
>>>>>> [ 1080.553324] [<ffffffff811fc66e>] __wait_on_buffer+0x2e/0x30
>>>>>> [ 1080.553326] [<ffffffff8129806b>] jbd2_journal_commit_transaction+0x136b/0x1520
>>>>>> [ 1080.553329] [<ffffffff810a1f75>] ? sched_clock_local+0x25/0x90
>>>>>> [ 1080.553331] [<ffffffff8109a7b8>] ? finish_task_switch+0x128/0x170
>>>>>> [ 1080.553333] [<ffffffff8107891f>] ? try_to_del_timer_sync+0x4f/0x70
>>>>>> [ 1080.553334] [<ffffffff8129c5d8>] kjournald2+0xb8/0x240
>>>>>> [ 1080.553336] [<ffffffff810afef0>] ? __wake_up_sync+0x20/0x20
>>>>>> [ 1080.553338] [<ffffffff8129c520>] ? commit_timeout+0x10/0x10
>>>>>> [ 1080.553340] [<ffffffff8108fa79>] kthread+0xc9/0xe0
>>>>>> [ 1080.553343] [<ffffffff8108f9b0>] ? flush_kthread_worker+0xb0/0xb0
>>>>>> [ 1080.553346] [<ffffffff8176827c>] ret_from_fork+0x7c/0xb0
>>>>>> [ 1080.553349] [<ffffffff8108f9b0>] ? flush_kthread_worker+0xb0/0xb0
>>>>>>
>>>>>> Thanks,
>>>>>> Jeyaganesh.
>>>>>
>>>>> --
>>>>> Wido den Hollander
>>>>> 42on B.V.
>>>>> Ceph trainer and consultant
>>>>>
>>>>> Phone: +31 (0)20 700 9902
>>>>> Skype: contact42on
>>>>
>>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com