Seems like we are hitting the qemu file descriptor limits. Thanks for pointing that out.

Thanks,
Jeyaganesh.

On 7/15/15, 2:12 AM, "Jan Schermer" <jan@xxxxxxxxxxx> wrote:

>We are getting the same log message as you, but not too often, and from what I gathered it is normal to see that.
>Not sure how often you are seeing those.
>
>How many disks are connected to that VM? Take a look in /proc/pid/fd and count the file descriptors, then compare that to /proc/pid/limits. My first guess is that the problem is here. Basically qemu needs to make a connection to every OSD hosting the PGs the RBD image is on, and that could be thousands (and since you have close to 500 osds I guess you are hitting that limit much sooner than most folks).
>
>Jan
>
>> On 15 Jul 2015, at 10:23, Jeya Ganesh Babu Jegatheesan <jjeya@xxxxxxxxxxx> wrote:
>>
>> On 7/15/15, 12:17 AM, "Jan Schermer" <jan@xxxxxxxxxxx> wrote:
>>
>>> Do you have a comparison of the same workload on different storage than Ceph?
>>
>> I don't have direct comparison data with different storage. I did run a similar workload with a single VM booted off a local disk and didn't see any issue, but as that was a single local disk, the performance was not comparable with Ceph.
>>
>>> I am asking because those messages indicate slow requests - basically some operation took 120s (which would look insanely high to a storage administrator, but is sort of expected in a cloud environment). And even on regular direct-attached storage some OPs can take that long, but those issues are masked in drivers (they don't necessarily manifest as "task blocked" or "soft lockups" but just as 100% iowait for periods of time - which is considered normal under load).
>>>
>>> In other words - are you seeing a real problem with your workload or just the messages?
>>
>> The issue is that access to the volumes gets stuck and never recovers. The jbd2 kernel thread locks up. The system recovers only after a reboot.
>>
>>> If you don't have any slow ops then you either have the warning threshold set too high, or those blocked operations consist of more than just one OP - it adds up.
>>>
>>> It could also be caused by a network problem (like misconfigured offloads on network cards causing retransmissions/reorders and such), or if, for example, you run out of file descriptors on either the client or the server side, it manifests as some requests getting stuck _without_ any slow ops on the ceph cluster side.
>>
>> Yes, we are checking the network as well. Btw, I see the following messages in the osd logs (around 5-6 messages per day per osd). Could this point to a network issue? Would it cause a write to get stuck?
>>
>> ceph-osd.296.log:2015-07-14 19:17:04.907107 7ffd1fc43700 0 -- 10.163.45.3:6893/2046 submit_message osd_op_reply(562032 rbd_data.2f982f3c214f59de.000000000000003d [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1253376~4096] v54601'407287 uv407287 ack = 0) v6 remote, 10.163.43.1:0/1076109, failed lossy con, dropping message 0x11cd5600
>>
>> 2015-07-14 17:35:24.209722 7ffc67c4d700 0 -- 10.163.45.3:6893/2046 >> 10.163.42.14:0/1004886 pipe(0x18521c80 sd=255 :6893 s=0 pgs=0 cs=0 l=1 c=0xd86a680).accept replacing existing (lossy) channel (new one lossy=1)
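
To make the /proc check Jan suggests above concrete, here is a minimal sketch in Python. It assumes a Linux host and that the qemu-kvm PID of the affected VM is passed on the command line; the 90% warning threshold is only illustrative, not anything qemu or Ceph enforces.

#!/usr/bin/env python
# Count the open file descriptors of a process (e.g. the qemu-kvm process
# backing the stuck VM) and compare the count to its "Max open files" limit.
import os
import sys

def open_files_limit(pid):
    # /proc/<pid>/limits contains lines like:
    # "Max open files            1024                 4096                 files"
    with open('/proc/%d/limits' % pid) as f:
        for line in f:
            if line.startswith('Max open files'):
                fields = line.split()
                soft, hard = fields[3], fields[4]
                return (None if soft == 'unlimited' else int(soft),
                        None if hard == 'unlimited' else int(hard))
    return None, None

def fd_count(pid):
    # Every open descriptor shows up as an entry in /proc/<pid>/fd.
    return len(os.listdir('/proc/%d/fd' % pid))

if __name__ == '__main__':
    pid = int(sys.argv[1])                      # qemu-kvm PID of the affected VM
    used = fd_count(pid)
    soft, hard = open_files_limit(pid)
    print('pid %d: %d fds open (soft limit %s, hard limit %s)'
          % (pid, used, soft, hard))
    if soft is not None and used >= 0.9 * soft:
        print('fd usage is within 10% of the soft limit')

If the count is at or near the soft limit, raising the file-descriptor limit for the qemu processes (however your libvirt/distribution setup configures that) is the obvious next step, which matches the conclusion at the top of this message.
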
>>
>>>
>>> And an obligatory question - you say your OSDs don't use much CPU, but how are the disks? Aren't some of them 100% utilized when this happens?
>>
>> The disks as well are not fully utilized; the iops and bandwidth are quite low. The load as such is evenly distributed across the disks.
>>
>>> Jan
>>>
>>>> On 15 Jul 2015, at 02:23, Jeya Ganesh Babu Jegatheesan <jjeya@xxxxxxxxxxx> wrote:
>>>>
>>>> On 7/14/15, 4:56 PM, "ceph-users on behalf of Wido den Hollander" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of wido@xxxxxxxx> wrote:
>>>>
>>>>> On 07/15/2015 01:17 AM, Jeya Ganesh Babu Jegatheesan wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We have an OpenStack + Ceph cluster based on the Giant release. We use ceph for the VM volumes, including the boot volumes. Under load, we see write access to the volumes get stuck from within the VM. The same would work after a VM reboot. The issue is seen with and without rbd cache. Let me know if this is a known issue and how to debug further. The ceph cluster itself seems to be clean. We have currently disabled scrub and deep scrub. 'ceph -s' output is below.
>>>>>>
>>>>> Are you seeing slow requests in the system?
>>>>
>>>> I don't see slow requests in the cluster.
>>>>
>>>>> Are any of the disks under the OSDs 100% busy or close to it?
>>>>
>>>> Most of the OSDs use 20% of a core. There is no OSD process busy at 100%.
>>>>
>>>>> Btw, the amount of PGs is rather high. You are at 88, while the formula recommends:
>>>>>
>>>>> num_osd * 100 / 3 = 14k (cluster total)
>>>>
>>>> We used 30 * num_osd per pool. We do have 4 pools; I believe that's why the PG count seems to be high.
>>>>
>>>>> Wido
>>>>>
>>>>>> cluster eaaeaa55-a8e7-4531-a5eb-03d73028b59d
>>>>>> health HEALTH_WARN noscrub,nodeep-scrub flag(s) set
>>>>>> monmap e71: 9 mons at {gngsvc009a=10.163.43.1:6789/0,gngsvc009b=10.163.43.2:6789/0,gngsvc010a=10.163.43.5:6789/0,gngsvc010b=10.163.43.6:6789/0,gngsvc011a=10.163.43.9:6789/0,gngsvc011b=10.163.43.10:6789/0,gngsvc011c=10.163.43.11:6789/0,gngsvm010d=10.163.43.8:6789/0,gngsvm011d=10.163.43.12:6789/0}, election epoch 22246, quorum 0,1,2,3,4,5,6,7,8 gngsvc009a,gngsvc009b,gngsvc010a,gngsvc010b,gngsvm010d,gngsvc011a,gngsvc011b,gngsvc011c,gngsvm011d
>>>>>> osdmap e54600: 425 osds: 425 up, 425 in
>>>>>> flags noscrub,nodeep-scrub
>>>>>> pgmap v13257438: 37620 pgs, 4 pools, 134 TB data, 35289 kobjects
>>>>>> 402 TB used, 941 TB / 1344 TB avail
>>>>>> 37620 active+clean
>>>>>> client io 94059 kB/s rd, 313 MB/s wr, 4623 op/s
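
For reference, a back-of-the-envelope check of the PG sizing discussed above, using the figures from the status output (425 OSDs, 4 pools, 37,620 PGs). A pool size of 3 is assumed here; it is implied by Wido's num_osd * 100 / 3 formula and by the roughly 3x ratio of raw usage (402 TB) to data (134 TB), but it is not stated outright in the thread.

# Rough PG arithmetic based on the figures above; the replica count of 3 is an
# assumption (implied by the num_osd * 100 / 3 formula, not stated outright).
osds      = 425
pgs_total = 37620
replicas  = 3

# Wido's guideline: num_osd * 100 / replicas PGs for the whole cluster,
# i.e. roughly 100 PG copies landing on each OSD.
recommended_total = osds * 100 // replicas
print('recommended cluster total: ~%d PGs' % recommended_total)          # ~14166

# Presumably where "you are at 88" comes from: PGs per OSD before replication.
print('PGs per OSD: %.1f' % (pgs_total / float(osds)))                   # ~88.5

# Counting all replicas, each OSD carries far more than the ~100 PG copies
# the guideline targets.
print('PG copies per OSD: %.0f' % (pgs_total * replicas / float(osds)))  # ~266

So each OSD is carrying close to three times the PG copies the guideline aims for, which is consistent with Wido's remark that the PG count is on the high side.
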
>>>>>>
>>>>>> The traces we see in the VM's kernel are as below.
>>>>>>
>>>>>> [ 1080.552901] INFO: task jbd2/vdb-8:813 blocked for more than 120 seconds.
>>>>>> [ 1080.553027] Tainted: GF O 3.13.0-34-generic #60~precise1-Ubuntu
>>>>>> [ 1080.553157] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>> [ 1080.553295] jbd2/vdb-8 D ffff88003687e3e0 0 813 2 0x00000000
>>>>>> [ 1080.553298] ffff880444fadb48 0000000000000002 ffff880455114440 ffff880444fadfd8
>>>>>> [ 1080.553302] 0000000000014440 0000000000014440 ffff88044a9317f0 ffff88044b7917f0
>>>>>> [ 1080.553303] ffff880444fadb48 ffff880455114cd8 ffff88044b7917f0 ffffffff811fc670
>>>>>> [ 1080.553307] Call Trace:
>>>>>> [ 1080.553309] [<ffffffff811fc670>] ? __wait_on_buffer+0x30/0x30
>>>>>> [ 1080.553311] [<ffffffff8175b8b9>] schedule+0x29/0x70
>>>>>> [ 1080.553313] [<ffffffff8175b98f>] io_schedule+0x8f/0xd0
>>>>>> [ 1080.553315] [<ffffffff811fc67e>] sleep_on_buffer+0xe/0x20
>>>>>> [ 1080.553316] [<ffffffff8175c052>] __wait_on_bit+0x62/0x90
>>>>>> [ 1080.553318] [<ffffffff811fc670>] ? __wait_on_buffer+0x30/0x30
>>>>>> [ 1080.553320] [<ffffffff8175c0fc>] out_of_line_wait_on_bit+0x7c/0x90
>>>>>> [ 1080.553322] [<ffffffff810aff70>] ? wake_atomic_t_function+0x40/0x40
>>>>>> [ 1080.553324] [<ffffffff811fc66e>] __wait_on_buffer+0x2e/0x30
>>>>>> [ 1080.553326] [<ffffffff8129806b>] jbd2_journal_commit_transaction+0x136b/0x1520
>>>>>> [ 1080.553329] [<ffffffff810a1f75>] ? sched_clock_local+0x25/0x90
>>>>>> [ 1080.553331] [<ffffffff8109a7b8>] ? finish_task_switch+0x128/0x170
>>>>>> [ 1080.553333] [<ffffffff8107891f>] ? try_to_del_timer_sync+0x4f/0x70
>>>>>> [ 1080.553334] [<ffffffff8129c5d8>] kjournald2+0xb8/0x240
>>>>>> [ 1080.553336] [<ffffffff810afef0>] ? __wake_up_sync+0x20/0x20
>>>>>> [ 1080.553338] [<ffffffff8129c520>] ? commit_timeout+0x10/0x10
>>>>>> [ 1080.553340] [<ffffffff8108fa79>] kthread+0xc9/0xe0
>>>>>> [ 1080.553343] [<ffffffff8108f9b0>] ? flush_kthread_worker+0xb0/0xb0
>>>>>> [ 1080.553346] [<ffffffff8176827c>] ret_from_fork+0x7c/0xb0
>>>>>> [ 1080.553349] [<ffffffff8108f9b0>] ? flush_kthread_worker+0xb0/0xb0
>>>>>>
>>>>>> Thanks,
>>>>>> Jeyaganesh.
>>>>>
>>>>> --
>>>>> Wido den Hollander
>>>>> 42on B.V.
>>>>> Ceph trainer and consultant
>>>>>
>>>>> Phone: +31 (0)20 700 9902
>>>>> Skype: contact42on
>>>>
>>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com