Re: Ceph cluster NO read / write performance :: Ops are blocked

Shinobu Kinjo <skinjo@xxxxxxxxxx> · Fri, 11 Sep 2015 09:32:27 -0400 (EDT)

If you really want to improve the performance of a *distributed* filesystem
like Ceph, Lustre, or GPFS,
you must start by looking at the networking stack of the Linux kernel.

 L5: Socket
 L4: TCP
 L3: IP
 L2: Queuing

In this discussion, the problem could be in L2, which is where descriptors are queued.
We may have to take a closer look at the qdisc and check whether qlen is large enough.
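
A minimal sketch of how the queuing layer could be inspected, assuming the iproute2 tools (tc, ip) are installed. The interface name eth0 is a placeholder and sum_qdisc_drops is a hypothetical helper, not something taken from this thread. It only adds up the "dropped" counters that "tc -s qdisc show" reports; a total that keeps growing between runs would suggest the qdisc is discarding packets.

```shell
#!/bin/sh
# Hypothetical helper: sum the "dropped" counters found in the output of
# `tc -s qdisc show`, read from stdin.
sum_qdisc_drops() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "(dropped") {
                gsub(",", "", $(i + 1))   # strip the trailing comma
                total += $(i + 1)
            }
    } END { print total + 0 }'
}

# Usage on a live host (eth0 is an assumed interface name):
#   tc -s qdisc show dev eth0 | sum_qdisc_drops
#   ip link show dev eth0     # the "qlen" field is the transmit queue length
```

Comparing the total before and after a benchmark run is more informative than a single reading, since the counters are cumulative.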

But this case:

> 399 16 32445 32429 325.054 84 0.0233839 0.193655
 to
> 400 16 32445 32429 324.241 0 - 0.193655

is probably a different story ;-)
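
To find the exact second at which a run like the one quoted here drops to zero, the bench output can be filtered mechanically. This is only a sketch: bench_stalls is a hypothetical helper, and it assumes the eight-column layout shown in this thread (sec, cur ops, started, finished, avg MB/s, cur MB/s, last lat, avg lat).

```shell
#!/bin/sh
# Hypothetical helper: print every second of a `rados bench` run in which
# cur MB/s (column 6) was 0 after at least one op had been started.
# Leading "> " quote markers from mail clients are stripped first.
bench_stalls() {
    sed 's/^[> ]*//' |
        awk 'NF == 8 && $1 ~ /^[0-9]+$/ && $3 > 0 && $6 == 0 { print $1 }'
}

# Usage (bench.log is an assumed file name holding saved bench output):
#   bench_stalls < bench.log
```

Header rows and the periodic timestamp rows are skipped because they do not have eight fields starting with an integer.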

> needless to say, very strange. 

Yes, it is quite strange like my English...

Shinobu

----- Original Message -----
From: "Vickey Singh" <vickey.singh22693@xxxxxxxxx>
To: "Jan Schermer" <jan@xxxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Sent: Thursday, September 10, 2015 2:22:22 AM
Subject: Re: Ceph cluster NO read / write performance :: Ops are blocked

Hello Jan 

On Wed, Sep 9, 2015 at 11:59 AM, Jan Schermer < jan@xxxxxxxxxxx > wrote: 

Just to recapitulate - the nodes are doing "nothing" when it drops to zero? Not flushing something to drives (iostat)? Not cleaning pagecache (kswapd and similar)? Not out of any type of memory (slab, min_free_kbytes)? No network link errors, no bad checksums (those are hard to spot, though)?
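
A rough sketch of how some of these checks could be run on a node, assuming a Linux host with /proc mounted. mem_pressure_snapshot is a hypothetical helper; it prints a few counters from /proc/meminfo (or any file in the same format) so they can be compared before and during a stall. The other commands are listed as comments because they need the sysstat and iproute2 packages and a real interface name.

```shell
#!/bin/sh
# Hypothetical helper: print free memory, dirty/writeback pages and slab
# usage (values in kB) from a meminfo-style file. Defaults to /proc/meminfo.
mem_pressure_snapshot() {
    awk '$1 == "MemFree:" || $1 == "Dirty:" || $1 == "Writeback:" || $1 == "Slab:" {
        print $1, $2
    }' "${1:-/proc/meminfo}"
}

# Related checks on a live host (eth0 is an assumed interface name):
#   iostat -x 1                          # per-device utilisation and queue size
#   cat /proc/sys/vm/min_free_kbytes     # reserved free-memory threshold
#   ip -s link show dev eth0             # RX/TX error and drop counters
```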

Unless you find something I suggest you try disabling offloads on the NICs and see if the problem goes away. 

Could you please elaborate on this point? How do you disable offloads on the NIC? What does it mean, how do I do it, and how is it going to help?

Sorry, I don't know about this.
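
For context: "offloads" are features such as TCP segmentation offload (TSO) and receive offload (GRO/LRO), where the NIC or its driver does part of the packet processing instead of the kernel's TCP/IP code. A driver or firmware bug in these paths can cause stalls, which is presumably why turning them off is suggested as a test. The sketch below assumes ethtool is installed; eth0 is a placeholder interface name and disable_offloads is a hypothetical wrapper. By default it only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical wrapper: print the ethtool command that turns off common
# offloads on an interface, or run it when DRY_RUN=0 is set.
disable_offloads() {
    iface=$1
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "ethtool -K $iface tso off gso off gro off lro off"
    else
        ethtool -K "$iface" tso off gso off gro off lro off
    fi
}

# Usage (eth0 is an assumed interface name):
#   ethtool -k eth0                  # show the current offload settings
#   disable_offloads eth0            # print the command only
#   DRY_RUN=0 disable_offloads eth0  # actually apply it (needs root)
```

Settings changed with ethtool -K do not survive a reboot, and some drivers mark certain features as fixed and will refuse to change them, so the result is worth checking with ethtool -k afterwards.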

- Vickey - 

Jan 

> On 08 Sep 2015, at 18:26, Lincoln Bryant < lincolnb@xxxxxxxxxxxx > wrote: 
> 
> For whatever it’s worth, my problem has returned and is very similar to yours. Still trying to figure out what’s going on over here. 
> 
> Performance is nice for a few seconds, then goes to 0. This is a similar setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc) 
> 
> 384 16 29520 29504 307.287 1188 0.0492006 0.208259 
> 385 16 29813 29797 309.532 1172 0.0469708 0.206731 
> 386 16 30105 30089 311.756 1168 0.0375764 0.205189 
> 387 16 30401 30385 314.009 1184 0.036142 0.203791 
> 388 16 30695 30679 316.231 1176 0.0372316 0.202355 
> 389 16 30987 30971 318.42 1168 0.0660476 0.200962 
> 390 16 31282 31266 320.628 1180 0.0358611 0.199548 
> 391 16 31568 31552 322.734 1144 0.0405166 0.198132 
> 392 16 31857 31841 324.859 1156 0.0360826 0.196679 
> 393 16 32090 32074 326.404 932 0.0416869 0.19549 
> 394 16 32205 32189 326.743 460 0.0251877 0.194896 
> 395 16 32302 32286 326.897 388 0.0280574 0.194395 
> 396 16 32348 32332 326.537 184 0.0256821 0.194157 
> 397 16 32385 32369 326.087 148 0.0254342 0.193965 
> 398 16 32424 32408 325.659 156 0.0263006 0.193763 
> 399 16 32445 32429 325.054 84 0.0233839 0.193655 
> 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: 0.193655 
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 
> 400 16 32445 32429 324.241 0 - 0.193655 
> 401 16 32445 32429 323.433 0 - 0.193655 
> 402 16 32445 32429 322.628 0 - 0.193655 
> 403 16 32445 32429 321.828 0 - 0.193655 
> 404 16 32445 32429 321.031 0 - 0.193655 
> 405 16 32445 32429 320.238 0 - 0.193655 
> 406 16 32445 32429 319.45 0 - 0.193655 
> 407 16 32445 32429 318.665 0 - 0.193655 
> 
> needless to say, very strange. 
> 
> —Lincoln 
> 
> 
>> On Sep 7, 2015, at 3:35 PM, Vickey Singh < vickey.singh22693@xxxxxxxxx > wrote: 
>> 
>> Adding ceph-users. 
>> 
>> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh < vickey.singh22693@xxxxxxxxx > wrote: 
>> 
>> 
>> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke < ulembke@xxxxxxxxxxxx > wrote: 
>> Hi Vickey, 
>> Thanks for your time in replying to my problem. 
>> 
>> I had the same rados bench output after changing the motherboard of the monitor node with the lowest IP... 
>> Due to the new mainboard, I assume the hw-clock was wrong during startup. Ceph health showed no errors, but none of the VMs were able to do IO (very high load on the VMs, but no traffic).
>> I stopped the mon, but this didn't change anything. I had to restart all the other mons to get IO again. After that I started the first mon as well (with the right time now) and everything worked fine again...
>> 
>> Thanks, I will try restarting all OSDs / MONs and report back if it solves my problem.
>> 
>> Another possibility:
>> Do you use journals on SSDs? Perhaps the SSDs can't write because of garbage collection?
>> 
>> No, I don't have journals on SSDs; they are on the same disks as the OSDs.
>> 
>> 
>> 
>> Udo 
>> 
>> 
>> On 07.09.2015 16:36, Vickey Singh wrote: 
>>> Dear Experts 
>>> 
>>> Can someone please help me understand why my cluster is not able to write data?
>>>
>>> See the output below: cur MB/s is 0 and avg MB/s is decreasing.
>>> 
>>> 
>>> Ceph Hammer 0.94.2 
>>> CentOS 6 (3.10.69-1) 
>>> 
>>> The Ceph status says ops are blocked. I have checked everything I know of:
>>>
>>> - System resources (CPU, net, disk, memory) -- all normal
>>> - 10G network for public and cluster network -- no saturation
>>> - All disks are physically healthy
>>> - No messages in /var/log/messages or dmesg
>>> - Tried restarting the OSDs that are blocking operations, but no luck
>>> - Tried writing through RBD and rados bench; both show the same problem
>>> 
>>> Please help me to fix this problem. 
>>> 
>>> # rados bench -p rbd 60 write 
>>> Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects 
>>> Object prefix: benchmark_data_stor1_1791844 
>>> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 
>>> 0 0 0 0 0 0 - 0 
>>> 1 16 125 109 435.873 436 0.022076 0.0697864 
>>> 2 16 139 123 245.948 56 0.246578 0.0674407 
>>> 3 16 139 123 163.969 0 - 0.0674407 
>>> 4 16 139 123 122.978 0 - 0.0674407 
>>> 5 16 139 123 98.383 0 - 0.0674407 
>>> 6 16 139 123 81.9865 0 - 0.0674407 
>>> 7 16 139 123 70.2747 0 - 0.0674407 
>>> 8 16 139 123 61.4903 0 - 0.0674407 
>>> 9 16 139 123 54.6582 0 - 0.0674407 
>>> 10 16 139 123 49.1924 0 - 0.0674407 
>>> 11 16 139 123 44.7201 0 - 0.0674407 
>>> 12 16 139 123 40.9934 0 - 0.0674407 
>>> 13 16 139 123 37.8401 0 - 0.0674407 
>>> 14 16 139 123 35.1373 0 - 0.0674407 
>>> 15 16 139 123 32.7949 0 - 0.0674407 
>>> 16 16 139 123 30.7451 0 - 0.0674407 
>>> 17 16 139 123 28.9364 0 - 0.0674407 
>>> 18 16 139 123 27.3289 0 - 0.0674407 
>>> 19 16 139 123 25.8905 0 - 0.0674407 
>>> 2015-09-07 15:54:52.694071min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407 
>>> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 
>>> 20 16 139 123 24.596 0 - 0.0674407 
>>> 21 16 139 123 23.4247 0 - 0.0674407 
>>> 22 16 139 123 22.36 0 - 0.0674407 
>>> 23 16 139 123 21.3878 0 - 0.0674407 
>>> 24 16 139 123 20.4966 0 - 0.0674407 
>>> 25 16 139 123 19.6768 0 - 0.0674407 
>>> 26 16 139 123 18.92 0 - 0.0674407 
>>> 27 16 139 123 18.2192 0 - 0.0674407 
>>> 28 16 139 123 17.5686 0 - 0.0674407 
>>> 29 16 139 123 16.9628 0 - 0.0674407 
>>> 30 16 139 123 16.3973 0 - 0.0674407 
>>> 31 16 139 123 15.8684 0 - 0.0674407 
>>> 32 16 139 123 15.3725 0 - 0.0674407 
>>> 33 16 139 123 14.9067 0 - 0.0674407 
>>> 34 16 139 123 14.4683 0 - 0.0674407 
>>> 35 16 139 123 14.0549 0 - 0.0674407 
>>> 36 16 139 123 13.6645 0 - 0.0674407 
>>> 37 16 139 123 13.2952 0 - 0.0674407 
>>> 38 16 139 123 12.9453 0 - 0.0674407 
>>> 39 16 139 123 12.6134 0 - 0.0674407 
>>> 2015-09-07 15:55:12.697124min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407 
>>> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 
>>> 40 16 139 123 12.2981 0 - 0.0674407 
>>> 41 16 139 123 11.9981 0 - 0.0674407 
>>> 
>>> 
>>> 
>>> 
>>> cluster 86edf8b8-b353-49f1-ab0a-a4827a9ea5e8 
>>> health HEALTH_WARN 
>>> 1 requests are blocked > 32 sec 
>>> monmap e3: 3 mons at {stor0111=10.100.1.111:6789/0,stor0113=10.100.1.113:6789/0,stor0115=10.100.1.115:6789/0}
>>> election epoch 32, quorum 0,1,2 stor0111,stor0113,stor0115 
>>> osdmap e19536: 50 osds: 50 up, 50 in 
>>> pgmap v928610: 2752 pgs, 9 pools, 30476 GB data, 4183 kobjects 
>>> 91513 GB used, 47642 GB / 135 TB avail 
>>> 2752 active+clean 
>>> 
>>> 
>>> Tried using RBD 
>>> 
>>> 
>>> # dd if=/dev/zero of=file1 bs=4K count=10000 oflag=direct 
>>> 10000+0 records in 
>>> 10000+0 records out 
>>> 40960000 bytes (41 MB) copied, 24.5529 s, 1.7 MB/s 
>>> 
>>> # dd if=/dev/zero of=file1 bs=1M count=100 oflag=direct 
>>> 100+0 records in 
>>> 100+0 records out 
>>> 104857600 bytes (105 MB) copied, 1.05602 s, 9.3 MB/s 
>>> 
>>> # dd if=/dev/zero of=file1 bs=1G count=1 oflag=direct 
>>> 1+0 records in 
>>> 1+0 records out 
>>> 1073741824 bytes (1.1 GB) copied, 293.551 s, 3.7 MB/s 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________ 
>>> ceph-users mailing list 
>>> 
>>> ceph-users@xxxxxxxxxxxxxx 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
>> 
>> 
> 
