Re: Ceph cluster performance degrade (radosgw) > after running some time

Guang Yang <xiaogang.yang@xxxxxxxxx> · Sat, 8 Feb 2014 00:35:07 -0800 (PST)

Thanks all for the help.

We finally identified the root cause of the issue was due to a lock contention happening at folder splitting and here is a tracking ticket (thanks Inktank for the fix!): http://tracker.ceph.com/issues/7207

Thanks,
Guang

     On Tuesday, December 31, 2013 8:22 AM, Guang Yang
 <yguang11@xxxxxxxxx> wrote:

  Thanks Wido, my comments inline...

>Date: Mon, 30 Dec 2013 14:04:35 +0100
>From: Wido den Hollander <wido@xxxxxxxx>
>To: ceph-users@xxxxxxxxxxxxxx
>Subject: Re:  Ceph cluster performance degrade (radosgw)
>    after running some time

>On 12/30/2013 12:45 PM, Guang wrote:
> Hi ceph-users and ceph-devel,
> Merry Christmas and Happy New Year!
>
> We have a ceph
 cluster with radosgw, our customer is using S3 API to
> access the cluster.
>
> The basic information of the cluster is:
> bash-4.1$ ceph -s
>    cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>    monmap e1: 3 mons at
> {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
> election epoch 40, quorum 0,1,2
 osd151,osd152,osd153
>    osdmap e129885: 787 osds: 758 up, 758 in
>      pgmap v1884502: 22203 pgs: 22125 active+clean, 1
> active+clean+scrubbing, 1 active+clean+inconsistent, 76
> active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
> TB avail
>    mdsmap e1: 0/0/1 up
>
> #When the latency peak happened, there was no scrubbing, recovering or
> backfilling at the moment.#
>
> While the performance of the cluster (only
 with WRITE traffic) is stable
> until Dec 25th, our monitoring (for radosgw access log) shows a
> significant increase of average latency and 99% latency.
>
> And then I chose one OSD and try to grep slow requests logs and find
> that most of the slow requests were waiting for subop, I take osd22 for
> example.
>
> osd[561-571] are hosted by osd22.
> -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
> grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
> ~/slow_osd.txt
> -bash-4.1$ cat ~/slow_osd.txt  | sort | uniq -c | sort ?nr
>    3586 656,598
>      289 467,629
>      284 598,763
>      279 584,598
>      203 172,598
>      182 598,6
>      155 629,646
>      83 631,598
>      65 631,593
>      21 616,629
>      20 609,671
>      20 609,390
>      13 609,254
>      12 702,629
>      12 629,641
>      11 665,613
>      11 593,724
>   
   11 361,591
>      10 591,709
>        9 681,609
>        9 609,595
>        9 591,772
>        8 613,662
>        8 575,591
>        7 674,722
>        7 609,603
>        6 585,605
>     
   5 613,691
>        5 293,629
>        4 774,591
>        4 717,591
>        4 613,776
>        4 538,629
>        4 485,629
>        3 702,641
>        3 608,629
>        3 593,580
>   
     3 591,676
>
> It turns out most of the slow requests were waiting for osd 598, 629, I
> ran the procedure on another host osd22 and got the same pattern.
>
> Then I turned to the host having osd598 and dump the perf counter to do
> comparision.
>
> -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
> /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
> op_latency,subop_latency,total_ops
> 0.192097526753471,0.0344513450167198,7549045
> 1.99137797628122,1.42198426157216,9184472
> 0.198062399664129,0.0387090378926376,6305973
> 0.621697271315762,0.396549768986993,9726679
> 29.5222496247375,18.2888846379615, 10860858
> 0.229250239525916,0.0557482067611005,8149691
> 0.208981698303654,0.0375553180438224,6623842
> 0.47474766302086,0.292583928601509,9838777
>
 0.339477790083925,0.101288409388438,9340212
> 0.186448840141895,0.0327296517417626,7081410
> 0.807598201207144,0.0139762289702332,6093531
> (osd 598 is op hotspot as well)
>
> This double confirmed that osd 598 was having some performance issues
> (it has around *30 seconds average op latency*!).
> sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the
> latency difference is not as significant as we saw from osd perf.
>
 reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 37.3    459.9    89.8    4106.9  61.8    1.6      12.2    4.9
> 42.3    545.8    91.8    4296.3  69.7    2.4      17.6    5.2
> 42.0    483.8    93.1    4263.6  68.8    1.8      13.3    5.1
> 39.7    425.5    89.4    4327.0  68.5    1.8      14.0    5.3
>
> Another disk at the same time for comparison (/dev/sdb).
> reads  kbread writes  kbwrite %busy  avgqu  await  svctm
> 34.2    502.6    80.1    3524.3    53.4    1.3    11.8      4.7
> 35.3    560.9    83.7    3742.0    56.0    1.2    9.8      4.7
> 30.4    371.5  78.8    3631.4    52.2    1.7    15.8    4.8
> 33.0    389.4  78.8      3597.6  54.2    1.4 

   12.1    4.8
>
> Any idea why a couple of OSDs are so slow that impact the performance of
> the entire cluster?
>

What filesystem are you using? Btrfs or XFS?

Btrfs still suffers from a performance degradation over time. So if you 
run btrfs, that might be the problem.

[yguang] We are running on xfs, journal and data share the same disk with different partitions.

Wido

> Thanks,

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com