Re: Ceph cluster performance degrade (radosgw) after running some time

Thanks Mark, my comments inline...

Date: Mon, 30 Dec 2013 07:36:56 -0600
From: Mark Nelson <mark.nelson@xxxxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Ceph cluster performance degrade (radosgw)
    after running some time

On 12/30/2013 05:45 AM, Guang wrote:
> Hi ceph-users and ceph-devel,
> Merry Christmas and Happy New Year!
>
> We have a ceph cluster with radosgw, and our customer uses the S3 API to
> access the cluster.
>
> The basic information of the cluster is:
> bash-4.1$ ceph -s
>    cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>    monmap e1: 3 mons at
> {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
> election epoch 40, quorum 0,1,2 osd151,osd152,osd153
>    osdmap e129885: 787 osds: 758 up, 758 in
>      pgmap v1884502: 22203 pgs: 22125 active+clean, 1
> active+clean+scrubbing, 1 active+clean+inconsistent, 76
> active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
> TB avail
>    mdsmap e1: 0/0/1 up
>
> #When the latency peak happened, there was no scrubbing, recovery or
> backfilling going on.#
>
> The performance of the cluster (with WRITE-only traffic) was stable until
> Dec 25th, when our monitoring (of the radosgw access logs) showed a
> significant increase in average latency and 99th-percentile latency.
>
> I then chose one OSD host and grepped its logs for slow requests, and found
> that most of the slow requests were waiting for subops. Take osd22 for
> example.
>
> osd[561-571] are hosted by osd22.
> -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
> grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
> ~/slow_osd.txt
> -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort -nr
>    3586 656,598
>      289 467,629
>      284 598,763
>      279 584,598
>      203 172,598
>      182 598,6
>      155 629,646
>      83 631,598
>      65 631,593
>      21 616,629
>      20 609,671
>      20 609,390
>      13 609,254
>      12 702,629
>      12 629,641
>      11 665,613
>      11 593,724
>      11 361,591
>      10 591,709
>        9 681,609
>        9 609,595
>        9 591,772
>        8 613,662
>        8 575,591
>        7 674,722
>        7 609,603
>        6 585,605
>        5 613,691
>        5 293,629
>        4 774,591
>        4 717,591
>        4 613,776
>        4 538,629
>        4 485,629
>        3 702,641
>        3 608,629
>        3 593,580
>        3 591,676
>
> It turns out that most of the slow requests were waiting for osds 598 and
> 629; I ran the same procedure on another host (besides osd22) and got the
> same pattern.
>
> Then I turned to the host having osd 598 and dumped its perf counters for
> comparison.
>
> -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
> /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
> op_latency,subop_latency,total_ops
> 0.192097526753471,0.0344513450167198,7549045
> 1.99137797628122,1.42198426157216,9184472
> 0.198062399664129,0.0387090378926376,6305973
> 0.621697271315762,0.396549768986993,9726679
> 29.5222496247375,18.2888846379615,10860858
> 0.229250239525916,0.0557482067611005,8149691
> 0.208981698303654,0.0375553180438224,6623842
> 0.47474766302086,0.292583928601509,9838777
> 0.339477790083925,0.101288409388438,9340212
> 0.186448840141895,0.0327296517417626,7081410
> 0.807598201207144,0.0139762289702332,6093531
> (osd 598 is op hotspot as well)
>
> This further confirmed that osd 598 was having performance issues (it had
> around *30 seconds average op latency*!).
> sar shows slightly higher disk I/O for osd 598's disk (/dev/sdf), but the
> latency difference is not as significant as what we saw from the OSD perf
> counters.
> reads  kbread  writes  kbwrite  %busy  avgqu  await  svctm
> 37.3   459.9   89.8    4106.9   61.8   1.6    12.2   4.9
> 42.3   545.8   91.8    4296.3   69.7   2.4    17.6   5.2
> 42.0   483.8   93.1    4263.6   68.8   1.8    13.3   5.1
> 39.7   425.5   89.4    4327.0   68.5   1.8    14.0   5.3
>
> Another disk at the same time for comparison (/dev/sdb).
> reads  kbread  writes  kbwrite  %busy  avgqu  await  svctm
> 34.2   502.6   80.1    3524.3   53.4   1.3    11.8   4.7
> 35.3   560.9   83.7    3742.0   56.0   1.2    9.8    4.7
> 30.4   371.5   78.8    3631.4   52.2   1.7    15.8   4.8
> 33.0   389.4   78.8    3597.6   54.2   1.4    12.1   4.8
>
> Any idea why a couple of OSDs are so slow that they impact the performance
> of the entire cluster?

You may want to use the dump_historic_ops command in the admin socket 
for the slow OSDs.  That will give you some clues regarding where the 
ops are hanging up in the OSD.  You can also crank the osd debugging way 
up on that node and search through the logs to see if there are any 
patterns or trends (consistent slowness, pauses, etc).  It may also be 
useful to look and see if that OSD is pegging CPU and if so attach 
strace or perf to it and see what it's doing.
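
Something along these lines would be a starting point (assuming the default
admin socket path, using osd.598 from the data above as the suspect, and
assuming the daemon was started as "ceph-osd -i 598" so the pgrep pattern
matches a single pid):

# dump the slowest recent ops the OSD has recorded
sudo ceph --admin-daemon /var/run/ceph/ceph-osd.598.asok dump_historic_ops

# check whether the ceph-osd process is pegging a core, then attach perf/strace
top -b -n 1 -p $(pgrep -f 'ceph-osd -i 598')
sudo perf top -p $(pgrep -f 'ceph-osd -i 598')
sudo strace -c -f -p $(pgrep -f 'ceph-osd -i 598')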
[yguang] We have a job that dumps historic ops, but unfortunately it wasn't running at the time (my bad). Also, since this is a pre-production system, the client is smart enough to stop itself when it detects such an increase in latency, so we were not able to inject the debug arguments and reproduce the problem.

In terms of CPU usage, this is the sar result from the host; there is an increase in CPU usage (especially user time) that matches the latency increase we observed.
              CPU Share (%)
Time          user  nice   sys  intr  idle iowait
12/25-11:05   13.4   0.0   4.6   0.0  80.7   1.2
12/25-11:10   13.3   0.0   4.7   0.0  80.4   1.5
12/25-11:15   13.9   0.0   4.9   0.0  79.5   1.7
12/25-11:20   12.7   0.0   4.5   0.0  81.7   1.0
[omit some items...]
12/25-14:45   20.3   0.0   6.0   0.0  71.3   2.5
12/25-14:50   19.2   0.0   6.0   0.0  72.2   2.5
12/25-14:55   20.1   0.0   6.1   0.0  71.6   2.2
12/25-15:00   20.6   0.0   6.1   0.0  70.5   2.8
12/25-15:05   19.9   0.0   6.0   0.0  71.6   2.5
12/25-15:10   20.2   0.0   6.1   0.0  71.2   2.5
12/25-15:15   21.5   0.0   6.4   0.0  69.5   2.6
12/25-15:20   21.5   0.0   6.3   0.0  69.4   2.7
12/25-15:25   19.2   0.0   5.8   0.0  71.7   3.3
12/25-15:30   21.7   0.0   6.6   0.0  69.0   2.7
[omit some items...]
12/25-16:50   20.6   0.0   6.4   0.0  69.8   3.1
12/25-16:55   20.4   0.0   6.3   0.0  70.3   3.0
12/25-17:00   21.4   0.0   6.5   0.0  68.9   3.2
12/25-17:05   22.0   0.0   6.4   0.0  68.7   2.9
12/25-17:10   22.0   0.0   6.5   0.0  68.5   3.0
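
To tie that host-level increase to the OSD daemon itself, something like
pidstat (from the sysstat package; the pgrep pattern is again an assumption
about how the daemon was started) could break CPU usage down per process,
e.g. for osd 598:

# per-process CPU usage for the suspect OSD, sampled every 60 seconds
pidstat -u -p $(pgrep -f 'ceph-osd -i 598') 60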

We may generate artificial load to see if the problem can be reproduced with higher log levels; any other suggestions on what I could try, or information I could provide, to locate the issue would be much appreciated. Thanks very much.
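
If we do manage to reproduce it under artificial load, I was thinking of
raising the debug levels on the suspect OSD with something like the following
(osd.598 as an example; the levels, and the values they are reset to, are just
guesses at what would be useful):

# raise debug levels on the suspect OSD while driving the load
ceph tell osd.598 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
# ... reproduce, then inspect /var/log/ceph/ceph-osd.598.log on its host ...
# afterwards, put the levels back to whatever the cluster normally runs with, e.g.
ceph tell osd.598 injectargs '--debug-osd 0/5 --debug-filestore 1/3 --debug-ms 0/5'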

Normally in this situation I'd say to be wary of the disk going bad, but 
in this case it may be something else.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
