Re: Ceph cluster performance degrade (radosgw) after running some time

Wido den Hollander <wido@xxxxxxxx> · Mon, 30 Dec 2013 14:04:35 +0100

On 12/30/2013 12:45 PM, Guang wrote:
Hi ceph-users and ceph-devel,
Merry Christmas and Happy New Year!

We have a ceph cluster with radosgw, our customer is using S3 API to
access the cluster.

The basic information of the cluster is:
bash-4.1$ ceph -s
   cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
    health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
    monmap e1: 3 mons at
{osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
election epoch 40, quorum 0,1,2 osd151,osd152,osd153
    osdmap e129885: 787 osds: 758 up, 758 in
     pgmap v1884502: 22203 pgs: 22125 active+clean, 1
active+clean+scrubbing, 1 active+clean+inconsistent, 76
active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
TB avail
    mdsmap e1: 0/0/1 up

#When the latency peak happened, there was no scrubbing, recovering or
backfilling at the moment.#

While the performance of the cluster (only with WRITE traffic) is stable
until Dec 25th, our monitoring (for radosgw access log) shows a
significant increase of average latency and 99% latency.

And then I chose one OSD and try to grep slow requests logs and find
that most of the slow requests were waiting for subop, I take osd22 for
example.

osd[561-571] are hosted by osd22.
-bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
~/slow_osd.txt
-bash-4.1$ cat ~/slow_osd.txt  | sort | uniq -c | sort –nr
    3586 656,598
     289 467,629
     284 598,763
     279 584,598
     203 172,598
     182 598,6
     155 629,646
      83 631,598
      65 631,593
      21 616,629
      20 609,671
      20 609,390
      13 609,254
      12 702,629
      12 629,641
      11 665,613
      11 593,724
      11 361,591
      10 591,709
       9 681,609
       9 609,595
       9 591,772
       8 613,662
       8 575,591
       7 674,722
       7 609,603
       6 585,605
       5 613,691
       5 293,629
       4 774,591
       4 717,591
       4 613,776
       4 538,629
       4 485,629
       3 702,641
       3 608,629
       3 593,580
       3 591,676

It turns out most of the slow requests were waiting for osd 598, 629, I
ran the procedure on another host osd22 and got the same pattern.

Then I turned to the host having osd598 and dump the perf counter to do
comparision.

-bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
/var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
op_latency,subop_latency,total_ops
0.192097526753471,0.0344513450167198,7549045
1.99137797628122,1.42198426157216,9184472
0.198062399664129,0.0387090378926376,6305973
0.621697271315762,0.396549768986993,9726679
29.5222496247375,18.2888846379615, 10860858
0.229250239525916,0.0557482067611005,8149691
0.208981698303654,0.0375553180438224,6623842
0.47474766302086,0.292583928601509,9838777
0.339477790083925,0.101288409388438,9340212
0.186448840141895,0.0327296517417626,7081410
0.807598201207144,0.0139762289702332,6093531
(osd 598 is op hotspot as well)

This double confirmed that osd 598 was having some performance issues
(it has around *30 seconds average op latency*!).
sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the
latency difference is not as significant as we saw from osd perf.
reads   kbread writes  kbwrite %busy  avgqu  await  svctm
37.3    459.9    89.8     4106.9   61.8     1.6       12.2    4.9
42.3    545.8    91.8     4296.3   69.7     2.4       17.6    5.2
42.0    483.8    93.1     4263.6   68.8     1.8       13.3    5.1
39.7    425.5    89.4     4327.0   68.5     1.8       14.0    5.3

Another disk at the same time for comparison (/dev/sdb).
reads   kbread writes  kbwrite %busy  avgqu  await  svctm
34.2    502.6    80.1    3524.3    53.4     1.3     11.8      4.7
35.3    560.9    83.7    3742.0    56.0     1.2     9.8       4.7
30.4    371.5   78.8     3631.4    52.2     1.7     15.8     4.8
33.0    389.4   78.8      3597.6   54.2     1.4      12.1    4.8

Any idea why a couple of OSDs are so slow that impact the performance of
the entire cluster?

What filesystem are you using? Btrfs or XFS?

Btrfs still suffers from a performance degradation over time. So if you 
run btrfs, that might be the problem.

Wido

Thanks,
Guang

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com