Thanks all for the help.
We finally identified the root cause of the issue was due to a lock contention happening at folder splitting and here is a tracking ticket (thanks Inktank for the fix!): http://tracker.ceph.com/issues/7207
Thanks,
Guang
On Tuesday, December 31, 2013 8:22 AM, Guang Yang
<yguang11@xxxxxxxxx> wrote:
Thanks Wido, my comments inline...
>Date: Mon, 30 Dec 2013 14:04:35 +0100
>From: Wido den Hollander <wido@xxxxxxxx>
>To: ceph-users@xxxxxxxxxxxxxx
>Subject: Re: Ceph cluster performance degrade (radosgw)
> after running some time
>On 12/30/2013 12:45 PM, Guang wrote:
> Hi ceph-users and ceph-devel,
> Merry Christmas and Happy New Year!
>
> We have a ceph cluster with radosgw, our customer is using S3 API to
> access the cluster.
>
> The basic information of the cluster is:
> bash-4.1$ ceph -s
> cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
> health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> monmap e1: 3 mons at
> {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
> election epoch 40, quorum 0,1,2 osd151,osd152,osd153
> osdmap e129885: 787 osds: 758 up, 758 in
> pgmap v1884502: 22203 pgs: 22125 active+clean, 1
> active+clean+scrubbing, 1 active+clean+inconsistent, 76
> active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
> TB avail
> mdsmap e1: 0/0/1 up
>
> #When the latency peak happened, there was no scrubbing, recovering or
> backfilling at the moment.#
>
> While the performance of the cluster (only with WRITE traffic) is stable
> until Dec 25th, our monitoring (for radosgw access log) shows a
> significant increase of average latency and 99% latency.
>
> And then I chose one OSD and try to grep slow requests logs and find
> that most of the slow requests were waiting for subop, I take osd22 for
> example.
>
> osd[561-571] are hosted by osd22.
> -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
> grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
> ~/slow_osd.txt
> -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort ?nr
> 3586 656,598
> 289 467,629
> 284 598,763
> 279 584,598
> 203 172,598
> 182 598,6
> 155 629,646
> 83 631,598
> 65 631,593
> 21 616,629
> 20 609,671
> 20 609,390
> 13 609,254
> 12 702,629
> 12 629,641
> 11 665,613
> 11 593,724
> 11 361,591
> 10 591,709
> 9 681,609
> 9 609,595
> 9 591,772
> 8 613,662
> 8 575,591
> 7 674,722
> 7 609,603
> 6 585,605
> 5 613,691
> 5 293,629
> 4 774,591
> 4 717,591
> 4 613,776
> 4 538,629
> 4 485,629
> 3 702,641
> 3 608,629
> 3 593,580
> 3 591,676
>
> It turns out most of the slow requests were waiting for osd 598, 629, I
> ran the procedure on another host osd22 and got the same pattern.
>
> Then I turned to the host having osd598 and dump the perf counter to do
> comparision.
>
> -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
> /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
> op_latency,subop_latency,total_ops
> 0.192097526753471,0.0344513450167198,7549045
> 1.99137797628122,1.42198426157216,9184472
> 0.198062399664129,0.0387090378926376,6305973
> 0.621697271315762,0.396549768986993,9726679
> 29.5222496247375,18.2888846379615, 10860858
> 0.229250239525916,0.0557482067611005,8149691
> 0.208981698303654,0.0375553180438224,6623842
> 0.47474766302086,0.292583928601509,9838777
> 0.339477790083925,0.101288409388438,9340212
> 0.186448840141895,0.0327296517417626,7081410
> 0.807598201207144,0.0139762289702332,6093531
> (osd 598 is op hotspot as well)
>
> This double confirmed that osd 598 was having some performance issues
> (it has around *30 seconds average op latency*!).
> sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the
> latency difference is not as significant as we saw from osd perf.
> reads kbread writes kbwrite %busy avgqu await svctm
> 37.3 459.9 89.8 4106.9 61.8 1.6 12.2 4.9
> 42.3 545.8 91.8 4296.3 69.7 2.4 17.6 5.2
> 42.0 483.8 93.1 4263.6 68.8 1.8 13.3 5.1
> 39.7 425.5 89.4 4327.0 68.5 1.8 14.0 5.3
>
> Another disk at the same time for comparison (/dev/sdb).
> reads kbread writes kbwrite %busy avgqu await svctm
> 34.2 502.6 80.1 3524.3 53.4 1.3 11.8 4.7
> 35.3 560.9 83.7 3742.0 56.0 1.2 9.8 4.7
> 30.4 371.5 78.8 3631.4 52.2 1.7 15.8 4.8
> 33.0 389.4 78.8 3597.6 54.2 1.4 12.1 4.8
>
> Any idea why a couple of OSDs are so slow that impact the performance of
> the entire cluster?
>
What filesystem are you using? Btrfs or XFS?
Btrfs still suffers from a performance degradation over time. So if you
run btrfs, that might be the problem.
[yguang] We are running on xfs, journal and data share the same disk with different partitions.
Wido
> Thanks,
>From: Wido den Hollander <wido@xxxxxxxx>
>To: ceph-users@xxxxxxxxxxxxxx
>Subject: Re: Ceph cluster performance degrade (radosgw)
> after running some time
>On 12/30/2013 12:45 PM, Guang wrote:
> Hi ceph-users and ceph-devel,
> Merry Christmas and Happy New Year!
>
> We have a ceph cluster with radosgw, our customer is using S3 API to
> access the cluster.
>
> The basic information of the cluster is:
> bash-4.1$ ceph -s
> cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
> health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> monmap e1: 3 mons at
> {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0},
> election epoch 40, quorum 0,1,2 osd151,osd152,osd153
> osdmap e129885: 787 osds: 758 up, 758 in
> pgmap v1884502: 22203 pgs: 22125 active+clean, 1
> active+clean+scrubbing, 1 active+clean+inconsistent, 76
> active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065
> TB avail
> mdsmap e1: 0/0/1 up
>
> #When the latency peak happened, there was no scrubbing, recovering or
> backfilling at the moment.#
>
> While the performance of the cluster (only with WRITE traffic) is stable
> until Dec 25th, our monitoring (for radosgw access log) shows a
> significant increase of average latency and 99% latency.
>
> And then I chose one OSD and try to grep slow requests logs and find
> that most of the slow requests were waiting for subop, I take osd22 for
> example.
>
> osd[561-571] are hosted by osd22.
> -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log |
> grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done >
> ~/slow_osd.txt
> -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort ?nr
> 3586 656,598
> 289 467,629
> 284 598,763
> 279 584,598
> 203 172,598
> 182 598,6
> 155 629,646
> 83 631,598
> 65 631,593
> 21 616,629
> 20 609,671
> 20 609,390
> 13 609,254
> 12 702,629
> 12 629,641
> 11 665,613
> 11 593,724
> 11 361,591
> 10 591,709
> 9 681,609
> 9 609,595
> 9 591,772
> 8 613,662
> 8 575,591
> 7 674,722
> 7 609,603
> 6 585,605
> 5 613,691
> 5 293,629
> 4 774,591
> 4 717,591
> 4 613,776
> 4 538,629
> 4 485,629
> 3 702,641
> 3 608,629
> 3 593,580
> 3 591,676
>
> It turns out most of the slow requests were waiting for osd 598, 629, I
> ran the procedure on another host osd22 and got the same pattern.
>
> Then I turned to the host having osd598 and dump the perf counter to do
> comparision.
>
> -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon
> /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done
> op_latency,subop_latency,total_ops
> 0.192097526753471,0.0344513450167198,7549045
> 1.99137797628122,1.42198426157216,9184472
> 0.198062399664129,0.0387090378926376,6305973
> 0.621697271315762,0.396549768986993,9726679
> 29.5222496247375,18.2888846379615, 10860858
> 0.229250239525916,0.0557482067611005,8149691
> 0.208981698303654,0.0375553180438224,6623842
> 0.47474766302086,0.292583928601509,9838777
> 0.339477790083925,0.101288409388438,9340212
> 0.186448840141895,0.0327296517417626,7081410
> 0.807598201207144,0.0139762289702332,6093531
> (osd 598 is op hotspot as well)
>
> This double confirmed that osd 598 was having some performance issues
> (it has around *30 seconds average op latency*!).
> sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the
> latency difference is not as significant as we saw from osd perf.
> reads kbread writes kbwrite %busy avgqu await svctm
> 37.3 459.9 89.8 4106.9 61.8 1.6 12.2 4.9
> 42.3 545.8 91.8 4296.3 69.7 2.4 17.6 5.2
> 42.0 483.8 93.1 4263.6 68.8 1.8 13.3 5.1
> 39.7 425.5 89.4 4327.0 68.5 1.8 14.0 5.3
>
> Another disk at the same time for comparison (/dev/sdb).
> reads kbread writes kbwrite %busy avgqu await svctm
> 34.2 502.6 80.1 3524.3 53.4 1.3 11.8 4.7
> 35.3 560.9 83.7 3742.0 56.0 1.2 9.8 4.7
> 30.4 371.5 78.8 3631.4 52.2 1.7 15.8 4.8
> 33.0 389.4 78.8 3597.6 54.2 1.4 12.1 4.8
>
> Any idea why a couple of OSDs are so slow that impact the performance of
> the entire cluster?
>
What filesystem are you using? Btrfs or XFS?
Btrfs still suffers from a performance degradation over time. So if you
run btrfs, that might be the problem.
[yguang] We are running on xfs, journal and data share the same disk with different partitions.
Wido
> Thanks,
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com