On Sun, Aug 13, 2023 at 10:19 PM J David <j.david.lists@xxxxxxxxx> wrote:
>
> I have a cluster that is performing *very* poorly.
>
> It has 12 physical servers and 72 400GB SATA Intel DC mixed-use SSD OSDs (S3700's & S3710's). All servers have bonded 10GB NICs for 20Gbps aggregate networking to a dedicated switch. At least one CPU core and 8GB RAM per OSD. No SMART errors. All SSDs have media wearout indicators >90% remaining.
>
> The cluster is Quincy 17.2.6 under cephadm. (It was recently upgraded from Octopus.) Its primary use is serving RBD VM block devices for Proxmox, but it is a standalone cluster. (The RBD bench above was run from one of the mon servers, so Proxmox is not at all involved here.)
>
> The rados bench for this is... problematic:
>
> gridfury@f13:~$ rados bench -p docker 60 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
> Object prefix: benchmark_data_f13_857310
> sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
> 0 0 0 0 0 0 - 0
> 1 16 220 204 815.924 816 0.0339855 0.0617994
> 2 16 252 236 471.942 128 0.0360448 0.0584063
> 3 16 431 415 553.261 716 0.0544013 0.11399
> 4 16 533 517 516.93 408 0.0536192 0.103342
> 5 16 553 537 429.541 80 0.0465437 0.120551
> 6 16 675 659 439.271 488 0.0390309 0.13919
> 7 16 704 688 393.087 116 0.0344105 0.134758
> 8 16 935 919 459.433 924 0.0443338 0.138301
> 9 16 1167 1151 511.48 928 0.0407041 0.124401
> 10 16 1219 1203 481.129 208 0.0408885 0.121026
> 11 16 1219 1203 437.389 0 - 0.121026
> 12 16 1333 1317 438.934 228 0.0436965 0.141108
> 13 16 1333 1317 405.169 0 - 0.141108
> 14 16 1333 1317 376.228 0 - 0.141108
> 15 16 1333 1317 351.146 0 - 0.141108
> 16 16 1333 1317 329.199 0 - 0.141108
> 17 16 1383 1367 321.597 40 0.0738412 0.198283
> 18 16 1621 1605 356.612 952 0.0396665 0.179042
> 19 16 1880 1864 392.361 1036 0.0349097 0.162829
> 2023-08-14T01:27:04.758111+0000 min lat: 0.0238844 max lat: 5.29731 avg lat: 0.152446
> sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
> 20 16 2111 2095 418.936 924 0.0581478 0.152446
> 21 16 2352 2336 444.881 964 0.0639009 0.143467
> 22 16 2592 2576 468.288 960 0.0921614 0.136478
> 23 16 2828 2812 488.965 944 0.0587983 0.130606
> 24 16 3050 3034 505.586 888 0.106586 0.126353
> 25 16 3167 3151 504.08 468 0.0606541 0.126778
> 26 16 3408 3392 521.763 964 0.0356258 0.12252
> 27 16 3644 3628 537.395 944 0.045652 0.118972
> 28 16 3844 3828 546.769 800 0.0328637 0.116728
> 29 16 3870 3854 531.501 104 0.0418716 0.11621
> 30 16 3870 3854 513.784 0 - 0.11621
> 31 16 3870 3854 497.21 0 - 0.11621
> 32 16 3870 3854 481.673 0 - 0.11621
> 33 16 3870 3854 467.076 0 - 0.11621
> 34 16 3870 3854 453.339 0 - 0.11621
> 35 16 3870 3854 440.386 0 - 0.11621
> 36 16 3870 3854 428.152 0 - 0.11621
> 37 16 3870 3854 416.581 0 - 0.11621
> 38 16 3870 3854 405.618 0 - 0.11621
> 39 16 3870 3854 395.218 0 - 0.11621
> 2023-08-14T01:27:24.761559+0000 min lat: 0.0238844 max lat: 5.29731 avg lat: 0.11621
> sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
> 40 16 3870 3854 385.337 0 - 0.11621
> 41 16 3870 3854 375.938 0 - 0.11621
> 42 16 3870 3854 366.987 0 - 0.11621
> 43 16 3870 3854 358.452 0 - 0.11621
> 44 16 3870 3854 350.306 0 - 0.11621
> 45 16 3870 3854 342.521 0 - 0.11621
> 46 16 3870 3854 335.075 0 - 0.11621
> 47 16 3870 3854 327.946 0 - 0.11621
> 48 16 3870 3854 321.114 0 - 0.11621
> 49 16 3870 3854 314.56 0 - 0.11621
> 50 16 3870 3854 308.269 0 - 0.11621
> 51 16 3870 3854 302.224 0 - 0.11621
> 52 16 3870 3854 296.412 0 - 0.11621
> 53 16 3870 3854 290.82 0 - 0.11621
> 54 16 3870 3854 285.434 0 - 0.11621
> 55 16 3870 3854 280.244 0 - 0.11621
> 56 16 3870 3854 275.24 0 - 0.11621
> 57 16 3870 3854 270.411 0 - 0.11621
> 58 16 3870 3854 265.749 0 - 0.11621
> 59 16 3870 3854 261.245 0 - 0.11621
> 2023-08-14T01:27:44.765065+0000 min lat: 0.0238844 max lat: 5.29731 avg lat: 0.11621
> sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
> 60 16 3870 3854 256.89 0 - 0.11621
> 61 16 3870 3854 252.679 0 - 0.11621
> 62 16 3870 3854 248.603 0 - 0.11621
> 63 16 3870 3854 244.657 0 - 0.11621
> 64 16 3870 3854 240.835 0 - 0.11621
> 65 16 3870 3854 237.129 0 - 0.11621
> Total time run: 65.6256
> Total writes made: 3870
> Write size: 4194304
> Object size: 4194304
> Bandwidth (MB/sec): 235.884
> Stddev Bandwidth: 375.227
> Max bandwidth (MB/sec): 1036
> Min bandwidth (MB/sec): 0
> Average IOPS: 58
> Stddev IOPS: 93.8068
> Max IOPS: 259
> Min IOPS: 0
> Average Latency(s): 0.271253
> Stddev Latency(s): 2.43809
> Max latency(s): 37.7948
> Min latency(s): 0.0238844
>
> 58 IOPs average is... less than I might hope for from this hardware. But those long strings of 0's... I feel like 0 IOPS is inarguably too low.
>
> At the time I ran this, I had ceph status running and saw +laggy pgs and actually caught a message:
>
> 36 slow ops, oldest one blocked for 37 sec, daemons [osd.10,osd.12,osd.13,osd.14,osd.15,osd.17,osd.2,osd.25,osd.28,osd.3]...
>
> So I checked a few osd logs; there is nothing during the test, but several have stuff immediately after like this:
>
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=107797/107798 n=826 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 luod=0'0 lua=107760'143777660 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 active mbc={}] start_peering_interval up [48,2] -> [2], acting [48,2] -> [2], acting_primary 48 -> 2, up_primary 48 -> 2, role 1 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=107797/107798 n=826 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=107797/107798 n=4 ec=98376/98376 lis/c=107797/107797 les/c/f=107798/107798/0 sis=110829) [2,14] r=0 lpr=110829 pi=[107797,110829)/1 crt=98878'16 lcod 0'0 mlcod 0'0 active mbc={}] start_peering_interval up [2,48,14] -> [2,14], acting [2,48,14] -> [2,14], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=107797/107798 n=4 ec=98376/98376 lis/c=107797/107797 les/c/f=107798/107798/0 sis=110829) [2,14] r=0 lpr=110829 pi=[107797,110829)/1 crt=98878'16 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.30c( v 110828'58472450
> (107939'58468576,110828'58472450] local-lis/les=107797/107798 n=840 ec=1/1 lis/c=107797/107797 les/c/f=107798/107800/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 luod=110828'58472444 crt=110828'58472450 lcod 110828'58472443 mlcod 110828'58472443 active mbc={255={}}] start_peering_interval up [2,48] -> [2], acting [2,48] -> [2], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=107797/107798 n=831 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 luod=110828'84969917 crt=110828'84969926 lcod 110828'84969916 mlcod 110828'84969916 active mbc={255={}}] start_peering_interval up [2,48] -> [2], acting [2,48] -> [2], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.30c( v 110828'58472450 (107939'58468576,110828'58472450] local-lis/les=107797/107798 n=840 ec=1/1 lis/c=107797/107797 les/c/f=107798/107800/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 crt=110828'58472450 lcod 110828'58472443 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=107797/107798 n=831 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 crt=110828'84969926 lcod 110828'84969916 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
> Aug 14 01:27:57 f13 ceph-osd[651856]: --2- [v2:192.168.2.213:6804/3119259815,v1:192.168.2.213:6805/3119259815] >> [v2:192.168.2.225:6808/1596776081,v1:192.168.2.225:6809/1596776081] conn(0x55c3dc924400 0x55c3b8555b80 crc :-1 s=SESSION_ACCEPTING pgs=167 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_reconnect no existing connection exists, reseting client
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=110829/110830 n=826 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.513665199s) [48,2] r=1 lpr=110833 pi=[107797,110833)/1 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 active pruub 22348.017578125s@ mbc={}] start_peering_interval up [2] -> [48,2], acting [2] -> [48,2], acting_primary 2 -> 48, up_primary 2 -> 48, role 0 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=110829/110830 n=826 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.513574600s) [48,2] r=1 lpr=110833 pi=[107797,110833)/1 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 unknown NOTIFY pruub 22348.017578125s@ mbc={}] state<Start>: transitioning to Stray
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=110829/110830 n=4 ec=98376/98376 lis/c=110829/107797 les/c/f=110830/107798/0 sis=110833 pruub=15.519681931s) [2,48,14] r=0 lpr=110833 pi=[107797,110833)/1 crt=98878'16 lcod 0'0 mlcod 0'0 active pruub 22348.023437500s@ mbc={}] start_peering_interval up [2,14] -> [2,48,14], acting [2,14] -> [2,48,14], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=110829/110830 n=4 ec=98376/98376 lis/c=110829/107797 les/c/f=110830/107798/0 sis=110833 pruub=15.519681931s) [2,48,14] r=0 lpr=110833 pi=[107797,110833)/1 crt=98878'16 lcod 0'0 mlcod 0'0 unknown pruub 22348.023437500s@ mbc={}] state<Start>: transitioning to Primary
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=110829/110830 n=831 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.512885094s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'84969926 lcod 110828'84969916 mlcod 0'0 active pruub 22348.017578125s@ mbc={}] start_peering_interval up [2] -> [2,48], acting [2] -> [2,48], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=110829/110830 n=831 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.512885094s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'84969926 lcod 110828'84969916 mlcod 0'0 unknown pruub 22348.017578125s@ mbc={}] state<Start>: transitioning to Primary
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.30c( v 110828'58472450 (107939'58468576,110828'58472450] local-lis/les=110829/110830 n=840 ec=1/1 lis/c=110829/107797 les/c/f=110830/107800/43927 sis=110833 pruub=15.512117386s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'58472450 lcod 110828'58472443 mlcod 0'0 active pruub 22348.017578125s@ mbc={}] start_peering_interval up [2] -> [2,48], acting [2] -> [2,48], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.30c( v 110828'58472450 (107939'58468576,110828'58472450] local-lis/les=110829/110830 n=840 ec=1/1 lis/c=110829/107797 les/c/f=110830/107800/43927 sis=110833 pruub=15.512117386s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'58472450 lcod 110828'58472443 mlcod 0'0 unknown pruub 22348.017578125s@ mbc={}] state<Start>: transitioning to Primary
>
> I also captured iostat -x 5 from several of the OSDs during the test, including one of the ones identified as having blocked IO. It shows 3-15% utilized with 30-200 IOPS (mostly write). r_await average 0.22, w_await average 0.46.
>
> Nothing in dmesg or syslog.
>
> Is there anything to be done to further investigate or remediate this? This cluster is pretty old and I'm starting to think we should just scrap it as hopeless. But that's a lot of hardware to write off if there's any possible way to fix this.
>
> Thanks for any advice!
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

Hi,

Using that command, you'd be doing *4MiB I/Os*. So 232MiB/s or ~2Gbps *from a single client*.
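For an IOPS-oriented test, something along these lines is more representative (a rough sketch only: it reuses the 'docker' pool from your run above, and the 4 KiB block size and queue depth of 64 are just example values to tune):

  # small-block writes, 64 in flight; keep the objects so a read pass can follow
  rados bench -p docker 60 write -b 4096 -t 64 --no-cleanup

  # random small reads against the objects written above
  rados bench -p docker 60 rand -t 64

  # remove the benchmark objects when finished
  rados -p docker cleanup

(If you run several instances in parallel from different client hosts, give each its own --run-name so they don't collide.)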
You might want to try smaller I/O sizes and more clients like that to assess cluster performance if you are interested in high-IOPS workloads.

The logs you pasted show the OSDs are peering; when they do that, cluster performance will drop considerably, especially if there is subsequent backfilling. Did the OSD processes crash? They will not just peer like that for no reason.

The slow OSDs look like they could all be adjacent. Did you inspect `ceph osd tree` to see whether they all lie under the same host? Do you have network issues going from your client to that host?

Also, 1 CPU core per OSD is definitely undersized. I'm not sure how much CPU you actually have, but you want at least a couple of cores per OSD for SATA SSDs, and even more for NVMe, especially when it comes to small-block write workloads.

Cheers,
Tyler
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx