I have a cluster that is performing *very* poorly. It has 12 physical servers and 72 OSDs on 400GB Intel DC mixed-use SATA SSDs (S3700s and S3710s). All servers have bonded 10Gb NICs for 20Gbps aggregate networking to a dedicated switch, and there is at least one CPU core and 8GB RAM per OSD. No SMART errors, and all SSDs have media wearout indicators >90% remaining. The cluster is Quincy 17.2.6 under cephadm. (It was recently upgraded from Octopus.) Its primary use is serving RBD VM block devices for Proxmox, but it is a standalone cluster. (The rados bench below was run from one of the mon servers, so Proxmox is not involved at all here.) The results are... problematic:

gridfury@f13:~$ rados bench -p docker 60 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
Object prefix: benchmark_data_f13_857310
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       220       204   815.924       816   0.0339855   0.0617994
    2      16       252       236   471.942       128   0.0360448   0.0584063
    3      16       431       415   553.261       716   0.0544013     0.11399
    4      16       533       517    516.93       408   0.0536192    0.103342
    5      16       553       537   429.541        80   0.0465437    0.120551
    6      16       675       659   439.271       488   0.0390309     0.13919
    7      16       704       688   393.087       116   0.0344105    0.134758
    8      16       935       919   459.433       924   0.0443338    0.138301
    9      16      1167      1151    511.48       928   0.0407041    0.124401
   10      16      1219      1203   481.129       208   0.0408885    0.121026
   11      16      1219      1203   437.389         0           -    0.121026
   12      16      1333      1317   438.934       228   0.0436965    0.141108
   13      16      1333      1317   405.169         0           -    0.141108
   14      16      1333      1317   376.228         0           -    0.141108
   15      16      1333      1317   351.146         0           -    0.141108
   16      16      1333      1317   329.199         0           -    0.141108
   17      16      1383      1367   321.597        40   0.0738412    0.198283
   18      16      1621      1605   356.612       952   0.0396665    0.179042
   19      16      1880      1864   392.361      1036   0.0349097    0.162829
2023-08-14T01:27:04.758111+0000 min lat: 0.0238844 max lat: 5.29731 avg lat: 0.152446
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20      16      2111      2095   418.936       924   0.0581478    0.152446
   21      16      2352      2336   444.881       964   0.0639009    0.143467
   22      16      2592      2576   468.288       960   0.0921614    0.136478
   23      16      2828      2812   488.965       944   0.0587983    0.130606
   24      16      3050      3034   505.586       888    0.106586    0.126353
   25      16      3167      3151    504.08       468   0.0606541    0.126778
   26      16      3408      3392   521.763       964   0.0356258     0.12252
   27      16      3644      3628   537.395       944    0.045652    0.118972
   28      16      3844      3828   546.769       800   0.0328637    0.116728
   29      16      3870      3854   531.501       104   0.0418716     0.11621
   30      16      3870      3854   513.784         0           -     0.11621
   31      16      3870      3854    497.21         0           -     0.11621
   32      16      3870      3854   481.673         0           -     0.11621
   33      16      3870      3854   467.076         0           -     0.11621
   34      16      3870      3854   453.339         0           -     0.11621
   35      16      3870      3854   440.386         0           -     0.11621
   36      16      3870      3854   428.152         0           -     0.11621
   37      16      3870      3854   416.581         0           -     0.11621
   38      16      3870      3854   405.618         0           -     0.11621
   39      16      3870      3854   395.218         0           -     0.11621
2023-08-14T01:27:24.761559+0000 min lat: 0.0238844 max lat: 5.29731 avg lat: 0.11621
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   40      16      3870      3854   385.337         0           -     0.11621
   41      16      3870      3854   375.938         0           -     0.11621
   42      16      3870      3854   366.987         0           -     0.11621
   43      16      3870      3854   358.452         0           -     0.11621
   44      16      3870      3854   350.306         0           -     0.11621
   45      16      3870      3854   342.521         0           -     0.11621
   46      16      3870      3854   335.075         0           -     0.11621
   47      16      3870      3854   327.946         0           -     0.11621
   48      16      3870      3854   321.114         0           -     0.11621
   49      16      3870      3854    314.56         0           -     0.11621
   50      16      3870      3854   308.269         0           -     0.11621
   51      16      3870      3854   302.224         0           -     0.11621
   52      16      3870      3854   296.412         0           -     0.11621
   53      16      3870      3854    290.82         0           -     0.11621
   54      16      3870      3854   285.434         0           -     0.11621
   55      16      3870      3854   280.244         0           -     0.11621
   56      16      3870      3854    275.24         0           -     0.11621
   57      16      3870      3854   270.411         0           -     0.11621
   58      16      3870      3854   265.749         0           -     0.11621
   59      16      3870      3854   261.245         0           -     0.11621
2023-08-14T01:27:44.765065+0000 min lat: 0.0238844 max lat: 5.29731 avg lat: 0.11621
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      16      3870      3854    256.89         0           -     0.11621
   61      16      3870      3854   252.679         0           -     0.11621
   62      16      3870      3854   248.603         0           -     0.11621
   63      16      3870      3854   244.657         0           -     0.11621
   64      16      3870      3854   240.835         0           -     0.11621
   65      16      3870      3854   237.129         0           -     0.11621
Total time run:         65.6256
Total writes made:      3870
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     235.884
Stddev Bandwidth:       375.227
Max bandwidth (MB/sec): 1036
Min bandwidth (MB/sec): 0
Average IOPS:           58
Stddev IOPS:            93.8068
Max IOPS:               259
Min IOPS:               0
Average Latency(s):     0.271253
Stddev Latency(s):      2.43809
Max latency(s):         37.7948
Min latency(s):         0.0238844

An average of 58 IOPS is... less than I might hope for from this hardware. But those long runs of zeros... I feel like 0 IOPS is inarguably too low.
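For the next run I plan to keep a second shell timestamping the health output, so I can line the stalls up against whichever OSDs are reporting slow ops at that moment. A rough sketch of what I have in mind, run from a mon host (the grep pattern is just my guess at the relevant warning text):

# run alongside the bench; 'SLOW_OPS|laggy' is a guess at the health wording
while sleep 5; do
    date -Is
    ceph health detail | grep -E 'SLOW_OPS|laggy'
done | tee bench-health.log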
At the time I ran this, I did have ceph status running alongside; I saw PGs going +laggy and actually caught this message:

    36 slow ops, oldest one blocked for 37 sec, daemons [osd.10,osd.12,osd.13,osd.14,osd.15,osd.17,osd.2,osd.25,osd.28,osd.3]...
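When it recurs I also want to pull the recent op history from one of the daemons named there, to see which stage those ops are stuck in. Something like this, where osd.10 is just the first daemon from that list (I'm assuming Quincy's tell interface accepts the admin-socket dump commands; if not, the same thing should work from cephadm shell with ceph daemon):

# osd.10 is only an example taken from the slow-ops list above
ceph tell osd.10 dump_historic_ops | jq '.ops[] | {description, duration}'

If the per-op events show the time going to peering rather than to the disk, that would at least narrow things down.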
So I checked a few OSD logs; there is nothing during the test itself, but several have entries immediately afterwards like this:

Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=107797/107798 n=826 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 luod=0'0 lua=107760'143777660 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 active mbc={}] start_peering_interval up [48,2] -> [2], acting [48,2] -> [2], acting_primary 48 -> 2, up_primary 48 -> 2, role 1 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=107797/107798 n=826 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=107797/107798 n=4 ec=98376/98376 lis/c=107797/107797 les/c/f=107798/107798/0 sis=110829) [2,14] r=0 lpr=110829 pi=[107797,110829)/1 crt=98878'16 lcod 0'0 mlcod 0'0 active mbc={}] start_peering_interval up [2,48,14] -> [2,14], acting [2,48,14] -> [2,14], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=107797/107798 n=4 ec=98376/98376 lis/c=107797/107797 les/c/f=107798/107798/0 sis=110829) [2,14] r=0 lpr=110829 pi=[107797,110829)/1 crt=98878'16 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.30c( v 110828'58472450 (107939'58468576,110828'58472450] local-lis/les=107797/107798 n=840 ec=1/1 lis/c=107797/107797 les/c/f=107798/107800/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 luod=110828'58472444 crt=110828'58472450 lcod 110828'58472443 mlcod 110828'58472443 active mbc={255={}}] start_peering_interval up [2,48] -> [2], acting [2,48] -> [2], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=107797/107798 n=831 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 luod=110828'84969917 crt=110828'84969926 lcod 110828'84969916 mlcod 110828'84969916 active mbc={255={}}] start_peering_interval up [2,48] -> [2], acting [2,48] -> [2], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.30c( v 110828'58472450 (107939'58468576,110828'58472450] local-lis/les=107797/107798 n=840 ec=1/1 lis/c=107797/107797 les/c/f=107798/107800/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 crt=110828'58472450 lcod 110828'58472443 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=107797/107798 n=831 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 crt=110828'84969926 lcod 110828'84969916 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
Aug 14 01:27:57 f13 ceph-osd[651856]: --2- [v2:192.168.2.213:6804/3119259815,v1:192.168.2.213:6805/3119259815] >> [v2:192.168.2.225:6808/1596776081,v1:192.168.2.225:6809/1596776081] conn(0x55c3dc924400 0x55c3b8555b80 crc :-1 s=SESSION_ACCEPTING pgs=167 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_reconnect no existing connection exists, reseting client
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=110829/110830 n=826 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.513665199s) [48,2] r=1 lpr=110833 pi=[107797,110833)/1 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 active pruub 22348.017578125s@ mbc={}] start_peering_interval up [2] -> [48,2], acting [2] -> [48,2], acting_primary 2 -> 48, up_primary 2 -> 48, role 0 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=110829/110830 n=826 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.513574600s) [48,2] r=1 lpr=110833 pi=[107797,110833)/1 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 unknown NOTIFY pruub 22348.017578125s@ mbc={}] state<Start>: transitioning to Stray
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=110829/110830 n=4 ec=98376/98376 lis/c=110829/107797 les/c/f=110830/107798/0 sis=110833 pruub=15.519681931s) [2,48,14] r=0 lpr=110833 pi=[107797,110833)/1 crt=98878'16 lcod 0'0 mlcod 0'0 active pruub 22348.023437500s@ mbc={}] start_peering_interval up [2,14] -> [2,48,14], acting [2,14] -> [2,48,14], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=110829/110830 n=4 ec=98376/98376 lis/c=110829/107797 les/c/f=110830/107798/0 sis=110833 pruub=15.519681931s) [2,48,14] r=0 lpr=110833 pi=[107797,110833)/1 crt=98878'16 lcod 0'0 mlcod 0'0 unknown pruub 22348.023437500s@ mbc={}] state<Start>: transitioning to Primary
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=110829/110830 n=831 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.512885094s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'84969926 lcod 110828'84969916 mlcod 0'0 active pruub 22348.017578125s@ mbc={}] start_peering_interval up [2] -> [2,48], acting [2] -> [2,48], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=110829/110830 n=831 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.512885094s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'84969926 lcod 110828'84969916 mlcod 0'0 unknown pruub 22348.017578125s@ mbc={}] state<Start>: transitioning to Primary
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.30c( v 110828'58472450 (107939'58468576,110828'58472450] local-lis/les=110829/110830 n=840 ec=1/1 lis/c=110829/107797 les/c/f=110830/107800/43927 sis=110833 pruub=15.512117386s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'58472450 lcod 110828'58472443 mlcod 0'0 active pruub 22348.017578125s@ mbc={}] start_peering_interval up [2] -> [2,48], acting [2] -> [2,48], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.30c( v 110828'58472450 (107939'58468576,110828'58472450] local-lis/les=110829/110830 n=840 ec=1/1 lis/c=110829/107797 les/c/f=110830/107800/43927 sis=110833 pruub=15.512117386s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'58472450 lcod 110828'58472443 mlcod 0'0 unknown pruub 22348.017578125s@ mbc={}] state<Start>: transitioning to Primary
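Reading those entries back, it looks like osd.48 dropped out of the up/acting sets at 01:27:49 and rejoined at 01:27:58. In other words, it was briefly marked down in the middle of the bench, and every PG it shares with osd.2 had to re-peer, which would block IO on those PGs. To find out who marked it down and why, I'm going to dig through the cluster log from a mon, roughly like this (the grep pattern is just a guess at the usual wording of the down/boot messages):

# search the recent cluster log for osd.48 being failed/marked down or rebooting
ceph log last 1000 | grep -Ei 'osd\.48|marked.*down|wrongly|boot'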
I also captured iostat -x 5 from several of the OSDs during the test, including one of the ones identified as having blocked IO. It shows the devices 3-15% utilized at 30-200 IOPS (mostly writes), with r_await averaging 0.22 ms and w_await averaging 0.46 ms. Nothing in dmesg or syslog.

Is there anything to be done to further investigate or remediate this? This cluster is pretty old, and I'm starting to think we should just scrap it as hopeless. But that's a lot of hardware to write off if there's any possible way to fix this. Thanks for any advice!