I have a cluster that is performing *very* poorly. It has 12 physical servers and 72 OSDs on 400GB Intel DC mixed-use SATA SSDs (S3700s and S3710s). All servers have bonded 10Gb NICs for 20Gbps aggregate networking to a dedicated switch, and there is at least one CPU core and 8GB RAM per OSD. No SMART errors, and all SSDs have media wearout indicators >90% remaining. The cluster is Quincy 17.2.6 under cephadm. (It was recently upgraded from Octopus.) Its primary use is serving RBD VM block devices for Proxmox, but it is a standalone cluster. (The rados bench below was run from one of the mon servers, so Proxmox is not involved at all here.) The results are... problematic:

gridfury@f13:~$ rados bench -p docker 60 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
Object prefix: benchmark_data_f13_857310
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       220       204   815.924       816   0.0339855   0.0617994
    2      16       252       236   471.942       128   0.0360448   0.0584063
    3      16       431       415   553.261       716   0.0544013     0.11399
    4      16       533       517    516.93       408   0.0536192    0.103342
    5      16       553       537   429.541        80   0.0465437    0.120551
    6      16       675       659   439.271       488   0.0390309     0.13919
    7      16       704       688   393.087       116   0.0344105    0.134758
    8      16       935       919   459.433       924   0.0443338    0.138301
    9      16      1167      1151    511.48       928   0.0407041    0.124401
   10      16      1219      1203   481.129       208   0.0408885    0.121026
   11      16      1219      1203   437.389         0           -    0.121026
   12      16      1333      1317   438.934       228   0.0436965    0.141108
   13      16      1333      1317   405.169         0           -    0.141108
   14      16      1333      1317   376.228         0           -    0.141108
   15      16      1333      1317   351.146         0           -    0.141108
   16      16      1333      1317   329.199         0           -    0.141108
   17      16      1383      1367   321.597        40   0.0738412    0.198283
   18      16      1621      1605   356.612       952   0.0396665    0.179042
   19      16      1880      1864   392.361      1036   0.0349097    0.162829
2023-08-14T01:27:04.758111+0000 min lat: 0.0238844 max lat: 5.29731 avg lat: 0.152446
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20      16      2111      2095   418.936       924   0.0581478    0.152446
   21      16      2352      2336   444.881       964   0.0639009    0.143467
   22      16      2592      2576   468.288       960   0.0921614    0.136478
   23      16      2828      2812   488.965       944   0.0587983    0.130606
   24      16      3050      3034   505.586       888    0.106586    0.126353
   25      16      3167      3151    504.08       468   0.0606541    0.126778
   26      16      3408      3392   521.763       964   0.0356258     0.12252
   27      16      3644      3628   537.395       944    0.045652    0.118972
   28      16      3844      3828   546.769       800   0.0328637    0.116728
   29      16      3870      3854   531.501       104   0.0418716     0.11621
   30      16      3870      3854   513.784         0           -     0.11621
   31      16      3870      3854    497.21         0           -     0.11621
   32      16      3870      3854   481.673         0           -     0.11621
   33      16      3870      3854   467.076         0           -     0.11621
   34      16      3870      3854   453.339         0           -     0.11621
   35      16      3870      3854   440.386         0           -     0.11621
   36      16      3870      3854   428.152         0           -     0.11621
   37      16      3870      3854   416.581         0           -     0.11621
   38      16      3870      3854   405.618         0           -     0.11621
   39      16      3870      3854   395.218         0           -     0.11621
2023-08-14T01:27:24.761559+0000 min lat: 0.0238844 max lat: 5.29731 avg lat: 0.11621
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   40      16      3870      3854   385.337         0           -     0.11621
   41      16      3870      3854   375.938         0           -     0.11621
   42      16      3870      3854   366.987         0           -     0.11621
   43      16      3870      3854   358.452         0           -     0.11621
   44      16      3870      3854   350.306         0           -     0.11621
   45      16      3870      3854   342.521         0           -     0.11621
   46      16      3870      3854   335.075         0           -     0.11621
   47      16      3870      3854   327.946         0           -     0.11621
   48      16      3870      3854   321.114         0           -     0.11621
   49      16      3870      3854    314.56         0           -     0.11621
   50      16      3870      3854   308.269         0           -     0.11621
   51      16      3870      3854   302.224         0           -     0.11621
   52      16      3870      3854   296.412         0           -     0.11621
   53      16      3870      3854    290.82         0           -     0.11621
   54      16      3870      3854   285.434         0           -     0.11621
   55      16      3870      3854   280.244         0           -     0.11621
   56      16      3870      3854    275.24         0           -     0.11621
   57      16      3870      3854   270.411         0           -     0.11621
   58      16      3870      3854   265.749         0           -     0.11621
   59      16      3870      3854   261.245         0           -     0.11621
2023-08-14T01:27:44.765065+0000 min lat: 0.0238844 max lat: 5.29731 avg lat: 0.11621
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      16      3870      3854    256.89         0           -     0.11621
   61      16      3870      3854   252.679         0           -     0.11621
   62      16      3870      3854   248.603         0           -     0.11621
   63      16      3870      3854   244.657         0           -     0.11621
   64      16      3870      3854   240.835         0           -     0.11621
   65      16      3870      3854   237.129         0           -     0.11621
Total time run:         65.6256
Total writes made:      3870
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     235.884
Stddev Bandwidth:       375.227
Max bandwidth (MB/sec): 1036
Min bandwidth (MB/sec): 0
Average IOPS:           58
Stddev IOPS:            93.8068
Max IOPS:               259
Min IOPS:               0
Average Latency(s):     0.271253
Stddev Latency(s):      2.43809
Max latency(s):         37.7948
Min latency(s):         0.0238844

An average of 58 IOPS is... less than I might hope for from this hardware. But those long runs of zeros... I feel like 0 IOPS is inarguably too low.
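For the next run I plan to keep a second shell timestamping the health output, so I can line the stalls up against whichever OSDs are reporting slow ops at that moment. A rough sketch of what I have in mind, run from a mon host (the grep pattern is just my guess at the relevant warning text):

# run alongside the bench; 'SLOW_OPS|laggy' is a guess at the health wording
while sleep 5; do
    date -Is
    ceph health detail | grep -E 'SLOW_OPS|laggy'
done | tee bench-health.log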
At the time I ran this, I did have ceph status running alongside; I saw PGs going +laggy and actually caught this message:

    36 slow ops, oldest one blocked for 37 sec, daemons [osd.10,osd.12,osd.13,osd.14,osd.15,osd.17,osd.2,osd.25,osd.28,osd.3]...
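When it recurs I also want to pull the recent op history from one of the daemons named there, to see which stage those ops are stuck in. Something like this, where osd.10 is just the first daemon from that list (I'm assuming Quincy's tell interface accepts the admin-socket dump commands; if not, the same thing should work from cephadm shell with ceph daemon):

# osd.10 is only an example taken from the slow-ops list above
ceph tell osd.10 dump_historic_ops | jq '.ops[] | {description, duration}'

If the per-op events show the time going to peering rather than to the disk, that would at least narrow things down.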
So I checked a few OSD logs; there is nothing during the test itself, but several have entries immediately afterwards like this:

Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=107797/107798 n=826 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 luod=0'0 lua=107760'143777660 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 active mbc={}] start_peering_interval up [48,2] -> [2], acting [48,2] -> [2], acting_primary 48 -> 2, up_primary 48 -> 2, role 1 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=107797/107798 n=826 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=107797/107798 n=4 ec=98376/98376 lis/c=107797/107797 les/c/f=107798/107798/0 sis=110829) [2,14] r=0 lpr=110829 pi=[107797,110829)/1 crt=98878'16 lcod 0'0 mlcod 0'0 active mbc={}] start_peering_interval up [2,48,14] -> [2,14], acting [2,48,14] -> [2,14], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=107797/107798 n=4 ec=98376/98376 lis/c=107797/107797 les/c/f=107798/107798/0 sis=110829) [2,14] r=0 lpr=110829 pi=[107797,110829)/1 crt=98878'16 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.30c( v 110828'58472450 (107939'58468576,110828'58472450] local-lis/les=107797/107798 n=840 ec=1/1 lis/c=107797/107797 les/c/f=107798/107800/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 luod=110828'58472444 crt=110828'58472450 lcod 110828'58472443 mlcod 110828'58472443 active mbc={255={}}] start_peering_interval up [2,48] -> [2], acting [2,48] -> [2], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=107797/107798 n=831 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 luod=110828'84969917 crt=110828'84969926 lcod 110828'84969916 mlcod 110828'84969916 active mbc={255={}}] start_peering_interval up [2,48] -> [2], acting [2,48] -> [2], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.30c( v 110828'58472450 (107939'58468576,110828'58472450] local-lis/les=107797/107798 n=840 ec=1/1 lis/c=107797/107797 les/c/f=107798/107800/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 crt=110828'58472450 lcod 110828'58472443 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=107797/107798 n=831 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 crt=110828'84969926 lcod 110828'84969916 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
Aug 14 01:27:57 f13 ceph-osd[651856]: --2- [v2:192.168.2.213:6804/3119259815,v1:192.168.2.213:6805/3119259815] >> [v2:192.168.2.225:6808/1596776081,v1:192.168.2.225:6809/1596776081] conn(0x55c3dc924400 0x55c3b8555b80 crc :-1 s=SESSION_ACCEPTING pgs=167 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_reconnect no existing connection exists, reseting client
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=110829/110830 n=826 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.513665199s) [48,2] r=1 lpr=110833 pi=[107797,110833)/1 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 active pruub 22348.017578125s@ mbc={}] start_peering_interval up [2] -> [48,2], acting [2] -> [48,2], acting_primary 2 -> 48, up_primary 2 -> 48, role 0 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=110829/110830 n=826 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.513574600s) [48,2] r=1 lpr=110833 pi=[107797,110833)/1 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 unknown NOTIFY pruub 22348.017578125s@ mbc={}] state<Start>: transitioning to Stray
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=110829/110830 n=4 ec=98376/98376 lis/c=110829/107797 les/c/f=110830/107798/0 sis=110833 pruub=15.519681931s) [2,48,14] r=0 lpr=110833 pi=[107797,110833)/1 crt=98878'16 lcod 0'0 mlcod 0'0 active pruub 22348.023437500s@ mbc={}] start_peering_interval up [2,14] -> [2,48,14], acting [2,14] -> [2,48,14], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=110829/110830 n=4 ec=98376/98376 lis/c=110829/107797 les/c/f=110830/107798/0 sis=110833 pruub=15.519681931s) [2,48,14] r=0 lpr=110833 pi=[107797,110833)/1 crt=98878'16 lcod 0'0 mlcod 0'0 unknown pruub 22348.023437500s@ mbc={}] state<Start>: transitioning to Primary
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=110829/110830 n=831 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.512885094s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'84969926 lcod 110828'84969916 mlcod 0'0 active pruub 22348.017578125s@ mbc={}] start_peering_interval up [2] -> [2,48], acting [2] -> [2,48], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=110829/110830 n=831 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.512885094s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'84969926 lcod 110828'84969916 mlcod 0'0 unknown pruub 22348.017578125s@ mbc={}] state<Start>: transitioning to Primary
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.30c( v 110828'58472450 (107939'58468576,110828'58472450] local-lis/les=110829/110830 n=840 ec=1/1 lis/c=110829/107797 les/c/f=110830/107800/43927 sis=110833 pruub=15.512117386s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'58472450 lcod 110828'58472443 mlcod 0'0 active pruub 22348.017578125s@ mbc={}] start_peering_interval up [2] -> [2,48], acting [2] -> [2,48], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.30c( v 110828'58472450 (107939'58468576,110828'58472450] local-lis/les=110829/110830 n=840 ec=1/1 lis/c=110829/107797 les/c/f=110830/107800/43927 sis=110833 pruub=15.512117386s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'58472450 lcod 110828'58472443 mlcod 0'0 unknown pruub 22348.017578125s@ mbc={}] state<Start>: transitioning to Primary
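Reading those entries back, it looks like osd.48 dropped out of the up/acting sets at 01:27:49 and rejoined at 01:27:58. In other words, it was briefly marked down in the middle of the bench, and every PG it shares with osd.2 had to re-peer, which would block IO on those PGs. To find out who marked it down and why, I'm going to dig through the cluster log from a mon, roughly like this (the grep pattern is just a guess at the usual wording of the down/boot messages):

# search the recent cluster log for osd.48 being failed/marked down or rebooting
ceph log last 1000 | grep -Ei 'osd\.48|marked.*down|wrongly|boot'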
I also captured iostat -x 5 from several of the OSDs during the test, including one of the ones identified as having blocked IO. It shows the devices 3-15% utilized at 30-200 IOPS (mostly writes), with r_await averaging 0.22 ms and w_await averaging 0.46 ms. Nothing in dmesg or syslog.

Is there anything to be done to further investigate or remediate this? This cluster is pretty old, and I'm starting to think we should just scrap it as hopeless. But that's a lot of hardware to write off if there's any possible way to fix this. Thanks for any advice!