On Sun, Aug 13, 2023 at 10:19 PM J David <j.david.lists@xxxxxxxxx> wrote:
>
> I have a cluster that is performing *very* poorly.
>
> It has 12 physical servers and 72 400GB SATA Intel DC mixed-use SSD OSDs (S3700's & S3710's). All servers have bonded 10GB NICs for 20Gbps aggregate networking to a dedicated switch. At least one CPU core and 8GB RAM per OSD. No SMART errors. All SSDs have media wearout indicators >90% remaining.
>
> The cluster is Quincy 17.2.6 under cephadm. (It was recently upgraded from Octopus.) Its primary use is serving RBD VM block devices for Proxmox, but it is a standalone cluster. (The RBD bench above was run from one of the mon servers, so Proxmox is not at all involved here.)
>
> The rados bench for this is... problematic:
>
> gridfury@f13:~$ rados bench -p docker 60 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
> Object prefix: benchmark_data_f13_857310
> sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
> 0 0 0 0 0 0 - 0
> 1 16 220 204 815.924 816 0.0339855 0.0617994
> 2 16 252 236 471.942 128 0.0360448 0.0584063
> 3 16 431 415 553.261 716 0.0544013 0.11399
> 4 16 533 517 516.93 408 0.0536192 0.103342
> 5 16 553 537 429.541 80 0.0465437 0.120551
> 6 16 675 659 439.271 488 0.0390309 0.13919
> 7 16 704 688 393.087 116 0.0344105 0.134758
> 8 16 935 919 459.433 924 0.0443338 0.138301
> 9 16 1167 1151 511.48 928 0.0407041 0.124401
> 10 16 1219 1203 481.129 208 0.0408885 0.121026
> 11 16 1219 1203 437.389 0 - 0.121026
> 12 16 1333 1317 438.934 228 0.0436965 0.141108
> 13 16 1333 1317 405.169 0 - 0.141108
> 14 16 1333 1317 376.228 0 - 0.141108
> 15 16 1333 1317 351.146 0 - 0.141108
> 16 16 1333 1317 329.199 0 - 0.141108
> 17 16 1383 1367 321.597 40 0.0738412 0.198283
> 18 16 1621 1605 356.612 952 0.0396665 0.179042
> 19 16 1880 1864 392.361 1036 0.0349097 0.162829
> 2023-08-14T01:27:04.758111+0000 min lat: 0.0238844 max lat: 5.29731 avg lat: 0.152446
> sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
> 20 16 2111 2095 418.936 924 0.0581478 0.152446
> 21 16 2352 2336 444.881 964 0.0639009 0.143467
> 22 16 2592 2576 468.288 960 0.0921614 0.136478
> 23 16 2828 2812 488.965 944 0.0587983 0.130606
> 24 16 3050 3034 505.586 888 0.106586 0.126353
> 25 16 3167 3151 504.08 468 0.0606541 0.126778
> 26 16 3408 3392 521.763 964 0.0356258 0.12252
> 27 16 3644 3628 537.395 944 0.045652 0.118972
> 28 16 3844 3828 546.769 800 0.0328637 0.116728
> 29 16 3870 3854 531.501 104 0.0418716 0.11621
> 30 16 3870 3854 513.784 0 - 0.11621
> 31 16 3870 3854 497.21 0 - 0.11621
> 32 16 3870 3854 481.673 0 - 0.11621
> 33 16 3870 3854 467.076 0 - 0.11621
> 34 16 3870 3854 453.339 0 - 0.11621
> 35 16 3870 3854 440.386 0 - 0.11621
> 36 16 3870 3854 428.152 0 - 0.11621
> 37 16 3870 3854 416.581 0 - 0.11621
> 38 16 3870 3854 405.618 0 - 0.11621
> 39 16 3870 3854 395.218 0 - 0.11621
> 2023-08-14T01:27:24.761559+0000 min lat: 0.0238844 max lat: 5.29731 avg lat: 0.11621
> sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
> 40 16 3870 3854 385.337 0 - 0.11621
> 41 16 3870 3854 375.938 0 - 0.11621
> 42 16 3870 3854 366.987 0 - 0.11621
> 43 16 3870 3854 358.452 0 - 0.11621
> 44 16 3870 3854 350.306 0 - 0.11621
> 45 16 3870 3854 342.521 0 - 0.11621
> 46 16 3870 3854 335.075 0 - 0.11621
> 47 16 3870 3854 327.946 0 - 0.11621
> 48 16 3870 3854 321.114 0 - 0.11621
> 49 16 3870 3854 314.56 0 - 0.11621
> 50 16 3870 3854 308.269 0 - 0.11621
> 51 16 3870 3854 302.224 0 - 0.11621
> 52 16 3870 3854 296.412 0 - 0.11621
> 53 16 3870 3854 290.82 0 - 0.11621
> 54 16 3870 3854 285.434 0 - 0.11621
> 55 16 3870 3854 280.244 0 - 0.11621
> 56 16 3870 3854 275.24 0 - 0.11621
> 57 16 3870 3854 270.411 0 - 0.11621
> 58 16 3870 3854 265.749 0 - 0.11621
> 59 16 3870 3854 261.245 0 - 0.11621
> 2023-08-14T01:27:44.765065+0000 min lat: 0.0238844 max lat: 5.29731 avg lat: 0.11621
> sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
> 60 16 3870 3854 256.89 0 - 0.11621
> 61 16 3870 3854 252.679 0 - 0.11621
> 62 16 3870 3854 248.603 0 - 0.11621
> 63 16 3870 3854 244.657 0 - 0.11621
> 64 16 3870 3854 240.835 0 - 0.11621
> 65 16 3870 3854 237.129 0 - 0.11621
> Total time run: 65.6256
> Total writes made: 3870
> Write size: 4194304
> Object size: 4194304
> Bandwidth (MB/sec): 235.884
> Stddev Bandwidth: 375.227
> Max bandwidth (MB/sec): 1036
> Min bandwidth (MB/sec): 0
> Average IOPS: 58
> Stddev IOPS: 93.8068
> Max IOPS: 259
> Min IOPS: 0
> Average Latency(s): 0.271253
> Stddev Latency(s): 2.43809
> Max latency(s): 37.7948
> Min latency(s): 0.0238844
>
> 58 IOPs average is... less than I might hope for from this hardware. But those long strings of 0's... I feel like 0 IOPS is inarguably too low.
>
> At the time I ran this, I had ceph status running and saw +laggy pgs and actually caught a message:
>
> 36 slow ops, oldest one blocked for 37 sec, daemons [osd.10,osd.12,osd.13,osd.14,osd.15,osd.17,osd.2,osd.25,osd.28,osd.3]...
>
> So I checked a few osd logs; there is nothing during the test, but several have stuff immediately after like this:
>
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=107797/107798 n=826 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 luod=0'0 lua=107760'143777660 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 active mbc={}] start_peering_interval up [48,2] -> [2], acting [48,2] -> [2], acting_primary 48 -> 2, up_primary 48 -> 2, role 1 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=107797/107798 n=826 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=107797/107798 n=4 ec=98376/98376 lis/c=107797/107797 les/c/f=107798/107798/0 sis=110829) [2,14] r=0 lpr=110829 pi=[107797,110829)/1 crt=98878'16 lcod 0'0 mlcod 0'0 active mbc={}] start_peering_interval up [2,48,14] -> [2,14], acting [2,48,14] -> [2,14], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=107797/107798 n=4 ec=98376/98376 lis/c=107797/107797 les/c/f=107798/107798/0 sis=110829) [2,14] r=0 lpr=110829 pi=[107797,110829)/1 crt=98878'16 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.30c( v 110828'58472450
> (107939'58468576,110828'58472450] local-lis/les=107797/107798 n=840 ec=1/1 lis/c=107797/107797 les/c/f=107798/107800/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 luod=110828'58472444 crt=110828'58472450 lcod 110828'58472443 mlcod 110828'58472443 active mbc={255={}}] start_peering_interval up [2,48] -> [2], acting [2,48] -> [2], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=107797/107798 n=831 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 luod=110828'84969917 crt=110828'84969926 lcod 110828'84969916 mlcod 110828'84969916 active mbc={255={}}] start_peering_interval up [2,48] -> [2], acting [2,48] -> [2], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.30c( v 110828'58472450 (107939'58468576,110828'58472450] local-lis/les=107797/107798 n=840 ec=1/1 lis/c=107797/107797 les/c/f=107798/107800/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 crt=110828'58472450 lcod 110828'58472443 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
> Aug 14 01:27:49 f13 ceph-osd[651856]: osd.2 pg_epoch: 110829 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=107797/107798 n=831 ec=1/1 lis/c=107797/107797 les/c/f=107798/107798/43927 sis=110829) [2] r=0 lpr=110829 pi=[107797,110829)/1 crt=110828'84969926 lcod 110828'84969916 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
> Aug 14 01:27:57 f13 ceph-osd[651856]: --2- [v2:192.168.2.213:6804/3119259815,v1:192.168.2.213:6805/3119259815] >> [v2:192.168.2.225:6808/1596776081,v1:192.168.2.225:6809/1596776081] conn(0x55c3dc924400 0x55c3b8555b80 crc :-1 s=SESSION_ACCEPTING pgs=167 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_reconnect no existing connection exists, reseting client
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=110829/110830 n=826 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.513665199s) [48,2] r=1 lpr=110833 pi=[107797,110833)/1 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 active pruub 22348.017578125s@ mbc={}] start_peering_interval up [2] -> [48,2], acting [2] -> [48,2], acting_primary 2 -> 48, up_primary 2 -> 48, role 0 -> 1, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.591( v 110828'143791701 (110828'143791003,110828'143791701] local-lis/les=110829/110830 n=826 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.513574600s) [48,2] r=1 lpr=110833 pi=[107797,110833)/1 crt=110828'143791701 lcod 110828'143791700 mlcod 0'0 unknown NOTIFY pruub 22348.017578125s@ mbc={}] state<Start>: transitioning to Stray
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=110829/110830 n=4 ec=98376/98376 lis/c=110829/107797 les/c/f=110830/107798/0 sis=110833 pruub=15.519681931s) [2,48,14] r=0 lpr=110833 pi=[107797,110833)/1 crt=98878'16 lcod 0'0 mlcod 0'0 active pruub 22348.023437500s@ mbc={}] start_peering_interval up [2,14] -> [2,48,14], acting [2,14] -> [2,48,14], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[5.b( v 98878'16 (0'0,98878'16] local-lis/les=110829/110830 n=4 ec=98376/98376 lis/c=110829/107797 les/c/f=110830/107798/0 sis=110833 pruub=15.519681931s) [2,48,14] r=0 lpr=110833 pi=[107797,110833)/1 crt=98878'16 lcod 0'0 mlcod 0'0 unknown pruub 22348.023437500s@ mbc={}] state<Start>: transitioning to Primary
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=110829/110830 n=831 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.512885094s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'84969926 lcod 110828'84969916 mlcod 0'0 active pruub 22348.017578125s@ mbc={}] start_peering_interval up [2] -> [2,48], acting [2] -> [2,48], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.326( v 110828'84969926 (110432'84966024,110828'84969926] local-lis/les=110829/110830 n=831 ec=1/1 lis/c=110829/107797 les/c/f=110830/107798/43927 sis=110833 pruub=15.512885094s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'84969926 lcod 110828'84969916 mlcod 0'0 unknown pruub 22348.017578125s@ mbc={}] state<Start>: transitioning to Primary
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.30c( v 110828'58472450 (107939'58468576,110828'58472450] local-lis/les=110829/110830 n=840 ec=1/1 lis/c=110829/107797 les/c/f=110830/107800/43927 sis=110833 pruub=15.512117386s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'58472450 lcod 110828'58472443 mlcod 0'0 active pruub 22348.017578125s@ mbc={}] start_peering_interval up [2] -> [2,48], acting [2] -> [2,48], acting_primary 2 -> 2, up_primary 2 -> 2, role 0 -> 0, features acting 4540138320759226367 upacting 4540138320759226367
> Aug 14 01:27:58 f13 ceph-osd[651856]: osd.2 pg_epoch: 110833 pg[2.30c( v 110828'58472450 (107939'58468576,110828'58472450] local-lis/les=110829/110830 n=840 ec=1/1 lis/c=110829/107797 les/c/f=110830/107800/43927 sis=110833 pruub=15.512117386s) [2,48] r=0 lpr=110833 pi=[107797,110833)/1 crt=110828'58472450 lcod 110828'58472443 mlcod 0'0 unknown pruub 22348.017578125s@ mbc={}] state<Start>: transitioning to Primary
>
> I also captured iostat -x 5 from several of the OSDs during the test, including one of the ones identified as having blocked IO. It shows 3-15% utilized with 30-200 IOPS (mostly write). r_await average 0.22, w_await average 0.46.
>
> Nothing in dmesg or syslog.
>
> Is there anything to be done to further investigate or remediate this? This cluster is pretty old and I'm starting to think we should just scrap it as hopeless. But that's a lot of hardware to write off if there's any possible way to fix this.
>
> Thanks for any advice!
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

Hi,

Using that command, you'd be doing *4MiB I/Os*. So 232MiB/s or ~2Gbps *from a single client*.
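For an IOPS-oriented test, something along these lines is more representative (a rough sketch only: it reuses the 'docker' pool from your run above, and the 4 KiB block size and queue depth of 64 are just example values to tune):

  # small-block writes, 64 in flight; keep the objects so a read pass can follow
  rados bench -p docker 60 write -b 4096 -t 64 --no-cleanup

  # random small reads against the objects written above
  rados bench -p docker 60 rand -t 64

  # remove the benchmark objects when finished
  rados -p docker cleanup

(If you run several instances in parallel from different client hosts, give each its own --run-name so they don't collide.)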
You might want to try smaller I/O sizes and more clients like that to assess cluster performance if you are interested in high-IOPS workloads.

The logs you pasted show the OSDs are peering; when they do that, cluster performance will drop considerably, especially if there is subsequent backfilling. Did the OSD processes crash? They will not just peer like that for no reason.

The slow OSDs look like they could all be adjacent. Did you inspect `ceph osd tree` to see whether they all lie under the same host? Do you have network issues going from your client to that host?

Also, 1 CPU core per OSD is definitely undersized. I'm not sure how much CPU you actually have, but you want at least a couple of cores per OSD for SATA SSDs, and even more for NVMe, especially when it comes to small-block write workloads.

Cheers,
Tyler
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx