One thing to add to this discussion: I had a lot of problems with my clusters and spent some time debugging. What I found, and confirmed on AMD nodes, is that everything starts working like a charm once I add the kernel parameter iommu=pt. There were some other tunings as well that I cannot share in full right now, but iommu=pt by itself should already help. In the beginning it looked as if something in the kernel network stack was slowing down packets.

BR, Sebastian

> On 2 Dec 2022, at 16:03, Manuel Holtgrewe <zyklenfrei@xxxxxxxxx> wrote:
>
> Dear Mark,
>
> Thank you very much for all of this information. I learned a lot! In particular that I need to learn more about pinning.
>
> In the end, I want to run the whole thing in production with real-world workloads. My main aim in running the benchmark is to ensure that my hardware and OS are correctly configured (I already found some configuration issues in my switches along the way: a lack of balancing between LAG interconnects, using layer 3+4 in creating my bonds, particularities of Dell VLTi, and needing unique VLT IDs...). Also, it will be interesting to see how things turn out after the cluster has run for a year.
>
> As far as I can see, the network and OS configuration is sane. The Ceph configuration appears to be not too far off from something that I could hand to my users.
>
> I will try to play a bit more with the pinning and metadata tuning.
>
> Best wishes,
> Manuel
>
> Mark Nelson <mnelson@xxxxxxxxxx> wrote on Thu, 1 Dec 2022, 20:19:
>
>> Hi Manuel,
>>
>> I did the IO500 runs back in 2020 and wrote the cephfs aiori backend for IOR/mdtest. Not sure about the segfault, it's been a while since I've touched that code. It was working the last time I used it. :D Having said that, I don't think that's your issue. The userland backend helped work around an issue where I wasn't able to exceed about 3GB/s per host with the kernel client and thus couldn't hit more than about 30GB/s in the easy tests on a 10 node setup. I think Jeff Layton might have fixed that issue when he improved the locking code in the kernel a while back, and it appears you are getting good results with the kernel client in the easy tests. I don't recall the userland backend performing much differently than the kernel client in the other tests. Instead I would recommend looking at each test individually:
>>
>> ior-easy-write (and read):
>>
>> Each process gets its own file, large aligned IO. Pretty easy for the MDS and the rest of Ceph to handle. You get better results overall than I did! These are the tests we typically do best on out of the box.
>>
>> mdtest-easy-write (and stat/delete):
>>
>> Each process gets its own directory, writing out zero-sized files. The trick to getting good performance here is to use ephemeral pinning on the parent test directory. Even better would be to use static round-robin pinning for each rank's sub-directory. Sadly that violates the rules now, and we haven't implemented a way to do this with a single parent-level xattr (though it would be pretty easy, which makes the rule not to touch the subdirs kind of silly imho). I was able to achieve up to around 10K IOPS per MDS, with the highest achieved score around 400-500K IOPS with 80 MDSes (but that configuration was suboptimal for other tests). Ephemeral pinning is ok, but you need enough directories to avoid "clumpy" distribution across MDSes.
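>>
>> As a rough illustration, both approaches boil down to a single xattr per directory. A minimal sketch, assuming the benchmark tree lives at /mnt/cephfs/io500 and that mdtest creates one sub-directory per rank underneath it (the paths and the MDS count here are only placeholders):
>>
>> ```
>> # Ephemeral (distributed) pinning: one xattr on the shared parent directory;
>> # its immediate child directories are then hashed across all active MDS ranks.
>> setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/io500/mdtest-easy
>>
>> # Static round-robin pinning: explicitly pin each rank's sub-directory
>> # to one MDS rank in turn.
>> num_mds=20
>> i=0
>> for d in /mnt/cephfs/io500/mdtest-easy/*/; do
>>     setfattr -n ceph.dir.pin -v $(( i % num_mds )) "$d"
>>     i=$(( i + 1 ))
>> done
>> ```
>>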
>> At ~320 processes/directories and 40 MDSes I was seeing about half the performance vs doing perfect round-robin pinning of the individual process directories. Well, with one exception: when doing manual pinning, it's better to exclude the authoritative MDS for the parent directory (or perhaps just give it fewer directories than the others), since it's also doing other work and ends up lagging behind, slowing the whole benchmark down. Having said that, this is one of the easier tests to improve so long as you use some kind of reasonable pinning strategy with multiple MDSes.
>>
>> ior-hard-write (and read):
>>
>> Small unaligned IO to a single shared file; I think the I/Os are ~47K each. This is rough to improve without code changes imho. I remember the results being highly variable in my tests, and it took multiple runs to get a high score. I don't remember exactly what I had to tweak here, but as opposed to the easy tests you are likely heavily latency bound even with 47K I/Os. I expect you are going to be slamming a single OSD (and PG!) over and over from multiple clients and constrained by how quickly you can get those I/Os replicated (for writes when rep > 1) and locks acquired/released (in all cases). I'm guessing that ensuring the lowest possible per-OSD latency and highest per-OSD throughput is probably a big win here. Not sure what on the CephFS side might be playing a role, but I imagine caps and file-level locking might matter. You can imagine that a system that let you just dump IO as a log append straight to disk, with some kind of clever scheme to avoid file-based locking, would do better here.
>>
>> mdtest-hard-write (and stat/delete):
>>
>> All processes write 3901-byte files to a single directory. Dirfrag splitting and exporting is a huge bottleneck. The balancing code in the MDS can basically DDoS itself to the point where in a 30s (or even a 5 minute!) test you never actually export anything to other MDSes. You end up both servicing all requests on the authoritative MDS and simultaneously doing a bunch of work trying and failing to acquire locks to do the dirfrag exports. If you do manage to actually get dirfrags onto other MDSes it can lead to performance improvements, but even then there are cyclical near-stalls in throughput that tank performance, likely related to further splitting and attempting to export dirfrags. As the subtree map grows, journal writes on the authoritative MDS for the parent directory become increasingly expensive. If I recall, it took a lot of screwing around with MDS and client counts to get a good result, and luck played a role too, just like in the ior-hard tests. It was easy to do worse than simply pinning everything to a single MDS. FWIW, I usually saw higher aggregate performance with longer-running tests than with shorter-running ones.
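>>
>> For reference, that single-MDS baseline is just a static pin on the shared mdtest-hard directory. A minimal sketch, again with a placeholder path and rank 0 chosen arbitrarily:
>>
>> ```
>> # Pin the shared directory (and everything below it) to MDS rank 0,
>> # taking dirfrag exporting out of the picture entirely.
>> setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/io500/mdtest-hard
>> ```
>>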
>> find:
>>
>> Find a subset of the files created in the four tests above. A bit of a ridiculous test, frankly. Results are highly dependent on the number of files created in the easy vs hard mdtest cases above: the more you skew toward the easy tests, the better the find number becomes. They should have separate find tests for easy mdtest and hard mdtest files and just ignore IOR entirely.
>>
>> FWIW there are some long-running efforts to improve some of the bottlenecks I mentioned, especially during subtree map journal writes. Zheng had a PR a while back, but it was fairly complex and never got merged. I believe Patrick is taking a crack at it now using a different approach. FWIW, there are also a couple of good links from Matt Rásó-Barnett (Cambridge) and Glenn Lockwood (formerly at NERSC, now heading up HPC IO strategy at Microsoft) that talk about some of the IO500 tests and the good and bad here:
>>
>> https://www.eofs.eu/_media/events/lad19/03_matt_raso-barnett-io500-cambridge.pdf
>> https://www.glennklockwood.com/benchmarks/io500.html
>>
>> Mark
>>
>> On 12/1/22 01:26, Manuel Holtgrewe wrote:
>>> Dear all,
>>>
>>> I am currently creating a CephFS setup for an HPC setting. I have a Ceph v17.2.5 cluster on Rocky Linux 8.7 (kernel 4.18.0-425.3.1.el8.x86_64) deployed with cephadm. I have 10 Ceph nodes with 2x100GbE LAG interconnect and 36 client nodes with 2x25GbE LAG interconnect. We have Dell NOS10 switches deployed in VLT pairs. Overall, the network topology looks as follows.
>>>
>>> 36 clients -- switch pair -- switch pair -- switch pair -- 10 Ceph nodes
>>>
>>> The switch pairs are each connected with 8x100GbE LAG overall. Thus, the theoretical network limit is ~80GB/sec.
>>>
>>> The client nodes also run Rocky Linux 8 and have 2x Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz CPUs. The Ceph nodes have 1x AMD EPYC 7413 24-Core Processor and 250GB of RAM. All processors have hyperthreading enabled. I have followed the guidance by Croit [1] and done the obvious hardware tuning (configured the BIOS to let the OS do the power control, set up the network with MTU 9000). I have deployed 3 MDS daemons per server and have 20 active overall.
>>>
>>> The Ceph cluster nodes have 10x enterprise NVMes each (all branded as "Dell enterprise disks"): 8 older nodes (from last year) have "Dell Ent NVMe v2 AGN RI U.2 15.36TB", which are Samsung disks, and 2 newer nodes (just delivered) have "Dell Ent NVMe CM6 RI 15.36TB", which are Kioxia disks. Interestingly, the Kioxia disks show about 50% higher IOPS in the 4-processor fio test that Croit suggests.
>>>
>>> I'm running the IO500 benchmark with 10 processes on each client. I have pools set up with rep-1, rep-2, rep-3, and EC 8+2 and run the benchmarks against each of them.
>>>
>>> So far, I have run "only" short tests with an IO500 wall clock time of 30 secs. The good news for me is that I see "ior-easy-write" results of 80 GiB/sec, so the Ceph cluster is able to saturate the switch network interconnects. The bad news is that I cannot replicate the IO500 results from Red Hat in 2020.
>>>
>>> Below are the results that I get on the rep-1 pool.
>>>
>>> ```
>>> IO500 version io500-sc22_v2 (standard)
>>> [RESULT] ior-easy-write 78.772830 GiB/s : time 97.098 seconds
>>> [INVALID]
>>> [RESULT] mdtest-easy-write 37.375945 kIOPS : time 870.934 seconds
>>> [      ] timestamp 0.000000 kIOPS : time 0.000 seconds
>>> [RESULT] ior-hard-write 2.242241 GiB/s : time 35.431 seconds
>>> [INVALID]
>>> [RESULT] mdtest-hard-write 2.575028 kIOPS : time 57.697 seconds
>>> [INVALID]
>>> [RESULT] find 1072.770588 kIOPS : time 30.441 seconds
>>> [RESULT] ior-easy-read 64.118118 GiB/s : time 118.982 seconds
>>> [RESULT] mdtest-easy-stat 154.903631 kIOPS : time 210.887 seconds
>>> [RESULT] ior-hard-read 4.285418 GiB/s : time 18.474 seconds
>>> [RESULT] mdtest-hard-stat 40.126159 kIOPS : time 4.646 seconds
>>> [RESULT] mdtest-easy-delete 39.296673 kIOPS : time 839.509 seconds
>>> [RESULT] mdtest-hard-read 17.161306 kIOPS : time 9.505 seconds
>>> [RESULT] mdtest-hard-delete 4.771440 kIOPS : time 31.931 seconds
>>> [SCORE ] Bandwidth 14.842537 GiB/s : IOPS 34.623082 kiops : TOTAL 22.669239
>>> [INVALID]
>>> ```
>>>
>>> I wonder whether I missed any tuning parameters or other "secret sauce" that enabled the results from [2]:
>>>
>>> ```
>>> [RESULT]   BW phase 1 ior_easy_write 36.255 GiB/s : time 387.94 seconds
>>> [RESULT] IOPS phase 1 mdtest_easy_write 191.980 kiops : time 450.05 seconds
>>> [RESULT]   BW phase 2 ior_hard_write 9.137 GiB/s : time 301.21 seconds
>>> [RESULT] IOPS phase 2 mdtest_hard_write 17.187 kiops : time 393.55 seconds
>>> [RESULT] IOPS phase 3 find 965.790 kiops : time 96.46 seconds
>>> [RESULT]   BW phase 3 ior_easy_read 75.621 GiB/s : time 185.75 seconds
>>> [RESULT] IOPS phase 4 mdtest_easy_stat 903.112 kiops : time 95.67 seconds
>>> [RESULT]   BW phase 4 ior_hard_read 19.080 GiB/s : time 144.22 seconds
>>> [RESULT] IOPS phase 5 mdtest_hard_stat 97.399 kiops : time 69.44 seconds
>>> [RESULT] IOPS phase 6 mdtest_easy_delete 123.455 kiops : time 699.85 seconds
>>> [RESULT] IOPS phase 7 mdtest_hard_read 87.512 kiops : time 77.29 seconds
>>> [RESULT] IOPS phase 8 mdtest_hard_delete 18.814 kiops : time 390.91 seconds
>>> [SCORE] Bandwidth 26.2933 GiB/s : IOPS 124.297 kiops : TOTAL 57.168
>>> ```
>>>
>>> It looks like my results are more in the same order as the SUSE results from 2019 [3].
>>>
>>> ```
>>> [RESULT]   BW phase 1 ior_easy_write 16.072 GB/s : time 347.39 seconds
>>> [RESULT] IOPS phase 1 mdtest_easy_write 32.822 kiops : time 365.67 seconds
>>> [RESULT]   BW phase 2 ior_hard_write 1.572 GB/s : time 359.20 seconds
>>> [RESULT] IOPS phase 2 mdtest_hard_write 12.917 kiops : time 317.70 seconds
>>> [RESULT] IOPS phase 3 find 250.500 kiops : time 64.28 seconds
>>> [RESULT]   BW phase 3 ior_easy_read 9.139 GB/s : time 600.48 seconds
>>> [RESULT] IOPS phase 4 mdtest_easy_stat 127.919 kiops : time 93.82 seconds
>>> [RESULT]   BW phase 4 ior_hard_read 4.698 GB/s : time 120.17 seconds
>>> [RESULT] IOPS phase 5 mdtest_hard_stat 68.791 kiops : time 59.65 seconds
>>> [RESULT] IOPS phase 6 mdtest_easy_delete 20.845 kiops : time 575.70 seconds
>>> [RESULT] IOPS phase 7 mdtest_hard_read 41.640 kiops : time 98.55 seconds
>>> [RESULT] IOPS phase 8 mdtest_hard_delete 6.224 kiops : time 660.50 seconds
>>> [SCORE] Bandwidth 5.73936 GB/s : IOPS 38.7169 kiops : TOTAL 14.9067
>>> ```
>>>
>>> One difference I could find is that the Red Hat results use the CEPHFS backend of IO500 (which I cannot get to work properly because of a crash, "Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))", in libucs.so), whereas SUSE used the POSIX backend.
>>>
>>> Changing from 1 OSD daemon per NVMe to 2 did not help much either.
>>>
>>> Maybe someone on the list has an idea for something else to try?
>>>
>>> Oh, in case anyone is interested, here are some results using the rep-2, rep-3, and ec-8-2 pools.
>>>
>>> ```
>>> *** pool=rep-2 NP=360 ***
>>> IO500 version io500-sc22_v2 (standard)
>>> [RESULT] ior-easy-write 39.613736 GiB/s : time 153.508 seconds
>>> [INVALID]
>>> [RESULT] mdtest-easy-write 13.932462 kIOPS : time 38.119 seconds
>>> [INVALID]
>>> [      ] timestamp 0.000000 kIOPS : time 0.000 seconds
>>> [RESULT] ior-hard-write 1.809117 GiB/s : time 39.019 seconds
>>> [INVALID]
>>> [RESULT] mdtest-hard-write 4.925225 kIOPS : time 37.654 seconds
>>> [INVALID]
>>> [RESULT] find 69.063353 kIOPS : time 9.042 seconds
>>> [RESULT] ior-easy-read 59.503973 GiB/s : time 102.166 seconds
>>> [RESULT] mdtest-easy-stat 143.589003 kIOPS : time 4.097 seconds
>>> [RESULT] ior-hard-read 4.104325 GiB/s : time 14.868 seconds
>>> [RESULT] mdtest-hard-stat 156.252159 kIOPS : time 2.204 seconds
>>> [RESULT] mdtest-easy-delete 35.312782 kIOPS : time 14.249 seconds
>>> [RESULT] mdtest-hard-read 67.097465 kIOPS : time 3.739 seconds
>>> [RESULT] mdtest-hard-delete 11.018869 kIOPS : time 18.060 seconds
>>> [SCORE ] Bandwidth 11.502045 GiB/s : IOPS 35.927582 kiops : TOTAL 20.328322
>>> [INVALID]
>>>
>>> *** pool=rep-3 NP=360 ***
>>> IO500 version io500-sc22_v2 (standard)
>>> [RESULT] ior-easy-write 27.481332 GiB/s : time 204.973 seconds
>>> [INVALID]
>>> [RESULT] mdtest-easy-write 27.699574 kIOPS : time 1502.596 seconds
>>> [      ] timestamp 0.000000 kIOPS : time 0.000 seconds
>>> [RESULT] ior-hard-write 1.352186 GiB/s : time 38.273 seconds
>>> [INVALID]
>>> [RESULT] mdtest-hard-write 3.024279 kIOPS : time 48.923 seconds
>>> [INVALID]
>>> [RESULT] find 777.440295 kIOPS : time 53.684 seconds
>>> [RESULT] ior-easy-read 58.686272 GiB/s : time 95.992 seconds
>>> [RESULT] mdtest-easy-stat 156.499256 kIOPS : time 266.755 seconds
>>> [RESULT] ior-hard-read 4.095575 GiB/s : time 12.649 seconds
>>> [RESULT] mdtest-hard-stat 62.831560 kIOPS : time 3.318 seconds
>>> [RESULT] mdtest-easy-delete 25.909017 kIOPS : time 1606.960 seconds
>>> [RESULT] mdtest-hard-read 16.586529 kIOPS : time 9.735 seconds
>>> [RESULT] mdtest-hard-delete 9.093536 kIOPS : time 18.615 seconds
>>> [SCORE ] Bandwidth 9.721458 GiB/s : IOPS 35.464915 kiops : TOTAL 18.568002
>>> [INVALID]
>>>
>>> *** pool=ec-8-2 NP=360 ***
>>> IO500 version io500-sc22_v2 (standard)
>>> [RESULT] ior-easy-write 40.480456 GiB/s : time 151.451 seconds
>>> [INVALID]
>>> [RESULT] mdtest-easy-write 32.507690 kIOPS : time 444.424 seconds
>>> [      ] timestamp 0.000000 kIOPS : time 0.000 seconds
>>> [RESULT] ior-hard-write 0.570092 GiB/s : time 35.986 seconds
>>> [INVALID]
>>> [RESULT] mdtest-hard-write 3.287144 kIOPS : time 40.114 seconds
>>> [INVALID]
>>> [RESULT] find 1779.068273 kIOPS : time 8.177 seconds
>>> [RESULT] ior-easy-read 56.463968 GiB/s : time 108.661 seconds
>>> [RESULT] mdtest-easy-stat 179.334380 kIOPS : time 81.380 seconds
>>> [RESULT] ior-hard-read 1.957840 GiB/s : time 10.484 seconds
>>> [RESULT] mdtest-hard-stat 92.430508 kIOPS : time 2.402 seconds
>>> [RESULT] mdtest-easy-delete 29.549239 kIOPS : time 489.285 seconds
>>> [RESULT] mdtest-hard-read 26.989114 kIOPS : time 5.770 seconds
>>> [RESULT] mdtest-hard-delete 26.500674 kIOPS : time 6.038 seconds
>>> [SCORE ] Bandwidth 7.106974 GiB/s : IOPS 53.448254 kiops : TOTAL 19.489879
>>> [INVALID]
>>> ```
>>>
>>> Best wishes,
>>> Manuel
>>>
>>> [1] https://croit.io/blog/ceph-performance-test-and-optimization
>>> [2] https://io500.org/submissions/view/82
>>> [3] https://io500.org/submissions/view/141
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx