Re: Tuning CephFS on NVME for HPC / IO500

Dear Sebastian,

Thank you for this insight. It sounds like something that is easy to try.

Does this relate to the Ceph cluster nodes (the servers) specifically?

My use case is CephFS only. All my clients are Intel-based and strictly
separated from the Ceph servers. Everything is bare metal.

Most of the information I found on IOMMU was related to virtualization
environments and VM hosts. In particular, people running hyperconverged PVE
and virtualized/multi-GPU setups appear to be interested in IOMMU.

Best wishes,
Manuel


Sebastian <sebcio.t@xxxxxxxxx> wrote on Sat, 3 Dec 2022, 22:51:

> One thing to add to this discussion:
> I had a lot of problems with my clusters and spent some time debugging.
> What I found, and confirmed on AMD nodes, is that everything starts working
> like a charm once I add the kernel parameter iommu=pt.
> There are some other tunings as well that I cannot share in full right now,
> but iommu=pt alone should already help.
> Initially it looked like something in the kernel network stack was slowing
> down packets.
>
> BR,
> Sebastian
>
> > On 2 Dec 2022, at 16:03, Manuel Holtgrewe <zyklenfrei@xxxxxxxxx> wrote:
> >
> > Dear Mark,
> >
> > Thank you very much for all of this information. I learned a lot! In
> > particular, I realized that I need to learn more about pinning.
> >
> > In the end, I want to run the whole thing in production with real-world
> > workloads. My main aim in running the benchmark is to ensure that my
> > hardware and OS are correctly configured. Along the way I already found
> > some configuration issues in my switches (lack of balancing between LAG
> > interconnects until I used layer 3+4 hashing when creating my bonds,
> > particularities of Dell VLTi, and the need for unique VLT IDs...). Also,
> > it will be interesting to see how things turn out after the cluster has
> > run for a year.
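> >
> > For reference, the bond hashing fix mentioned above boiled down to roughly
> > the following (a sketch; I use NetworkManager here, and the connection name
> > is a placeholder):
> >
> > ```
> > # Sketch: LACP (802.3ad) bond hashed on layer 3+4 so that multiple flows
> > # between the same pair of hosts can spread across the LAG members.
> > nmcli con mod bond0 bond.options \
> >     "mode=802.3ad,xmit_hash_policy=layer3+4,miimon=100"
> > nmcli con up bond0
> > ```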
> >
> > As far as I can see, the network and OS configuration is sane. The Ceph
> > configuration appears to be not too far off from something that I could
> > hand to my users.
> >
> > I will try to play a bit more with the pinning and metadata tuning.
> >
> > Best wishes,
> > Manuel
> >
> > Mark Nelson <mnelson@xxxxxxxxxx> wrote on Thu, 1 Dec 2022, 20:19:
> >
> >> Hi Manuel,
> >>
> >>
> >> I did the IO500 runs back in 2020 and wrote the cephfs aiori backend for
> >> IOR/mdtest.  Not sure about the segfault, it's been a while since I've
> >> touched that code.  It was working the last time I used it. :D  Having
> >> said that, I don't think that's your issue.   The userland backend
> >> helped work around an issue where I wasn't able to exceed about 3GB/s
> >> per host with the kernel client and thus couldn't hit more than about
> >> 30GB/s in the easy tests on a 10 node setup.  I think Jeff Layton might
> >> have fixed that issue when he improved the locking code in the kernel a
> >> while back and it appears you are getting good results with the kernel
> >> client in the easy tests.  I don't recall the userland backend
> >> performing much differently than the kernel client in the other tests.
> >> Instead I would recommend looking at each test individually:
> >>
> >>
> >> ior-easy-write (and read):
> >>
> >> Each process gets its own file, large aligned IO.  Pretty easy for the
> >> MDS and the rest of Ceph to handle.  You get better results overall than
> >> I did!  These are the tests we typically do best on out of the box.
> >>
> >>
> >> mdtest-easy-write (and stat/delete):
> >>
> >> Each process gets its own directory, writing out zero-sized files.  The
> >> trick to getting good performance here is to use ephemeral pinning on
> >> the parent test directory.  Even better would be to use static round
> >> robin pinning for each rank's sub-directory.  Sadly that violates the
> >> rules now and we haven't implemented a way to do this with a single
> >> parent-level xattr (though it would be pretty easy, which makes the rule
> >> not to touch the subdirs kind of silly imho).  I was able to achieve up
> >> to around 10K IOPs per MDS, with the highest achieved score around
> >> 400-500K IOPS with 80 MDSes (but that configuration was suboptimal for
> >> other tests).  Ephemeral pinning is ok, but you need enough directories
> >> to avoid "clumpy" distribution across MDSes.  At ~320
> >> processes/directories and 40 MDSes I was seeing about half the
> >> performance vs doing perfect round-robin pinning of the individual
> >> process directories.  Well, with one exception:  When doing manual
> >> pinning, it's better to exclude the authoritative MDS for the parent
> >> directory (or perhaps just give it fewer directories than the others)
> >> since it's also doing other work and ends up lagging behind slowing the
> >> whole benchmark down.  Having said that, this is one of the easier tests
> >> to improve so long as you use some kind of reasonable pinning strategy
> >> with multiple MDSes.
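> >>
> >> For reference, a minimal sketch of what I mean by pinning, assuming a
> >> kernel-client mount at /mnt/cephfs and an io500 test directory that you
> >> create before the run (the paths and the rank number are placeholders):
> >>
> >> ```
> >> # Distributed ephemeral pinning: spread the per-rank subdirectories of
> >> # the mdtest-easy parent across all active MDS ranks.
> >> setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/io500/mdtest-easy
> >>
> >> # Static pinning instead: pin one rank's subdirectory to a specific MDS
> >> # rank (here rank 3); repeat round-robin over the subdirectories while
> >> # skipping the MDS that is authoritative for the parent.
> >> setfattr -n ceph.dir.pin -v 3 /mnt/cephfs/io500/mdtest-easy/rank.0042
> >> ```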
> >>
> >>
> >> ior-hard-write (and read):
> >>
> >> Small unaligned IO to a single shared file.  I think it's ~47 KiB I/Os.
> >> This is rough to improve without code changes imho.  I remember the
> >> results being highly variable in my tests, and it took multiple runs to
> >> get a high score.  I don't remember exactly what I had to tweak here,
> >> but as opposed to the easy tests you are likely heavily latency bound
> >> even with ~47 KiB I/Os.  I expect you are going to be slamming a single OSD
> >> (and PG!) over and over from multiple clients and constrained by how
> >> quickly you can get those IOs replicated (for writes when rep > 1) and
> >> locks acquired/released (in all cases).  I'm guessing that ensuring the
> >> lowest possible per-OSD latency and highest per-OSD throughput is
> >> probably a big win here.  Not sure what on the CephFS side might be
> >> playing a role, but I imagine caps and file-level locking might matter.
> >> You can imagine that a system that lets you just dump IO as a log-append
> >> straight to disk, with some kind of clever scheme to avoid file-based
> >> locking, would do better here.
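> >>
> >> If you want to check whether per-OSD latency is the limiter, even something
> >> as simple as sampling the per-OSD commit/apply latencies while ior-hard is
> >> running can be telling (a rough sketch):
> >>
> >> ```
> >> # Sample per-OSD commit/apply latency once per second during the run and
> >> # watch for individual OSDs standing out.
> >> watch -n 1 ceph osd perf
> >> ```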
> >>
> >>
> >> mdtest-hard-write (and stat/delete):
> >>
> >> All processes write 3901-byte files to a single directory.  Dirfrag
> >> splitting and exporting are a huge bottleneck.  The balancing code in the
> >> MDS can basically DDOS itself to the point where in a 30s (or even a 5
> >> minute!) test you never actually export anything to other MDSes.  You end
> >> up both servicing all requests on the authoritative MDS and simultaneously
> >> doing a bunch of work trying and failing to acquire locks
> >> to do the dirfrag exports.  If you do manage to actually get dirfrags
> >> onto other MDSes it can lead to performance improvements, but even then
> >> there are cyclical near-stalls in throughput that tank performance,
> >> likely related to further splitting and attempting to export dirfrags.
> >> As the subtree map grows, journal writes on the authoritative MDS for
> >> the parent directory become increasingly expensive.  If I recall correctly,
> >> it took a lot of
> >> screwing around with MDS and client counts to get a good result, and
> >> luck played a role too like in the ior-hard tests.  It was easy to do
> >> worse than just pinning to a single MDS.  FWIW, I usually saw higher
> >> aggregate performance with longer-running tests than I did with
> >> shorter-running ones.
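> >>
> >> Just to spell out what "pinning to a single MDS" looks like (the path for
> >> the mdtest-hard parent directory is a placeholder):
> >>
> >> ```
> >> # Pin the shared mdtest-hard directory (and everything below it) to MDS
> >> # rank 0 so the run is not disturbed by dirfrag export churn.
> >> setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/io500/mdtest-hard
> >> ```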
> >>
> >>
> >> find:
> >>
> >> Find a subset of the files created in the 4 tests above.  A bit of a
> >> ridiculous test, frankly.  Results are highly dependent on the number of
> >> files created in the easy vs. hard mdtest cases above.  The higher you
> >> skew toward easy tests, the better the find number becomes.  They should
> >> have separate find tests for easy mdtest and hard mdtest files and just
> >> ignore IOR entirely.
> >>
> >>
> >> FWIW there are some long-running efforts to improve some of the
> >> bottlenecks I mentioned, especially during subtree map journal writes.
> >> Zheng had a PR a while back but it was fairly complex and never got
> >> merged.  I believe Patrick is taking a crack at it now using a different
> >> approach.  FWIW, there are also a couple of good links from Matt
> >> Rásó-Barnett (Cambridge) and Glenn Lockwood (Formerly at NERSC, now
> >> heading up HPC IO strategy at Microsoft) that talk about some of the IO500
> >> tests and the good and bad here:
> >>
> >>
> >>
> >> https://www.eofs.eu/_media/events/lad19/03_matt_raso-barnett-io500-cambridge.pdf
> >>
> >> https://www.glennklockwood.com/benchmarks/io500.html
> >>
> >>
> >> Mark
> >>
> >>
> >> On 12/1/22 01:26, Manuel Holtgrewe wrote:
> >>> Dear all,
> >>>
> >>> I am currently creating a CephFS setup for an HPC setting. I have a Ceph
> >>> v17.2.5 cluster on Rocky Linux 8.7 (kernel 4.18.0-425.3.1.el8.x86_64)
> >>> deployed with cephadm. I have 10 Ceph nodes with 2x100GbE LAG interconnect
> >>> and 36 client nodes with 2x25GbE LAG interconnect. We have Dell NOS10
> >>> switches deployed in VLT pairs. Overall, the network topology looks as
> >>> follows.
> >>>
> >>> 36 clients -- switch pair -- switch pair -- switch pair -- 10 Ceph nodes
> >>>
> >>> The switch pairs are each connected with 8x100GbE LAG overall. Thus, the
> >>> theoretical network limit is ~80GB/sec.
> >>>
> >>> The client nodes also run Rocky Linux 8 and have 2x Intel(R) Xeon(R) Gold
> >>> 6240R CPUs @ 2.40GHz. The Ceph nodes have 1x AMD EPYC 7413 24-core
> >>> processor and 250GB of RAM. All processors have hyperthreading enabled. I
> >>> have followed the guidance by Croit [1] and done the obvious hardware
> >>> tuning (configured the BIOS to let the OS do the power control, set up the
> >>> network with MTU 9000). I have deployed 3 MDS daemons per server and have
> >>> 20 active overall.
> >>>
> >>> The Ceph cluster nodes have 10x enterprise NVMe drives each (all branded as
> >>> "Dell enterprise disks"): the 8 older nodes (from last year) have "Dell Ent
> >>> NVMe v2 AGN RI U.2 15.36TB" drives, which are Samsung, and the 2 newer nodes
> >>> (just delivered) have "Dell Ent NVMe CM6 RI 15.36TB" drives, which are
> >>> Kioxia. Interestingly, the Kioxia disks show about 50% higher IOPS in the
> >>> 4-process fio test that Croit suggests.
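> >>>
> >>> For context, the per-disk test was roughly along these lines; I am quoting
> >>> from memory rather than the exact command from the Croit article, and the
> >>> device path is a placeholder (this is destructive on the raw device):
> >>>
> >>> ```
> >>> # Rough sketch of a 4-job 4k random-write test against a raw NVMe device.
> >>> # WARNING: this destroys data on the device.
> >>> fio --name=nvme-test --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
> >>>     --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --group_reporting \
> >>>     --runtime=60 --time_based
> >>> ```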
> >>>
> >>> I'm running the IO500 benchmark with 10 processes on each client. I have
> >>> pools set up with rep-1, rep-2, rep-3, and EC 8+2 and run the benchmarks
> >>> against each.
> >>>
> >>> So far, I have run "only" short tests with an IO500 wall clock time of 30
> >>> secs. The good news for me is that I see "ior-easy-write" results of
> >>> ~80GiB/sec, so the Ceph cluster is able to saturate the switch network
> >>> interconnects. The bad news is that I cannot replicate the IO500 results
> >>> from Red Hat in 2020.
> >>>
> >>> Below are the results that I get on the rep-1 pool.
> >>>
> >>> ```
> >>> IO500 version io500-sc22_v2 (standard)
> >>> [RESULT]       ior-easy-write       78.772830 GiB/s : time 97.098
> seconds
> >>> [INVALID]
> >>> [RESULT]    mdtest-easy-write       37.375945 kIOPS : time 870.934
> >> seconds
> >>> [      ]            timestamp        0.000000 kIOPS : time 0.000
> seconds
> >>> [RESULT]       ior-hard-write        2.242241 GiB/s : time 35.431
> seconds
> >>> [INVALID]
> >>> [RESULT]    mdtest-hard-write        2.575028 kIOPS : time 57.697
> seconds
> >>> [INVALID]
> >>> [RESULT]                 find     1072.770588 kIOPS : time 30.441
> seconds
> >>> [RESULT]        ior-easy-read       64.118118 GiB/s : time 118.982
> >> seconds
> >>> [RESULT]     mdtest-easy-stat      154.903631 kIOPS : time 210.887
> >> seconds
> >>> [RESULT]        ior-hard-read        4.285418 GiB/s : time 18.474
> seconds
> >>> [RESULT]     mdtest-hard-stat       40.126159 kIOPS : time 4.646
> seconds
> >>> [RESULT]   mdtest-easy-delete       39.296673 kIOPS : time 839.509
> >> seconds
> >>> [RESULT]     mdtest-hard-read       17.161306 kIOPS : time 9.505
> seconds
> >>> [RESULT]   mdtest-hard-delete        4.771440 kIOPS : time 31.931
> seconds
> >>> [SCORE ] Bandwidth 14.842537 GiB/s : IOPS 34.623082 kiops : TOTAL
> >> 22.669239
> >>> [INVALID]
> >>> ```
> >>>
> >>> I wonder whether I missed any tuning parameters or other "secret sauce"
> >>> that enabled the results from [2]:
> >>>
> >>> ```
> >>> [RESULT] BW   phase 1            ior_easy_write               36.255
> >> GiB/s
> >>> : time 387.94 seconds
> >>> [RESULT] IOPS phase 1         mdtest_easy_write              191.980
> >> kiops
> >>> : time 450.05 seconds
> >>> [RESULT] BW   phase 2            ior_hard_write                9.137
> >> GiB/s
> >>> : time 301.21 seconds
> >>> [RESULT] IOPS phase 2         mdtest_hard_write               17.187
> >> kiops
> >>> : time 393.55 seconds
> >>> [RESULT] IOPS phase 3                      find              965.790
> >> kiops
> >>> : time  96.46 seconds
> >>> [RESULT] BW   phase 3             ior_easy_read               75.621
> >> GiB/s
> >>> : time 185.75 seconds
> >>> [RESULT] IOPS phase 4          mdtest_easy_stat              903.112
> >> kiops
> >>> : time  95.67 seconds
> >>> [RESULT] BW   phase 4             ior_hard_read               19.080
> >> GiB/s
> >>> : time 144.22 seconds
> >>> [RESULT] IOPS phase 5          mdtest_hard_stat               97.399
> >> kiops
> >>> : time  69.44 seconds
> >>> [RESULT] IOPS phase 6        mdtest_easy_delete              123.455
> >> kiops
> >>> : time 699.85 seconds
> >>> [RESULT] IOPS phase 7          mdtest_hard_read               87.512
> >> kiops
> >>> : time  77.29 seconds
> >>> [RESULT] IOPS phase 8        mdtest_hard_delete               18.814
> >> kiops
> >>> : time 390.91 seconds
> >>> [SCORE] Bandwidth 26.2933 GiB/s : IOPS 124.297 kiops : TOTAL 57.168
> >>> ```
> >>>
> >>> It looks like my results are more in line with the SUSE results from
> >>> 2019 [3].
> >>>
> >>> ```
> >>> [RESULT] BW   phase 1            ior_easy_write               16.072
> >> GB/s :
> >>> time 347.39 seconds
> >>> [RESULT] IOPS phase 1         mdtest_easy_write               32.822
> >> kiops
> >>> : time 365.67 seconds
> >>> [RESULT] BW   phase 2            ior_hard_write                1.572
> >> GB/s :
> >>> time 359.20 seconds
> >>> [RESULT] IOPS phase 2         mdtest_hard_write               12.917
> >> kiops
> >>> : time 317.70 seconds
> >>> [RESULT] IOPS phase 3                      find              250.500
> >> kiops
> >>> : time  64.28 seconds
> >>> [RESULT] BW   phase 3             ior_easy_read                9.139
> >> GB/s :
> >>> time 600.48 seconds
> >>> [RESULT] IOPS phase 4          mdtest_easy_stat              127.919
> >> kiops
> >>> : time  93.82 seconds
> >>> [RESULT] BW   phase 4             ior_hard_read                4.698
> >> GB/s :
> >>> time 120.17 seconds
> >>> [RESULT] IOPS phase 5          mdtest_hard_stat               68.791
> >> kiops
> >>> : time  59.65 seconds
> >>> [RESULT] IOPS phase 6        mdtest_easy_delete               20.845
> >> kiops
> >>> : time 575.70 seconds
> >>> [RESULT] IOPS phase 7          mdtest_hard_read               41.640
> >> kiops
> >>> : time  98.55 seconds
> >>> [RESULT] IOPS phase 8        mdtest_hard_delete                6.224
> >> kiops
> >>> : time 660.50 seconds
> >>> [SCORE] Bandwidth 5.73936 GB/s : IOPS 38.7169 kiops : TOTAL 14.9067
> >>> ```
> >>>
> >>> One difference I could find is that the Red Hat results used the CEPHFS
> >>> backend of IO500, which I cannot get to work properly because of a crash
> >>> ("Caught signal 11 (Segmentation fault: address not mapped to object at
> >>> address (nil))" in libucs.so). SUSE used the POSIX backend.
> >>>
> >>> Changing from 1 OSD daemon per NVMe to 2 did not help much either.
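> >>>
> >>> (In case it matters, the way I express "2 OSDs per NVMe" with cephadm is
> >>> roughly the following; sketched from memory, and the service id and host
> >>> selection are placeholders:)
> >>>
> >>> ```
> >>> # Sketch: deploy two OSDs per non-rotational device via an OSD service spec.
> >>> cat > osd.yaml <<'EOF'
> >>> service_type: osd
> >>> service_id: nvme-two-per-device
> >>> placement:
> >>>   host_pattern: '*'
> >>> spec:
> >>>   data_devices:
> >>>     rotational: 0
> >>>   osds_per_device: 2
> >>> EOF
> >>> ceph orch apply -i osd.yaml
> >>> ```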
> >>>
> >>> Maybe someone on the list has an idea for something else to try?
> >>>
> >>> Oh, in case anyone is interested, here are some results using the rep-2,
> >>> rep-3, and ec-8-2 pools.
> >>>
> >>> ```
> >>>                       *** pool=rep-2 NP=360 ***
> >>> IO500 version io500-sc22_v2 (standard)
> >>> [RESULT]       ior-easy-write       39.613736 GiB/s : time 153.508
> >> seconds
> >>> [INVALID]
> >>> [RESULT]    mdtest-easy-write       13.932462 kIOPS : time 38.119
> seconds
> >>> [INVALID]
> >>> [      ]            timestamp        0.000000 kIOPS : time 0.000
> seconds
> >>> [RESULT]       ior-hard-write        1.809117 GiB/s : time 39.019
> seconds
> >>> [INVALID]
> >>> [RESULT]    mdtest-hard-write        4.925225 kIOPS : time 37.654
> seconds
> >>> [INVALID]
> >>> [RESULT]                 find       69.063353 kIOPS : time 9.042
> seconds
> >>> [RESULT]        ior-easy-read       59.503973 GiB/s : time 102.166
> >> seconds
> >>> [RESULT]     mdtest-easy-stat      143.589003 kIOPS : time 4.097
> seconds
> >>> [RESULT]        ior-hard-read        4.104325 GiB/s : time 14.868
> seconds
> >>> [RESULT]     mdtest-hard-stat      156.252159 kIOPS : time 2.204
> seconds
> >>> [RESULT]   mdtest-easy-delete       35.312782 kIOPS : time 14.249
> seconds
> >>> [RESULT]     mdtest-hard-read       67.097465 kIOPS : time 3.739
> seconds
> >>> [RESULT]   mdtest-hard-delete       11.018869 kIOPS : time 18.060
> seconds
> >>> [SCORE ] Bandwidth 11.502045 GiB/s : IOPS 35.927582 kiops : TOTAL
> >> 20.328322
> >>> [INVALID]
> >>>
> >>>
> >>>                       *** pool=rep-3 NP=360 ***
> >>> IO500 version io500-sc22_v2 (standard)
> >>> [RESULT]       ior-easy-write       27.481332 GiB/s : time 204.973
> >> seconds
> >>> [INVALID]
> >>> [RESULT]    mdtest-easy-write       27.699574 kIOPS : time 1502.596
> >> seconds
> >>> [      ]            timestamp        0.000000 kIOPS : time 0.000
> seconds
> >>> [RESULT]       ior-hard-write        1.352186 GiB/s : time 38.273
> seconds
> >>> [INVALID]
> >>> [RESULT]    mdtest-hard-write        3.024279 kIOPS : time 48.923
> seconds
> >>> [INVALID]
> >>> [RESULT]                 find      777.440295 kIOPS : time 53.684
> seconds
> >>> [RESULT]        ior-easy-read       58.686272 GiB/s : time 95.992
> seconds
> >>> [RESULT]     mdtest-easy-stat      156.499256 kIOPS : time 266.755
> >> seconds
> >>> [RESULT]        ior-hard-read        4.095575 GiB/s : time 12.649
> seconds
> >>> [RESULT]     mdtest-hard-stat       62.831560 kIOPS : time 3.318
> seconds
> >>> [RESULT]   mdtest-easy-delete       25.909017 kIOPS : time 1606.960
> >> seconds
> >>> [RESULT]     mdtest-hard-read       16.586529 kIOPS : time 9.735
> seconds
> >>> [RESULT]   mdtest-hard-delete        9.093536 kIOPS : time 18.615
> seconds
> >>> [SCORE ] Bandwidth 9.721458 GiB/s : IOPS 35.464915 kiops : TOTAL
> >> 18.568002
> >>> [INVALID]
> >>>
> >>>
> >>>                       *** pool=ec-8-2 NP=360 ***
> >>> IO500 version io500-sc22_v2 (standard)
> >>> [RESULT]       ior-easy-write       40.480456 GiB/s : time 151.451
> >> seconds
> >>> [INVALID]
> >>> [RESULT]    mdtest-easy-write       32.507690 kIOPS : time 444.424
> >> seconds
> >>> [      ]            timestamp        0.000000 kIOPS : time 0.000
> seconds
> >>> [RESULT]       ior-hard-write        0.570092 GiB/s : time 35.986
> seconds
> >>> [INVALID]
> >>> [RESULT]    mdtest-hard-write        3.287144 kIOPS : time 40.114
> seconds
> >>> [INVALID]
> >>> [RESULT]                 find     1779.068273 kIOPS : time 8.177
> seconds
> >>> [RESULT]        ior-easy-read       56.463968 GiB/s : time 108.661
> >> seconds
> >>> [RESULT]     mdtest-easy-stat      179.334380 kIOPS : time 81.380
> seconds
> >>> [RESULT]        ior-hard-read        1.957840 GiB/s : time 10.484
> seconds
> >>> [RESULT]     mdtest-hard-stat       92.430508 kIOPS : time 2.402
> seconds
> >>> [RESULT]   mdtest-easy-delete       29.549239 kIOPS : time 489.285
> >> seconds
> >>> [RESULT]     mdtest-hard-read       26.989114 kIOPS : time 5.770
> seconds
> >>> [RESULT]   mdtest-hard-delete       26.500674 kIOPS : time 6.038
> seconds
> >>> [SCORE ] Bandwidth 7.106974 GiB/s : IOPS 53.448254 kiops : TOTAL
> >> 19.489879
> >>> [INVALID]
> >>> ```
> >>>
> >>> Best wishes,
> >>> Manuel
> >>>
> >>> [1] https://croit.io/blog/ceph-performance-test-and-optimization
> >>> [2] https://io500.org/submissions/view/82
> >>> [3] https://io500.org/submissions/view/141
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



