Re: Tuning CephFS on NVME for HPC / IO500

One thing to add to this discussion.
I had a lot of problems with my clusters and spent some time debugging.
What I found, and confirmed on AMD nodes, is that everything starts working
like a charm once I added the kernel parameter iommu=pt.
Plus some other tunings that I can't share in full right now, but this
iommu=pt alone should help.
At first it looked like something in the kernel network stack was slowing
down packets.
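
For reference, a minimal check for whether iommu=pt is actually active on a
node (how you add it depends on the bootloader; typically it goes into
GRUB_CMDLINE_LINUX followed by a grub config rebuild and a reboot):

```
# Minimal check: was the running kernel booted with iommu=pt?
from pathlib import Path

params = set(Path("/proc/cmdline").read_text().split())
if "iommu=pt" in params:
    print("OK: iommu=pt is active")
else:
    print("WARNING: iommu=pt not set; add it to the kernel command line and reboot")
```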

BR,
Sebastian

> On 2 Dec 2022, at 16:03, Manuel Holtgrewe <zyklenfrei@xxxxxxxxx> wrote:
> 
> Dear Mark.
> 
> Thank you very much for all of this information. I learned a lot! In
> particular that I need to learn more about pinning.
> 
> In the end, I want to run the whole thing in production with real-world
> workloads. My main aim in running the benchmark is to ensure that my
> hardware and OS are correctly configured (I already found some configuration
> issues in my switches along the way: lack of balancing between LAG
> interconnects and using layer 3+4 hashing when creating my bonds,
> particularities of Dell VLTi and needing unique VLT IDs...). Also, it will
> be interesting to see how things turn out after the cluster has run
> for a year.
> 
> As far as I can see, network and OS configuration is sane. Ceph
> configuration appears to be not too far off something that I could hand to
> my users.
> 
> I will try to play a bit more with the pinning and metadata tuning.
> 
> Best wishes,
> Manuel
> 
> Mark Nelson <mnelson@xxxxxxxxxx> schrieb am Do., 1. Dez. 2022, 20:19:
> 
>> Hi Manuel,
>> 
>> 
>> I did the IO500 runs back in 2020 and wrote the cephfs aiori backend for
>> IOR/mdtest.  Not sure about the segfault, it's been a while since I've
>> touched that code.  It was working the last time I used it. :D  Having
>> said that, I don't think that's your issue.   The userland backend
>> helped work around an issue where I wasn't able to exceed about 3GB/s
>> per host with the kernel client and thus couldn't hit more than about
>> 30GB/s in the easy tests on a 10 node setup.  I think Jeff Layton might
>> have fixed that issue when he improved the locking code in the kernel a
>> while back and it appears you are getting good results with the kernel
>> client in the easy tests.  I don't recall the userland backend
>> performing much different than the kernel client in the other tests.
>> Instead I would recommend looking at each test individually:
>> 
>> 
>> ior-easy-write (and read):
>> 
>> Each process gets its own file, large aligned IO.  Pretty easy for the
>> MDS and the rest of Ceph to handle.  You get better results overall than
>> I did!  These are the tests we typically do best on out of the box.
>> 
>> 
>> mdtest-easy-write (and stat/delete):
>> 
>> Each process gets its own directory, writing out zero-sized files.  The
>> trick to getting good performance here is to use ephemeral pinning on
>> the parent test directory.  Even better would be to use static round
>> robin pinning for each rank's sub-directory.  Sadly that violates the
>> rules now and we haven't implemented a way to do this with a single
>> parent level xattr (though it would be pretty easy which makes the rule
>> not to touch the subdirs kind of silly imho).  I was able to achieve up
>> to around 10K IOPs per MDS, with the highest achieved score around
>> 400-500K IOPS with 80 MDSes (but that configuration was suboptimal for
>> other tests).  Ephemeral pinning is ok, but you need enough directories
>> to avoid "clumpy" distribution across MDSes.  At ~320
>> processes/directories and 40 MDSes I was seeing about half the
>> performance vs doing perfect round-robin pinning of the individual
>> process directories.  Well, with one exception:  When doing manual
>> pinning, it's better to exclude the authoritative MDS for the parent
>> directory (or perhaps just give it fewer directories than the others)
>> since it's also doing other work and ends up lagging behind, slowing the
>> whole benchmark down.  Having said that, this is one of the easier tests
>> to improve so long as you use some kind of reasonable pinning strategy
>> with multiple MDSes.
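>> 
>> As a rough illustration, both strategies can be expressed as CephFS
>> directory xattrs; the mount path, directory naming, and MDS/process
>> counts below are made-up placeholders:
>> 
>> ```
>> import os
>> 
>> MDTEST_ROOT = "/mnt/cephfs/io500/mdtest-easy"  # hypothetical test dir
>> NUM_MDS = 20     # active MDS ranks (placeholder)
>> NUM_PROCS = 360  # mdtest processes, one subdirectory each (placeholder)
>> 
>> # Option A: ephemeral distributed pinning on the parent directory only.
>> # Subdirectories get hashed across the active MDS ranks; distribution
>> # can be "clumpy" when there are few directories per MDS.
>> os.setxattr(MDTEST_ROOT, "ceph.dir.pin.distributed", b"1")
>> 
>> # Option B: explicit round-robin pinning of each process's subdirectory,
>> # skipping rank 0 as a stand-in for the parent's authoritative MDS.
>> for proc in range(NUM_PROCS):
>>     subdir = os.path.join(MDTEST_ROOT, f"dir.{proc}")  # made-up naming
>>     os.makedirs(subdir, exist_ok=True)
>>     rank = 1 + proc % (NUM_MDS - 1)
>>     os.setxattr(subdir, "ceph.dir.pin", str(rank).encode())
>> ```
>> 
>> Option A is the low-effort route; option B corresponds to the per-rank
>> round-robin pinning described above, which as noted now falls outside
>> the benchmark rules.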
>> 
>> 
>> ior-hard-write (and read):
>> 
>> Small unaligned IO to a single shared file.  I think the I/Os are ~47 KB each.
>> This is rough to improve without code changes imho.  I remember the
>> results being highly variable in my tests, and it took multiple runs to
>> get a high score.  I don't remember exactly what I had to tweak here,
>> but as opposed to the easy tests you are likely heavily latency bound
>> even with ~47 KB I/Os.  I expect you are going to be slamming a single OSD
>> (and PG!) over and over from multiple clients and constrained by how
>> quickly you can get those IOs replicated (for writes when rep > 1) and
>> locks acquired/released (in all cases).  I'm guessing that ensuring the
>> lowest possible per-OSD latency and highest per-OSD throughput is
>> probably a big win here.  Not sure what on the CephFS side might be
>> playing a role, but I imagine caps and file level locking might matter.
>> You can imagine that a system that let you just dump IO as a log-append
>> straight to disk with some kind of clever scheme to avoid file based
>> locking would do better here.
>> 
>> 
>> mdtest-hard-write (and stat/delete):
>> 
>> All processes writing 3901 byte files to a single directory. dirfrag
>> splitting and exporting is a huge bottleneck.  The balancing code in the
>> MDS can basically DDOS itself to the point where in a 30s (or even a 5
>> minute!) test you never actually export anything to other MDSes.  You
>> both end up servicing all requests on the authoritative MDS while
>> simultaneously doing a bunch of work trying and failing to acquire locks
>> to do the dirfrag exports.  If you do manage to actually get dirfrags
>> onto other MDSes it can lead to performance improvements, but even then
>> there are cyclical near-stalls in throughput that tank performance,
>> likely related to further splitting and attempting to export dirfrags.
>> As the subtree map grows, journal writes on the authoritative MDS for
>> the parent directory become increasingly expensive.  If I recall, it took a lot of
>> screwing around with MDS and client counts to get a good result, and
>> luck played a role too like in the ior-hard tests.  It was easy to do
>> worse than just pinning to a single MDS.  FWIW, I usually saw higher
>> aggregate performance with longer-running tests than I did with
>> shorter-running tests.
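>> 
>> To make the knobs concrete: the splitting behaviour above is governed
>> by a few MDS options (mds_bal_split_size, mds_bal_split_bits,
>> mds_bal_fragment_size_max, mds_bal_interval).  A read-only sketch to
>> dump their current values, assuming the ceph CLI and an admin keyring
>> are available on the node running it:
>> 
>> ```
>> import subprocess
>> 
>> OPTIONS = [
>>     "mds_bal_split_size",         # dirfrag size that triggers a split
>>     "mds_bal_split_bits",         # how many ways a dirfrag is split
>>     "mds_bal_fragment_size_max",  # hard cap on entries per dirfrag
>>     "mds_bal_interval",           # seconds between balancer runs
>> ]
>> 
>> for opt in OPTIONS:
>>     out = subprocess.run(["ceph", "config", "get", "mds", opt],
>>                          capture_output=True, text=True, check=True)
>>     print(f"{opt} = {out.stdout.strip()}")
>> ```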
>> 
>> 
>> find:
>> 
>> Find a subset of the files created in the 4 above tests.  A bit of a
>> ridiculous test, frankly.  Results are highly dependent on the number of
>> files created in the easy vs hard mdtest cases above. The higher you
>> skew toward easy tests, the better the find number becomes.  They should
>> have separate find tests for easy mdtest and hard mdtest files and just
>> ignore IOR entirely.
>> 
>> 
>> FWIW there are some long-running efforts to improve some of the
>> bottlenecks I mentioned, especially during subtree map journal writes.
>> Zheng had a PR a while back but it was fairly complex and never got
>> merged.  I believe Patrick is taking a crack at it now using a different
>> approach.  FWIW, there are also a couple of good links from Matt
>> Rásó-Barnett (Cambridge) and Glenn Lockwood (Formerly at NERSC, now
>> heading up HPC IO strategy at Microsoft) that talk about some of the IO500
>> tests and the good and bad here:
>> 
>> 
>> https://www.eofs.eu/_media/events/lad19/03_matt_raso-barnett-io500-cambridge.pdf
>> 
>> https://www.glennklockwood.com/benchmarks/io500.html
>> 
>> 
>> Mark
>> 
>> 
>> On 12/1/22 01:26, Manuel Holtgrewe wrote:
>>> Dear all,
>>> 
>>> I am currently creating a CephFS setup for an HPC setting. I have a Ceph
>>> v17.2.5 Cluster on Rocky Linux 8.7 (Kernel 4.18.0-425.3.1.el8.x86_64)
>>> deployed with cephadm. I have 10 Ceph nodes with 2x100GbE LAG interconnect
>>> and 36 client nodes with 2x25GbE LAG interconnect. We have Dell NOS10
>>> switches deployed in VLT pairs. Overall, the network topology looks as
>>> follows.
>>> 
>>> 36 clients -- switch pair -- switch pair -- switch-pair -- 10 Ceph nodes
>>> 
>>> The switch pairs are each connected with 8x100GbE LAG overall. Thus, the
>>> theoretic network limit is ~80GB/sec.
>>> 
>>> The client nodes also run Rocky Linux 8 and have 2x Intel(R) Xeon(R) Gold
>>> 6240R CPU @ 2.40GHz CPUs. The Ceph nodes have 1x AMD EPYC 7413 24-Core
>>> Processor and 250GB of RAM. All processors have hyperthreading enabled. I
>>> have followed the guidance by Croit [1] and done the obvious hardware
>>> tuning (configured the BIOS to make the OS do the power control, set up
>>> the network with MTU 9000). I have deployed 3 MDS per server and have 20
>>> active overall.
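>>> 
>>> The MDS daemons themselves are placed by cephadm (3 per Ceph node
>>> here); the number of active ranks is then set via max_mds. A minimal
>>> sketch, assuming the file system is simply called "cephfs" and the
>>> ceph CLI is available:
>>> 
>>> ```
>>> import subprocess
>>> 
>>> FS_NAME = "cephfs"  # placeholder; substitute the real fs name
>>> 
>>> # With 30 MDS daemons deployed (3 per node across 10 nodes), make 20
>>> # of them active ranks; the remaining 10 stay as standbys.
>>> subprocess.run(["ceph", "fs", "set", FS_NAME, "max_mds", "20"],
>>>                check=True)
>>> ```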
>>> 
>>> The Ceph cluster nodes have 10x enterprise NVMEs each (all branded as "Dell
>>> enterprise disks"), 8 older nodes (last year) have "Dell Ent NVMe v2 AGN RI
>>> U.2 15.36TB" which are Samsung disks, 2 newer nodes (just delivered) have
>>> "Dell Ent NVMe CM6 RI 15.36TB" which are Kioxia disks. Interestingly, the
>>> Kioxia disks show about 50% higher IOPs in the 4-processor fio test that
>>> Croit suggests.
>>> 
>>> I'm running the IO500 benchmark with 10 processes each on the clients. I
>>> have pools set up with rep-1, rep-2, rep-3, and EC 8+2 and run the
>>> benchmarks.
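>>> 
>>> For reference, the EC 8+2 data pool was created along these lines
>>> (profile and pool names are placeholders; the replicated pools are
>>> analogous with size 1/2/3):
>>> 
>>> ```
>>> import subprocess
>>> 
>>> def ceph(*args):
>>>     # Thin wrapper; assumes the ceph CLI and admin keyring are present.
>>>     subprocess.run(["ceph", *args], check=True)
>>> 
>>> ceph("osd", "erasure-code-profile", "set", "ec-8-2",
>>>      "k=8", "m=2", "crush-failure-domain=host")
>>> ceph("osd", "pool", "create", "cephfs.data.ec-8-2", "erasure", "ec-8-2")
>>> # CephFS requires overwrites to be enabled on EC data pools.
>>> ceph("osd", "pool", "set", "cephfs.data.ec-8-2",
>>>      "allow_ec_overwrites", "true")
>>> ceph("fs", "add_data_pool", "cephfs", "cephfs.data.ec-8-2")
>>> ```
>>> 
>>> The benchmark directory is then pointed at the desired data pool via
>>> its ceph.dir.layout.pool xattr.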
>>> 
>>> So far, I have run "only" short tests with IO500 wall clock time of 30
>>> secs. Good results for me are that I see "ior-easy-write" results of
>>> 80GiB/sec so the Ceph cluster is able to saturate the switch network
>>> interconnects. Bad results for me are that I cannot replicate the IO500
>>> results from Red Hat in 2020.
>>> 
>>> Below are the results that I get on the rep-1 pool.
>>> 
>>> ```
>>> IO500 version io500-sc22_v2 (standard)
>>> [RESULT]       ior-easy-write       78.772830 GiB/s : time 97.098 seconds
>>> [INVALID]
>>> [RESULT]    mdtest-easy-write       37.375945 kIOPS : time 870.934 seconds
>>> [      ]            timestamp        0.000000 kIOPS : time 0.000 seconds
>>> [RESULT]       ior-hard-write        2.242241 GiB/s : time 35.431 seconds
>>> [INVALID]
>>> [RESULT]    mdtest-hard-write        2.575028 kIOPS : time 57.697 seconds
>>> [INVALID]
>>> [RESULT]                 find     1072.770588 kIOPS : time 30.441 seconds
>>> [RESULT]        ior-easy-read       64.118118 GiB/s : time 118.982 seconds
>>> [RESULT]     mdtest-easy-stat      154.903631 kIOPS : time 210.887 seconds
>>> [RESULT]        ior-hard-read        4.285418 GiB/s : time 18.474 seconds
>>> [RESULT]     mdtest-hard-stat       40.126159 kIOPS : time 4.646 seconds
>>> [RESULT]   mdtest-easy-delete       39.296673 kIOPS : time 839.509 seconds
>>> [RESULT]     mdtest-hard-read       17.161306 kIOPS : time 9.505 seconds
>>> [RESULT]   mdtest-hard-delete        4.771440 kIOPS : time 31.931 seconds
>>> [SCORE ] Bandwidth 14.842537 GiB/s : IOPS 34.623082 kiops : TOTAL 22.669239
>>> [INVALID]
>>> ```
>>> 
>>> I wonder whether I missed any tuning parameters or other "secret sauce"
>>> that enabled the results from [2]:
>>> 
>>> ```
>>> [RESULT] BW   phase 1            ior_easy_write               36.255 GiB/s : time 387.94 seconds
>>> [RESULT] IOPS phase 1         mdtest_easy_write              191.980 kiops : time 450.05 seconds
>>> [RESULT] BW   phase 2            ior_hard_write                9.137 GiB/s : time 301.21 seconds
>>> [RESULT] IOPS phase 2         mdtest_hard_write               17.187 kiops : time 393.55 seconds
>>> [RESULT] IOPS phase 3                      find              965.790 kiops : time  96.46 seconds
>>> [RESULT] BW   phase 3             ior_easy_read               75.621 GiB/s : time 185.75 seconds
>>> [RESULT] IOPS phase 4          mdtest_easy_stat              903.112 kiops : time  95.67 seconds
>>> [RESULT] BW   phase 4             ior_hard_read               19.080 GiB/s : time 144.22 seconds
>>> [RESULT] IOPS phase 5          mdtest_hard_stat               97.399 kiops : time  69.44 seconds
>>> [RESULT] IOPS phase 6        mdtest_easy_delete              123.455 kiops : time 699.85 seconds
>>> [RESULT] IOPS phase 7          mdtest_hard_read               87.512 kiops : time  77.29 seconds
>>> [RESULT] IOPS phase 8        mdtest_hard_delete               18.814 kiops : time 390.91 seconds
>>> [SCORE] Bandwidth 26.2933 GiB/s : IOPS 124.297 kiops : TOTAL 57.168
>>> ```
>>> 
>>> It looks like my results are more in the same order as the SUSE results
>>> from 2019 [3].
>>> 
>>> ```
>>> [RESULT] BW   phase 1            ior_easy_write               16.072 GB/s : time 347.39 seconds
>>> [RESULT] IOPS phase 1         mdtest_easy_write               32.822 kiops : time 365.67 seconds
>>> [RESULT] BW   phase 2            ior_hard_write                1.572 GB/s : time 359.20 seconds
>>> [RESULT] IOPS phase 2         mdtest_hard_write               12.917 kiops : time 317.70 seconds
>>> [RESULT] IOPS phase 3                      find              250.500 kiops : time  64.28 seconds
>>> [RESULT] BW   phase 3             ior_easy_read                9.139 GB/s : time 600.48 seconds
>>> [RESULT] IOPS phase 4          mdtest_easy_stat              127.919 kiops : time  93.82 seconds
>>> [RESULT] BW   phase 4             ior_hard_read                4.698 GB/s : time 120.17 seconds
>>> [RESULT] IOPS phase 5          mdtest_hard_stat               68.791 kiops : time  59.65 seconds
>>> [RESULT] IOPS phase 6        mdtest_easy_delete               20.845 kiops : time 575.70 seconds
>>> [RESULT] IOPS phase 7          mdtest_hard_read               41.640 kiops : time  98.55 seconds
>>> [RESULT] IOPS phase 8        mdtest_hard_delete                6.224 kiops : time 660.50 seconds
>>> [SCORE] Bandwidth 5.73936 GB/s : IOPS 38.7169 kiops : TOTAL 14.9067
>>> ```
>>> 
>>> One difference I could find is that the Red Hat results use the CEPHFS
>>> backend of IO500 (that I cannot get to work properly because of a crash
>>> "Caught signal 11 (Segmentation fault: address not mapped to object at
>>> address (nil))" in libucs.so. SUSE used the POSIX backend.
>>> 
>>> Changing from 1 OSD per NVMe to 2 OSDs per NVMe did not help much either.
>>> 
>>> Maybe someone on the list has an idea for something else to try?
>>> 
>>> Oh, in case anyone is interested, here are some results using the rep-2,
>>> rep-3, and ec-8-2 pool.
>>> 
>>> ```
>>>                       *** pool=rep-2 NP=360 ***
>>> IO500 version io500-sc22_v2 (standard)
>>> [RESULT]       ior-easy-write       39.613736 GiB/s : time 153.508 seconds
>>> [INVALID]
>>> [RESULT]    mdtest-easy-write       13.932462 kIOPS : time 38.119 seconds
>>> [INVALID]
>>> [      ]            timestamp        0.000000 kIOPS : time 0.000 seconds
>>> [RESULT]       ior-hard-write        1.809117 GiB/s : time 39.019 seconds
>>> [INVALID]
>>> [RESULT]    mdtest-hard-write        4.925225 kIOPS : time 37.654 seconds
>>> [INVALID]
>>> [RESULT]                 find       69.063353 kIOPS : time 9.042 seconds
>>> [RESULT]        ior-easy-read       59.503973 GiB/s : time 102.166 seconds
>>> [RESULT]     mdtest-easy-stat      143.589003 kIOPS : time 4.097 seconds
>>> [RESULT]        ior-hard-read        4.104325 GiB/s : time 14.868 seconds
>>> [RESULT]     mdtest-hard-stat      156.252159 kIOPS : time 2.204 seconds
>>> [RESULT]   mdtest-easy-delete       35.312782 kIOPS : time 14.249 seconds
>>> [RESULT]     mdtest-hard-read       67.097465 kIOPS : time 3.739 seconds
>>> [RESULT]   mdtest-hard-delete       11.018869 kIOPS : time 18.060 seconds
>>> [SCORE ] Bandwidth 11.502045 GiB/s : IOPS 35.927582 kiops : TOTAL 20.328322
>>> [INVALID]
>>> 
>>> 
>>>                       *** pool=rep-3 NP=360 ***
>>> IO500 version io500-sc22_v2 (standard)
>>> [RESULT]       ior-easy-write       27.481332 GiB/s : time 204.973 seconds
>>> [INVALID]
>>> [RESULT]    mdtest-easy-write       27.699574 kIOPS : time 1502.596 seconds
>>> [      ]            timestamp        0.000000 kIOPS : time 0.000 seconds
>>> [RESULT]       ior-hard-write        1.352186 GiB/s : time 38.273 seconds
>>> [INVALID]
>>> [RESULT]    mdtest-hard-write        3.024279 kIOPS : time 48.923 seconds
>>> [INVALID]
>>> [RESULT]                 find      777.440295 kIOPS : time 53.684 seconds
>>> [RESULT]        ior-easy-read       58.686272 GiB/s : time 95.992 seconds
>>> [RESULT]     mdtest-easy-stat      156.499256 kIOPS : time 266.755 seconds
>>> [RESULT]        ior-hard-read        4.095575 GiB/s : time 12.649 seconds
>>> [RESULT]     mdtest-hard-stat       62.831560 kIOPS : time 3.318 seconds
>>> [RESULT]   mdtest-easy-delete       25.909017 kIOPS : time 1606.960 seconds
>>> [RESULT]     mdtest-hard-read       16.586529 kIOPS : time 9.735 seconds
>>> [RESULT]   mdtest-hard-delete        9.093536 kIOPS : time 18.615 seconds
>>> [SCORE ] Bandwidth 9.721458 GiB/s : IOPS 35.464915 kiops : TOTAL 18.568002
>>> [INVALID]
>>> 
>>> 
>>>                       *** pool=ec-8-2 NP=360 ***
>>> IO500 version io500-sc22_v2 (standard)
>>> [RESULT]       ior-easy-write       40.480456 GiB/s : time 151.451 seconds
>>> [INVALID]
>>> [RESULT]    mdtest-easy-write       32.507690 kIOPS : time 444.424 seconds
>>> [      ]            timestamp        0.000000 kIOPS : time 0.000 seconds
>>> [RESULT]       ior-hard-write        0.570092 GiB/s : time 35.986 seconds
>>> [INVALID]
>>> [RESULT]    mdtest-hard-write        3.287144 kIOPS : time 40.114 seconds
>>> [INVALID]
>>> [RESULT]                 find     1779.068273 kIOPS : time 8.177 seconds
>>> [RESULT]        ior-easy-read       56.463968 GiB/s : time 108.661 seconds
>>> [RESULT]     mdtest-easy-stat      179.334380 kIOPS : time 81.380 seconds
>>> [RESULT]        ior-hard-read        1.957840 GiB/s : time 10.484 seconds
>>> [RESULT]     mdtest-hard-stat       92.430508 kIOPS : time 2.402 seconds
>>> [RESULT]   mdtest-easy-delete       29.549239 kIOPS : time 489.285 seconds
>>> [RESULT]     mdtest-hard-read       26.989114 kIOPS : time 5.770 seconds
>>> [RESULT]   mdtest-hard-delete       26.500674 kIOPS : time 6.038 seconds
>>> [SCORE ] Bandwidth 7.106974 GiB/s : IOPS 53.448254 kiops : TOTAL 19.489879
>>> [INVALID]
>>> ```
>>> 
>>> Best wishes,
>>> Manuel
>>> 
>>> [1] https://croit.io/blog/ceph-performance-test-and-optimization
>>> [2] https://io500.org/submissions/view/82
>>> [3] https://io500.org/submissions/view/141
>>> 
>> 
>> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



