Hi Manuel! I'm the one that submitted the IO500 results for Red Hat. OK, so a couple of things. First, be aware that vendors are not required to use any form of replication whatsoever for IO500 submissions. Our results thus used 1x replication. :) But! 2x should only hurt your write numbers, and maybe not as much if you are front-end network limited. EC will likely hurt fairly badly.
Next, I used ephemeral pinning for the easy tests with something like 30-40 active/active MDS daemons. I tested up to 100 (10 per server!), but beyond about 30-40, contention starts becoming a really big problem. Ephemeral pinning helped improve performance on the easy tests but actually hurt on the hard tests, where we have to split and export dirfrags.
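In case it's useful, this is roughly the shape of that setup. A minimal sketch, assuming a filesystem named "cephfs" kernel-mounted at /mnt/cephfs and made-up directory names; adjust to your environment:

```
# Scale out the number of active MDS ranks (remaining daemons become standbys).
ceph fs set cephfs max_mds 32

# Distributed ephemeral pinning: immediate children of this directory are
# pseudorandomly pinned across the active ranks via a consistent hash.
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/io500

# Or pin an individual directory tree to a specific rank by hand.
setfattr -n ceph.dir.pin -v 3 /mnt/cephfs/some-dataset
```

The max_mds value and paths are placeholders; the point is that the pinning is a per-directory xattr, not a global switch.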
AMD (dual socket) nodes may be a bit more challenging. There was a presentation at the Ceph BOF at SC2021 by Andras Pataki from the Flatiron Institute about running Ceph on dual-socket AMD Rome setups with large numbers of NVMe drives. He explained that they were seeing performance variations when running all of the NVMe drives together and believes they tracked it down to PCIe scheduling issues. He noted that they could get good, consistent throughput to one device if the "Preferred I/O" BIOS setting was set, but at even further expense to all of the other devices in the system. I'm not sure if this is simply due to using some of the PCIe lanes for cross-socket communication or a bigger issue with the controllers.
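For anyone wanting to sanity-check how their drives land on a dual-socket box before fiddling with BIOS options, a quick look at sysfs (standard Linux paths, nothing Ceph-specific) shows which socket each NVMe controller hangs off:

```
# Which NUMA node is each NVMe controller attached to? (-1 means unknown)
for c in /sys/class/nvme/nvme*; do
    printf '%s: numa_node=%s\n' "$(basename "$c")" "$(cat "$c/device/numa_node")"
done

# CPU/NUMA layout of the node, for comparison.
lscpu | grep -i numa
```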
Finally, the results I got across many tests were fairly chaotic. When you have 20-30 active/active MDSes, the (hard) benchmark results end up dominated by how quickly dynamic subtree partitioning takes place, and it turns out that I could essentially DDOS the authoritative MDS for the parent directory, preventing dirfrag exports from happening at all. Even after the splits take place, there are performance oscillations that may stem from continued dirfrag splits (or something else!). I also saw that the primary MDS was extremely busy encoding subtrees for journal writes. I had an experimental PR floating around that pre-created and exported dirfrags based on an "expected size" directory xattr, but we never ended up merging it. Zheng also made a PR to remove the subtree map from journal updates, which may be a big win here too, but that also never merged.
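If someone wants to poke at the same behavior, the dirfrag split/balancer knobs can be changed at runtime. A sketch below; the values shown are just the defaults as I remember them, not a tuning recommendation, so check your release's docs:

```
# How many entries a dirfrag can hold before the MDS splits it.
ceph config set mds mds_bal_split_size 10000

# Hard cap on fragment size before new entries start being refused.
ceph config set mds mds_bal_fragment_size_max 100000

# How often (seconds) the balancer considers migrating subtrees between ranks.
ceph config set mds mds_bal_interval 10
```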
The other big limitation I saw was that we had extremely inconsistent client performance, and the IO500 penalizes you heavily for that. Some clients were taking nearly twice as long to complete their work as others, and this can really drag your score down.
Ultimately, the results that were sent to the IO500 were the best from something like 10 practice runs (and that was after something like 200 debugging runs), and there was pretty high variation between them. I was pretty happy with the numbers that we got, but if we could fix some of the issues encountered, I suspect that we could have gotten an overall score 2x-3x higher, and likely far more consistently.
Mark

On 4/13/22 04:56, Manuel Holtgrewe wrote:
Dear all,

I want to revive this old thread. I now have the following hardware at hand:

- 8 servers with
  - AMD 7413 processors (24 cores, 48 hw threads)
  - 260 GB of RAM
  - 10 NVMes each
  - dual 25GbE towards clients and
  - dual 100GbE for the cluster network

The 2x25GbE NICs go into a dedicated VLT switch pair which is connected upstream with an overall of 4x100GbE DAC, so the network path from the Ceph cluster to my clients is limited to about 40GB/sec in network bandwidth. I realize that the docs recommend against splitting frontend/cluster network, but I can change this later on.

My focus is not on having multiple programs writing to the same file, so lazy I/O is not that interesting for me, I believe. I would be happy to later run a patched version of io500, though.

I found [1] from croit to be quite useful. The vendor gives 3.5GB/sec sequential write and 7GB/sec sequential read per disk (which I usually read as "it won't get faster, but real-world performance is a different pair of shoes"). So the theoretical maximum per node would be 35GB/sec write and 70GB/sec read, which I will never achieve, but also the 2x25GbE network should not be saturated immediately either.

I've done some fio benchmarks of the NVMes and get ~2GB/sec write performance when run like this for NP=16 threads. I have attached a copy of the output.

```
# fio --filename=/dev/nvme0n1 --direct=1 --fsync=1 --rw=write --bs=4k --numjobs=$NP \
    --iodepth=1 --runtime=60 --time_based --group_reporting --name=4k-sync-write-$NP
```

I'm running Ceph 16.2.7 deployed via cephadm. Client and server run on Rocky Linux 8; clients currently run kernel 4.18.0-348.7.1.el8_5.x86_64 while the servers run 4.18.0-348.2.1.el8_5.x86_64.

I changed the following configuration from their defaults:

- osd_memory_target osd:16187498659
- mds_cache_memory_limit global:12884901888

I created three pools (3rep, 2rep, and ec(6,2)) and ran the fio benchmark with the rbd backend. Performance degraded with multiple threads; the results for one thread are attached. In short:

- 3-rep WRITE: bw=11.0MiB/s (12.5MB/s), 11.0MiB/s-11.0MiB/s (12.5MB/s-12.5MB/s), io=718MiB (753MB), run=60001-60001msec
- 2-rep WRITE: bw=11.4MiB/s (11.0MB/s), 11.4MiB/s-11.4MiB/s (11.0MB/s-11.0MB/s), io=686MiB (719MB), run=60001-60001msec
- ec(6,2) WRITE: bw=5825KiB/s (5965kB/s), 5825KiB/s-5825KiB/s (5965kB/s-5965kB/s), io=341MiB (358MB), run=60001-60001msec

I then attempted to run the IO500 benchmark with the easy tests only for now and using stonewalltime=30. I'm currently running 14 MDS with 2 standby and one OSD per NVMe.
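(As an aside for readers following along: with cephadm, the two overrides above would typically land in the centralized config store, roughly as below. The values are simply the ones quoted in the mail, not a recommendation.)

```
# Apply the overrides from the mail via the central config store.
ceph config set osd    osd_memory_target       16187498659
ceph config set global mds_cache_memory_limit  12884901888

# Confirm what the daemons will actually pick up.
ceph config get osd osd_memory_target
ceph config get mds mds_cache_memory_limit
```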
1 thread on one client node:
[RESULT] ior-easy-write 1.970705 GiB/s : time 36.122 seconds [INVALID]
[RESULT] ior-easy-read 1.604711 GiB/s : time 44.315 seconds

2 threads on 1 client node:
[RESULT] ior-easy-write 2.381101 GiB/s : time 57.877 seconds [INVALID]
[RESULT] ior-easy-read 2.863261 GiB/s : time 48.132 seconds

4 threads on 1 client node:
[RESULT] ior-easy-write 2.565519 GiB/s : time 74.096 seconds [INVALID]
[RESULT] ior-easy-read 4.076592 GiB/s : time 46.630 seconds

10 threads on 1 client node:
[RESULT] ior-easy-write 2.627818 GiB/s : time 75.594 seconds [INVALID]
[RESULT] ior-easy-read 4.628728 GiB/s : time 42.913 seconds

20 threads on 2 client nodes (10 threads each):
[RESULT] ior-easy-write 5.075989 GiB/s : time 80.186 seconds [INVALID]
[RESULT] ior-easy-read 8.605426 GiB/s : time 47.299 seconds

40 threads on 4 client nodes (10 threads each):
[RESULT] ior-easy-write 10.108431 GiB/s : time 81.134 seconds [INVALID]
[RESULT] ior-easy-read 10.845139 GiB/s : time 75.460 seconds

80 threads on 8 client nodes (10 threads each):
[RESULT] ior-easy-write 10.378105 GiB/s : time 128.358 seconds [INVALID]
[RESULT] ior-easy-read 10.898354 GiB/s : time 122.212 seconds

So it looks like we can saturate a single client node for writing with ~4 threads already right now. What could good next steps be? I'd try the following in the given order, but maybe I'm missing something basic in terms of settings (sysctl settings are also attached).

- I could try to increase the number of OSDs per physical disk and try out 2, 4, 8, or even 16 OSDs per NVMe, as my fio benchmarks indicate that I even get speedups when going from 8 to 16 threads in fio (a sketch of a cephadm OSD spec for this follows below).
- I could switch to using the 2x100GbE only, but 10GB/sec does not feel like the network is the bottleneck.
- I could try to increase the MDS count, but that should only affect the metadata operations at this point, I believe.
- Probably it is not worth looking into lazy I/O for the ior-easy-* benchmarks.

I could try to tinker around with the BIOS settings some more, in particular in the area of the AMD NUMA domains... but my gut feeling tells me that this influence should be <<10% and I'd like to see a much bigger improvement.

With 80 NVMe disks, I should be limited to 160GB/sec from the disks. The 3-rep pool will need to write everything three times, so the limit is at most ~53GB/sec (which is above the theoretical bandwidth of the network path).

Does anyone else have a similar build for CephFS and can tell me what I can hope for? Naively, I would expect my disks to saturate the network. Also, considering the IO500 entry for Ceph from 2020 [2], I would expect better performance. They got a write bandwidth of 36GB/sec and read bandwidth of 75GB/sec with 10 server nodes and 8 disks each, if I understand correctly. I assume they used 2-rep, though. They used 10 client nodes with 32 threads each...

Best wishes,
Manuel

[1] https://croit.io/blog/ceph-performance-test-and-optimization
[2] https://io500.org/submissions/view/82
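(As a rough illustration of the first idea in the list above, multiple OSDs per NVMe: a cephadm drive-group sketch. The service id and the choice of 4 OSDs per device are made up for the example, not taken from the mail.)

```
# Hypothetical OSD spec carving each non-rotational device into 4 OSDs.
cat > osd-split.yaml <<'EOF'
service_type: osd
service_id: nvme-split
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0
  osds_per_device: 4
EOF

# Preview what cephadm would do before actually applying it.
ceph orch apply -i osd-split.yaml --dry-run
```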
On Tue, Jul 27, 2021 at 7:21 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:

Hi Manuel,

I will reply inline below.

On 7/27/21 11:10 AM, Manuel Holtgrewe wrote:

> Hi, thank you very much for the detailed response. We are in the area of life sciences, so our workloads are a mixed bag of high-throughput computing. I'm seeing the following in our current storage system and have some questions connected to this:
>
> 1. All users like to install software with conda, which means I have hundreds/thousands of Linux File Hierarchy Structure directories with small files, but there are not too many files here and these directory trees are created once and then only read before they are deleted.

So this sounds like a pretty good candidate for ephemeral pinning. The idea is that when a new directory is created it is pseudorandomly assigned to an MDS. This works well if you have lots of directories, so work can be spread across MDSes without having to do a bunch of work figuring out how to split directories and distribute the work on the fly.

> 2. Users have many large files that are written sequentially and users then perform something like an 80/20% mix of sequential/random access reads on them.

This generally should be pretty good as long as it's not something like small unaligned sequential sync writes. We were seeing that kind of pattern with etcd's journal and we weren't nearly as fast as something like a local NVMe drive (but I don't think many other distributed file systems are likely to do that well either).

> 3. Other users like to have directory trees with many small files (the human genome has about 20k genes and they like to create one file per gene). I can probably teach them to create subdirectories with the first two characters of each file name to cut down on these huge directories.

So the gist of it is that you probably aren't going to see more than about 10-20K IOPS per MDS server. IE if you put those 20K files in one directory with ephemeral pinning, you aren't going to get more performance for that single dataset. If you use dynamic subtree partitioning you can theoretically do better because CephFS will try to distribute the work to other MDSes, but in practice this might not work well, especially if you have a really huge directory.

> 3.1 Is there any guidance on what number of entries per directory is still OK and when it starts being problematic?

So unfortunately it's really complicated. If you are using ephemeral pinning I suspect it's larger, but everything in the directory will basically always be served by a single MDS (the benefit of which is that there's no background work going on to try to move workload around). With dynamic subtree partitioning it seems to be a function of how many subtrees you have, how many MDSes, how likely it is to hit lock contention, how many files, and various other factors.

> 3.2 What's the problem with many writes to random files in one directory?

I think this was more about lots of writers all writing to files in the same directory, especially when using dynamic subtree partitioning and having many subtrees causing journaling to slow down on the auth MDS.

> 4. Still other users have large HDF5 files that they like to perform random reads within.
> 5. Still other users like to have many small files and randomly read some of them.
> 6. Almost nobody uses MPI and almost nobody writes to the same file at the same time.

Sounds like a pretty standard distribution of use cases.
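(As a throwaway illustration of the prefix-sharding idea in point 3 above; the paths and the two-character scheme are hypothetical and not from the original mail.)

```
# Re-layout a flat one-file-per-gene directory into two-character prefix buckets,
# e.g. /data/genes/BRCA1.txt -> /data/genes/BR/BRCA1.txt
for f in /data/genes/*; do
    [ -f "$f" ] || continue          # skip anything that is not a regular file
    b=$(basename "$f")
    d="/data/genes/${b:0:2}"         # assumes names are at least two characters
    mkdir -p "$d"
    mv "$f" "$d/"
done
```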
> We would not run the system in a "hyperconverged" way but rather build a separate Ceph cluster from them. If I understand correctly, it looks like CephFS would be a good fit, and except for problems with directories having (too) many files we would be within the comfort zone of CephFS. If I also understand correctly, the optimizations that people point to as experimental/fringe, such as lazy I/O, will not be of much help with our use case.

My personal take is that for standard large read/write workloads CephFS should do pretty well. I think I was getting something like 60-70GB/s reads across our test performance cluster (but at least for me I was limited to about 3GB/s per client when using kernel cephfs, 8GB/s per client when using libcephfs directly). If you can make good use of ephemeral pinning you can scale aggregate small random performance alright (~10-20K IOPS per MDS; I was getting something like ~100-200K aggregate in my test setup with ephemeral pinning). The cases where the entire workload is in one directory or where you have many clients all reading/writing to a single file are where we struggle imho. I saw much more volatile performance in those tests.

> What do you mean when you are referring to "power loss protection"? I assume that this "only" protects against data corruption and does not bring advantages in terms of performance. We're looking at buying enterprise-grade NVMes in any case. I just checked and the options we are following all have that feature.

A side benefit of many power loss protection implementations is that the drive has enough juice to basically ignore cache flushes and just sync writes to flash when it's optimal for the drive. IE if you do a sync write at the block level, the kernel will tell the drive to do something like an ATA_CMD_FLUSH (or whatever the equivalent NVMe command is). With proper PLP, it can just ignore that and keep it in cache, because if the drive loses power it can still sync it to disk after the fact. Without power loss protection, for the drive to be safe, it actually has to write the data out to the flash cell.

> Thank you again and best wishes,
> Manuel

On Wed, Jul 21, 2021 at 6:36 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:

Hi Manuel,

I was the one that did Red Hat's IO500 CephFS submission. Feel free to ask any questions you like. Generally speaking I could achieve 3GB/s pretty easily per kernel client and up to about 8GB/s per client with libcephfs directly (up to the aggregate cluster limits assuming enough concurrency).

Metadata is trickier. The fastest option is if you have files spread across directories that you can manually pin round-robin to MDSes, though you can do somewhat well with ephemeral pinning too as a more automatic option. If you have lots of clients dealing with lots of files in a single directory, that's where you revert to dynamic subtree partitioning, which tends to be quite a bit slower (though at least some of this is due to journaling overhead on the auth MDS). That's especially true if you have a significant number of active/active MDS servers (say 10-20+). We tended to consistently do very well with the "easy" IO500 tests and struggled more with the "hard" tests.

Otherwise most of the standard Ceph caveats apply. Replication eats into write performance, scrub/deep scrub can impact performance, choosing the right NVMe drive with power loss protection and low overhead is important, etc. Probably the most important questions you should be asking yourself are how you intend to use the storage, what you need out of it, and what you need to do to get there. Ceph has a lot of advantages regarding replication, self-healing, and consistency, and it's quite fast for some workloads given those advantages.
There are some workloads though (say unaligned small writes from hundreds of clients to random files in a single directory) that potentially could be pretty slow.

Mark

On 7/21/21 8:54 AM, Manuel Holtgrewe wrote:
> Dear all,
>
> we are looking towards setting up an all-NVMe CephFS instance in our
> high-performance compute system. Does anyone have any experience to share
> in an HPC setup or an NVMe setup mounted by dozens of nodes or more?
>
> I've followed the impressive work done at CERN on YouTube but otherwise
> there appear to be only few places using CephFS this way. There are a few
> CephFS-as-enterprise-storage vendors that sporadically advertise CephFS
> for HPC but it does not appear to be a strategic main target for them.
>
> I'd be happy to read about your experience/opinion on CephFS for HPC.
>
> Best wishes,
> Manuel

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx