Hi Mark and all,

The key point is to consider your users' write requirements: do your applications need to write concurrently to the same file from several CephFS mounts, or does each job write to a separate file? If your use case is predominantly the latter, you'll have a lot of success right now with CephFS out of the box, using the kernel client for near wire-speed throughput. But if your use case is the former, then read on...

Parallel writing to a single file -- exercised by the "hard" IO500 tests -- really benefits from CephFS's Lazy IO feature. Lazy IO is a hint to the MDS that the applications can be trusted to coordinate among themselves to maintain an individual file's consistency; without it, the MDS serializes writers to a shared file, which can cause significant slowdowns in parallel HPC applications. We've demonstrated real-world speedups in this area; see slide 26 onwards of https://hps.vi4io.org/_media/events/2019/hpc-iodc-cephfs-chiusole.pdf

Last year we patched ior to enable Lazy IO, which should eventually let CephFS achieve some really competitive IO500 results. But that test triggered a caps release issue (https://tracker.ceph.com/issues/44166) and we didn't progress further. We'll try it again with a recent client kernel to see if it works now. For anyone who wants to experiment in the meantime, a rough sketch of how the hint can be set through libcephfs is below.
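This is an untested sketch rather than our actual ior patch: the file path is a placeholder, the default ceph.conf/keyring on the node are assumed, and error handling is abbreviated.

    /* Untested sketch: enable Lazy IO on a file via libcephfs.
     * "/shared/output.dat" is just a placeholder path. */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <cephfs/libcephfs.h>

    int main(void)
    {
        struct ceph_mount_info *cmount;

        if (ceph_create(&cmount, NULL) < 0)      /* NULL = default client id */
            return 1;
        ceph_conf_read_file(cmount, NULL);       /* default /etc/ceph/ceph.conf */
        if (ceph_mount(cmount, "/") < 0)         /* mount the filesystem root */
            return 1;

        int fd = ceph_open(cmount, "/shared/output.dat", O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            return 1;

        /* The hint: the application promises to coordinate consistency itself,
         * so many clients can keep buffering I/O on this file concurrently. */
        if (ceph_lazyio(cmount, fd, 1) < 0)
            fprintf(stderr, "lazy io could not be enabled\n");

        const char buf[] = "rank-local data\n";
        ceph_write(cmount, fd, buf, strlen(buf), 0);  /* write at offset 0 */

        ceph_close(cmount, fd);
        ceph_unmount(cmount);
        ceph_release(cmount);
        return 0;
    }

If I remember correctly there are also ceph_lazyio_propagate() and ceph_lazyio_synchronize() calls for flushing and revalidating client caches at application-defined sync points, and the kernel client exposes the same hint through an ioctl, but I haven't re-checked those details recently.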
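And since Mark mentions MDS pinning below: both the manual round-robin pinning and the ephemeral variant are driven by virtual xattrs on directories, so they can be scripted from any client mount. Illustratively, something like the following -- the mount point, directory layout and rank count are made up, and the ephemeral pin needs a fairly recent release if I remember correctly.

    /* Illustrative only: pin job directories to MDS ranks by setting CephFS
     * virtual xattrs (the same thing `setfattr` does from the shell).
     * "/mnt/cephfs" and the job/scratch directories are made-up examples. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/xattr.h>

    #define NUM_MDS_RANKS 4   /* assumed number of active MDS ranks */

    int main(void)
    {
        char path[64], rank[8];

        /* Manual round-robin: pin job0..job7 across ranks 0..3. */
        for (int i = 0; i < 8; i++) {
            snprintf(path, sizeof(path), "/mnt/cephfs/job%d", i);
            snprintf(rank, sizeof(rank), "%d", i % NUM_MDS_RANKS);
            if (setxattr(path, "ceph.dir.pin", rank, strlen(rank), 0) < 0)
                perror(path);
        }

        /* The more automatic option: distributed ephemeral pinning, which
         * spreads the immediate children of one parent over the active ranks. */
        if (setxattr("/mnt/cephfs/scratch", "ceph.dir.pin.distributed",
                     "1", 1, 0) < 0)
            perror("/mnt/cephfs/scratch");

        return 0;
    }

Which of the two fits better depends, as Mark says, on how predictable your directory layout is.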
Anyway, now that CephFS is becoming more and more attractive for HPC installations, I hope this feature gets more visibility and user interest. Christof, Manuel, and anyone else needing it, please give Lazy IO a try and share your results. It would be great if we could solve the few remaining bugs and improve the user-friendliness of how it is enabled per mount/dir/file. IMHO this would really improve CephFS's position in the HPC area.

Cheers, Dan

On Wed, Jul 21, 2021 at 6:36 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> Hi Manuel,
>
> I was the one that did Red Hat's IO500 CephFS submission. Feel free to
> ask any questions you like. Generally speaking, I could achieve 3GB/s
> pretty easily per kernel client and up to about 8GB/s per client with
> libcephfs directly (up to the aggregate cluster limits, assuming enough
> concurrency). Metadata is trickier. The fastest option is if you have
> files spread across directories that you can manually pin round-robin to
> MDSes, though you can also do reasonably well with ephemeral pinning as a
> more automatic option. If you have lots of clients dealing with lots of
> files in a single directory, you fall back to dynamic subtree
> partitioning, which tends to be quite a bit slower (though at least some
> of this is due to journaling overhead on the auth MDS). That's
> especially true if you have a significant number of active/active MDS
> servers (say 10-20+). We tended to consistently do very well on the
> "easy" IO500 tests and struggled more with the "hard" tests. Otherwise
> most of the standard Ceph caveats apply: replication eats into write
> performance, scrub/deep scrub can impact performance, choosing the right
> NVMe drive with power loss protection and low overhead is important, etc.
>
> Probably the most important questions to ask yourself are how you intend
> to use the storage, what you need out of it, and what you need to do to
> get there. Ceph has a lot of advantages regarding replication,
> self-healing, and consistency, and it's quite fast for some workloads
> given those advantages. There are some workloads, though (say, unaligned
> small writes from hundreds of clients to random files in a single
> directory), that could potentially be pretty slow.
>
> Mark
>
> On 7/21/21 8:54 AM, Manuel Holtgrewe wrote:
> > Dear all,
> >
> > we are looking towards setting up an all-NVMe CephFS instance in our
> > high-performance compute system. Does anyone have any experience to
> > share with an HPC setup, or with an NVMe setup mounted by dozens of
> > nodes or more?
> >
> > I've followed the impressive work done at CERN on YouTube, but
> > otherwise there appear to be only a few places using CephFS this way.
> > There are a few CephFS-as-enterprise-storage vendors that sporadically
> > advertise CephFS for HPC, but it does not appear to be a strategic
> > main target for them.
> >
> > I'd be happy to read about your experience/opinion on CephFS for HPC.
> >
> > Best wishes,
> > Manuel

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx