Re: Using CephFS in High Performance (and Throughput) Compute Use Cases

Mark Nelson <mnelson@xxxxxxxxxx> · Thu, 22 Jul 2021 09:00:30 -0500

Hi Dan,

Ah, that's fantastic regarding IOR.  Have you tried the libcephfs 
backend?  That might be another route for easy testing (and at least on 
our previous test setup I saw higher large sequential IO throughput with 
it vs the kernel client).  Lazy IO is definitely worth it if you have an 
application that can safely make use of it.  It should definitely help 
with IOR hard, but I suspect we will still need to work on improving the 
other issues I mentioned for mdtest hard.

Definitely keep testing and giving feedback!  It's exciting to hear 
someone from CERN say CephFS is becoming more and more attractive for 
HPC!  :)

Mark

On 7/22/21 3:13 AM, Dan van der Ster wrote:
Hi Mark and all,

The key point is to consider your users' write requirements: do your
applications need to write concurrently to the same file from several
CephFS mounts? Or does each job write to a separate file?
If your use-case is predominantly the latter, you'll have a lot of
success right now with CephFS out of the box, using the kernel client
for near wire-speed throughputs.
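
As a rough back-of-the-envelope illustration of the separate-file case, the sketch below estimates aggregate throughput as per-client throughput times client count, capped by whatever the cluster can deliver. The per-client figures are the ones Mark quotes further down in this thread; the cluster limit and the function name are hypothetical placeholders, and real scaling will rarely be this clean.

```python
def aggregate_throughput_gbs(clients: int, per_client_gbs: float,
                             cluster_limit_gbs: float) -> float:
    """Estimate aggregate throughput in GB/s.

    Assumes clients scale linearly until the cluster-wide limit caps them.
    """
    if clients < 0:
        raise ValueError("clients must be non-negative")
    return min(clients * per_client_gbs, cluster_limit_gbs)


# Per-client figures quoted later in this thread.
KERNEL_CLIENT_GBS = 3.0
LIBCEPHFS_CLIENT_GBS = 8.0
# Hypothetical cluster-wide ceiling; measure your own.
CLUSTER_LIMIT_GBS = 40.0

if __name__ == "__main__":
    for n in (1, 4, 16):
        kernel = aggregate_throughput_gbs(n, KERNEL_CLIENT_GBS, CLUSTER_LIMIT_GBS)
        lib = aggregate_throughput_gbs(n, LIBCEPHFS_CLIENT_GBS, CLUSTER_LIMIT_GBS)
        print(f"{n:>2} clients: kernel ~{kernel} GB/s, libcephfs ~{lib} GB/s")
```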

But if your use-case is the former, then read on...

Parallel writing to a single file -- tested in those "hard" IO500
tests -- really benefits from Ceph's lazy io feature.
Lazy IO is a hint to the MDS that the applications can be trusted to
coordinate among themselves to maintain an individual file's
consistency; without it, CephFS only allows one writer at a time and
this can cause significant slowdowns in parallel HPC applications.
We've demonstrated real-world speedups in this area; see from slide
26: https://hps.vi4io.org/_media/events/2019/hpc-iodc-cephfs-chiusole.pdf
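
To make "coordinate among themselves" concrete, here is a minimal sketch of the kind of layout that makes lazy IO safe: each writer owns a disjoint, strided set of byte ranges in the shared file, so no two writers ever touch the same bytes. It runs against any local filesystem and simulates the writers one after another; the function names and sizes are made up for illustration. On CephFS the application would additionally have to enable lazy IO on the open file (libcephfs exposes `ceph_lazyio` for that), which this sketch does not do.

```python
import os
import tempfile


def block_offset(rank: int, block_index: int, n_writers: int, block_size: int) -> int:
    """Byte offset of the block_index-th block owned by `rank` in a strided layout."""
    return (block_index * n_writers + rank) * block_size


def write_strided(path: str, rank: int, n_writers: int,
                  block_size: int, n_blocks: int) -> None:
    """Write this rank's blocks at offsets that no other rank uses."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        payload = bytes([ord("A") + rank]) * block_size  # rank 0 -> b"A...", etc.
        for i in range(n_blocks):
            os.pwrite(fd, payload, block_offset(rank, i, n_writers, block_size))
    finally:
        os.close(fd)


def demo() -> bytes:
    """Simulate three writers sequentially and return the resulting file contents."""
    n_writers, block_size, n_blocks = 3, 4, 2
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shared.dat")
        for rank in range(n_writers):
            write_strided(path, rank, n_writers, block_size, n_blocks)
        with open(path, "rb") as f:
            return f.read()


if __name__ == "__main__":
    print(demo())
```

In a real job each rank would be a separate process on a separate mount (for example under MPI), and the disjoint offsets are what let the MDS stop serializing the writers.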

Last year we patched IOR to enable lazy io, which should eventually
let CephFS achieve some really competitive IO500 results. But that
triggered a caps release issue (https://tracker.ceph.com/issues/44166)
and we didn't progress further. We'll try that test again with a
recent client kernel to see if it works now.

Anyway, now that CephFS is becoming more and more attractive for HPC
installations, I hope this feature gets more visibility and user
interest. Christof, Manuel, and anyone else needing it, please give
Lazy IO a try and share your results.

It would be great if we could solve the few remaining bugs and
improve the user-friendliness of how it is enabled per
mount/dir/file. IMHO this would really improve CephFS's position in
the HPC area.

Cheers, Dan

On Wed, Jul 21, 2021 at 6:36 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
Hi Manuel,

I was the one that did Red Hat's IO500 CephFS submission.  Feel free to
ask any questions you like.  Generally speaking I could achieve 3GB/s
pretty easily per kernel client and up to about 8GB/s per client with
libcephfs directly (up to the aggregate cluster limits assuming enough
concurrency).  Metadata is trickier.  The fastest option is if you have
files spread across directories that you can manually pin round-robin to
MDSes, though you can do somewhat well with ephemeral pinning too as a
more automatic option.  If you have lots of clients dealing with lots of
files in a single directory, that's where you revert to dynamic subtree
partitioning which tends to be quite a bit slower (though at least some
of this is due to journaling overhead on the auth MDS).  That's
especially true if you have a significant number of active/active MDS
servers (say 10-20+).  We tended to consistently do very well with the
"easy" IO500 tests and struggled more with the "hard" tests.  Otherwise
most of the standard Ceph caveats apply.  Replication eats into write
performance, scrub/deep scrub can impact performance, choosing the right
NVMe drive with power loss protection and low overhead is important, etc.
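
For the manual round-robin pinning mentioned above, a minimal sketch might look like the following. It assigns directories to MDS ranks in rotation and sets the `ceph.dir.pin` extended attribute on each one. The helper names and the example paths are hypothetical, and `apply_pins` assumes a Linux client with the CephFS directories already mounted.

```python
import os
from typing import Dict, Iterable


def round_robin_pins(directories: Iterable[str], n_ranks: int) -> Dict[str, int]:
    """Map each directory to an MDS rank, cycling through ranks 0..n_ranks-1."""
    if n_ranks < 1:
        raise ValueError("need at least one active MDS rank")
    # Sort so the assignment is deterministic regardless of input order.
    return {d: i % n_ranks for i, d in enumerate(sorted(directories))}


def apply_pins(pins: Dict[str, int]) -> None:
    """Pin each directory by setting ceph.dir.pin (requires a CephFS mount)."""
    for path, rank in pins.items():
        os.setxattr(path, "ceph.dir.pin", str(rank).encode())


if __name__ == "__main__":
    # Hypothetical job directories under a hypothetical mount point.
    dirs = [f"/mnt/cephfs/jobs/job{i}" for i in range(5)]
    for path, rank in round_robin_pins(dirs, n_ranks=2).items():
        print(f"{path} -> mds rank {rank}")
    # apply_pins(...) is left out here because it needs a real CephFS mount.
```

The more automatic ephemeral variant is instead enabled once on a parent directory (via the `ceph.dir.pin.distributed` attribute), which avoids having to maintain the mapping by hand.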

Probably the most important questions you should be asking yourself are
how you intend to use the storage, what you need out of it, and what
you need to do to get there.  Ceph has a lot of advantages regarding
replication, self-healing, and consistency and it's quite fast for some
workloads given those advantages. There are some workloads though (say
unaligned small writes from hundreds of clients to random files in a
single directory) that potentially could be pretty slow.

Mark

On 7/21/21 8:54 AM, Manuel Holtgrewe wrote:
Dear all,

we are looking towards setting up an all-NVMe CephFS instance in our
high-performance compute system. Does anyone have any experience to share
in an HPC setup or an NVMe setup mounted by dozens of nodes or more?

I've followed the impressive work done at CERN on YouTube but otherwise
there appear to be only a few places using CephFS this way. There are a few
CephFS-as-enterprise-storage vendors that sporadically advertise CephFS
for HPC but it does not appear to be a strategic main target for them.

I'd be happy to read about your experience/opinion on CephFS for HPC.

Best wishes,
Manuel
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
