Hi Mark and all,

The key point is to consider your users' write requirements: do your applications need to write concurrently to the same file from several CephFS mounts, or does each job write to a separate file? If your use case is predominantly the latter, you'll have a lot of success right now with CephFS out of the box, using the kernel client for near wire-speed throughput. But if your use case is the former, then read on...

Parallel writing to a single file -- exercised by the "hard" IO500 tests -- really benefits from CephFS's Lazy IO feature. Lazy IO is a hint to the MDS that the applications can be trusted to coordinate among themselves to maintain an individual file's consistency; without it, the MDS serializes writers to a shared file, which can cause significant slowdowns in parallel HPC applications. We've demonstrated real-world speedups in this area; see slide 26 onwards of https://hps.vi4io.org/_media/events/2019/hpc-iodc-cephfs-chiusole.pdf

Last year we patched ior to enable Lazy IO, which should eventually let CephFS achieve some really competitive IO500 results. But that test triggered a caps release issue (https://tracker.ceph.com/issues/44166) and we didn't progress further. We'll try it again with a recent client kernel to see if it works now. For anyone who wants to experiment in the meantime, a rough sketch of how the hint can be set through libcephfs is below.
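This is an untested sketch rather than our actual ior patch: the file path is a placeholder, the default ceph.conf/keyring on the node are assumed, and error handling is abbreviated.

    /* Untested sketch: enable Lazy IO on a file via libcephfs.
     * "/shared/output.dat" is just a placeholder path. */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <cephfs/libcephfs.h>

    int main(void)
    {
        struct ceph_mount_info *cmount;

        if (ceph_create(&cmount, NULL) < 0)      /* NULL = default client id */
            return 1;
        ceph_conf_read_file(cmount, NULL);       /* default /etc/ceph/ceph.conf */
        if (ceph_mount(cmount, "/") < 0)         /* mount the filesystem root */
            return 1;

        int fd = ceph_open(cmount, "/shared/output.dat", O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            return 1;

        /* The hint: the application promises to coordinate consistency itself,
         * so many clients can keep buffering I/O on this file concurrently. */
        if (ceph_lazyio(cmount, fd, 1) < 0)
            fprintf(stderr, "lazy io could not be enabled\n");

        const char buf[] = "rank-local data\n";
        ceph_write(cmount, fd, buf, strlen(buf), 0);  /* write at offset 0 */

        ceph_close(cmount, fd);
        ceph_unmount(cmount);
        ceph_release(cmount);
        return 0;
    }

If I remember correctly there are also ceph_lazyio_propagate() and ceph_lazyio_synchronize() calls for flushing and revalidating client caches at application-defined sync points, and the kernel client exposes the same hint through an ioctl, but I haven't re-checked those details recently.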
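And since Mark mentions MDS pinning below: both the manual round-robin pinning and the ephemeral variant are driven by virtual xattrs on directories, so they can be scripted from any client mount. Illustratively, something like the following -- the mount point, directory layout and rank count are made up, and the ephemeral pin needs a fairly recent release if I remember correctly.

    /* Illustrative only: pin job directories to MDS ranks by setting CephFS
     * virtual xattrs (the same thing `setfattr` does from the shell).
     * "/mnt/cephfs" and the job/scratch directories are made-up examples. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/xattr.h>

    #define NUM_MDS_RANKS 4   /* assumed number of active MDS ranks */

    int main(void)
    {
        char path[64], rank[8];

        /* Manual round-robin: pin job0..job7 across ranks 0..3. */
        for (int i = 0; i < 8; i++) {
            snprintf(path, sizeof(path), "/mnt/cephfs/job%d", i);
            snprintf(rank, sizeof(rank), "%d", i % NUM_MDS_RANKS);
            if (setxattr(path, "ceph.dir.pin", rank, strlen(rank), 0) < 0)
                perror(path);
        }

        /* The more automatic option: distributed ephemeral pinning, which
         * spreads the immediate children of one parent over the active ranks. */
        if (setxattr("/mnt/cephfs/scratch", "ceph.dir.pin.distributed",
                     "1", 1, 0) < 0)
            perror("/mnt/cephfs/scratch");

        return 0;
    }

Which of the two fits better depends, as Mark says, on how predictable your directory layout is.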
Anyway, now that CephFS is becoming more and more attractive for HPC installations, I hope this feature gets more visibility and user interest. Christof, Manuel, and anyone else needing it, please give Lazy IO a try and share your results. It would be great if we could solve the few remaining bugs and improve the user-friendliness of how it is enabled per mount/dir/file. IMHO this would really improve CephFS's position in the HPC area.

Cheers, Dan

On Wed, Jul 21, 2021 at 6:36 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> Hi Manuel,
>
> I was the one that did Red Hat's IO500 CephFS submission. Feel free to
> ask any questions you like. Generally speaking, I could achieve 3GB/s
> pretty easily per kernel client and up to about 8GB/s per client with
> libcephfs directly (up to the aggregate cluster limits, assuming enough
> concurrency). Metadata is trickier. The fastest option is if you have
> files spread across directories that you can manually pin round-robin to
> MDSes, though you can also do reasonably well with ephemeral pinning as a
> more automatic option. If you have lots of clients dealing with lots of
> files in a single directory, you fall back to dynamic subtree
> partitioning, which tends to be quite a bit slower (though at least some
> of this is due to journaling overhead on the auth MDS). That's
> especially true if you have a significant number of active/active MDS
> servers (say 10-20+). We tended to consistently do very well on the
> "easy" IO500 tests and struggled more with the "hard" tests. Otherwise
> most of the standard Ceph caveats apply: replication eats into write
> performance, scrub/deep scrub can impact performance, choosing the right
> NVMe drive with power loss protection and low overhead is important, etc.
>
> Probably the most important questions to ask yourself are how you intend
> to use the storage, what you need out of it, and what you need to do to
> get there. Ceph has a lot of advantages regarding replication,
> self-healing, and consistency, and it's quite fast for some workloads
> given those advantages. There are some workloads, though (say, unaligned
> small writes from hundreds of clients to random files in a single
> directory), that could potentially be pretty slow.
>
> Mark
>
> On 7/21/21 8:54 AM, Manuel Holtgrewe wrote:
> > Dear all,
> >
> > we are looking towards setting up an all-NVMe CephFS instance in our
> > high-performance compute system. Does anyone have any experience to
> > share with an HPC setup, or with an NVMe setup mounted by dozens of
> > nodes or more?
> >
> > I've followed the impressive work done at CERN on YouTube, but
> > otherwise there appear to be only a few places using CephFS this way.
> > There are a few CephFS-as-enterprise-storage vendors that sporadically
> > advertise CephFS for HPC, but it does not appear to be a strategic
> > main target for them.
> >
> > I'd be happy to read about your experience/opinion on CephFS for HPC.
> >
> > Best wishes,
> > Manuel

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx