Hi Manuel! I'm the one that submitted the IO500 results for Red Hat. OK, so a couple of things. First, be aware that vendors are not required to use any form of replication whatsoever for IO500 submissions. Our results thus used 1x replication. :) But! 2x should only hurt your write numbers, and maybe not as much if you are front-end network limited. EC will likely hurt fairly badly.
Next, I used ephemeral pinning for the easy tests with something like 30-40 active/active MDS daemons. I tested up to 100 (10 per server!), but beyond about 30-40, contention starts becoming a really big problem. Ephemeral pinning helped improve performance on the easy tests but actually hurt on the hard tests, where we have to split and export dirfrags.
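In case it's useful, this is roughly the shape of that setup. A minimal sketch, assuming a filesystem named "cephfs" kernel-mounted at /mnt/cephfs and made-up directory names; adjust to your environment:

```
# Scale out the number of active MDS ranks (remaining daemons become standbys).
ceph fs set cephfs max_mds 32

# Distributed ephemeral pinning: immediate children of this directory are
# pseudorandomly pinned across the active ranks via a consistent hash.
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/io500

# Or pin an individual directory tree to a specific rank by hand.
setfattr -n ceph.dir.pin -v 3 /mnt/cephfs/some-dataset
```

The max_mds value and paths are placeholders; the point is that the pinning is a per-directory xattr, not a global switch.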
AMD (dual socket) nodes may be a bit more challenging. There was a presentation at the Ceph BOF at SC2021 by Andras Pataki from the Flatiron Institute about running Ceph on dual-socket AMD Rome setups with large numbers of NVMe drives. He explained that they were seeing performance variations when running all of the NVMe drives together and believes they tracked it down to PCIe scheduling issues. He noted that they could get good, consistent throughput to one device if the "Preferred I/O" BIOS setting was set, but at even further expense to all of the other devices in the system. I'm not sure if this is simply due to using some of the PCIe lanes for cross-socket communication or a bigger issue with the controllers.
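For anyone wanting to sanity-check how their drives land on a dual-socket box before fiddling with BIOS options, a quick look at sysfs (standard Linux paths, nothing Ceph-specific) shows which socket each NVMe controller hangs off:

```
# Which NUMA node is each NVMe controller attached to? (-1 means unknown)
for c in /sys/class/nvme/nvme*; do
    printf '%s: numa_node=%s\n' "$(basename "$c")" "$(cat "$c/device/numa_node")"
done

# CPU/NUMA layout of the node, for comparison.
lscpu | grep -i numa
```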
Finally, the results I got across many tests were fairly chaotic. When you have 20-30 active/active MDSes, the (hard) benchmark results end up dominated by how quickly dynamic subtree partitioning takes place, and it turns out that I could essentially DDOS the authoritative MDS for the parent directory, preventing dirfrag exports from happening at all. Even after the splits take place, there are performance oscillations that may stem from continued dirfrag splits (or something else!). I also saw that the primary MDS was extremely busy encoding subtrees for journal writes. I had an experimental PR floating around that pre-created and exported dirfrags based on an "expected size" directory xattr, but we never ended up merging it. Zheng also made a PR to remove the subtree map from journal updates, which may be a big win here too, but that also never merged.
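If someone wants to poke at the same behavior, the dirfrag split/balancer knobs can be changed at runtime. A sketch below; the values shown are just the defaults as I remember them, not a tuning recommendation, so check your release's docs:

```
# How many entries a dirfrag can hold before the MDS splits it.
ceph config set mds mds_bal_split_size 10000

# Hard cap on fragment size before new entries start being refused.
ceph config set mds mds_bal_fragment_size_max 100000

# How often (seconds) the balancer considers migrating subtrees between ranks.
ceph config set mds mds_bal_interval 10
```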
The other big limitation I saw was that we had extremely inconsistent client performance, and the IO500 penalizes you heavily for that. Some clients were taking nearly twice as long to complete their work as others, and this can really drag your score down.
Ultimately, the results that were sent to the IO500 were the best from something like 10 practice runs (and that was after something like 200 debugging runs), and there was pretty high variation between them. I was pretty happy with the numbers that we got, but if we could fix some of the issues encountered, I suspect that we could have gotten an overall score 2x-3x higher, and likely far more consistently.
Mark

On 4/13/22 04:56, Manuel Holtgrewe wrote:
Dear all,

I want to revive this old thread. I now have the following hardware at hand:

- 8 servers with
  - AMD 7413 processors (24 cores, 48 hw threads)
  - 260 GB of RAM
  - 10 NVMes each
  - dual 25GbE towards clients and
  - dual 100GbE for the cluster network

The 2x25GbE NICs go into a dedicated VLT switch pair which is connected upstream with an overall of 4x100GbE DAC, so the network path from the Ceph cluster to my clients is limited to about 40GB/sec in network bandwidth. I realize that the docs recommend against splitting frontend/cluster network, but I can change this later on.

My focus is not on having multiple programs writing to the same file, so lazy I/O is not that interesting for me, I believe. I would be happy to later run a patched version of io500, though.

I found [1] from croit to be quite useful. The vendor gives 3.5GB/sec sequential write and 7GB/sec sequential read per disk (which I usually read as "it won't get faster, but real-world performance is a different pair of shoes"). So the theoretical maximum per node would be 35GB/sec write and 70GB/sec read, which I will never achieve, but also the 2x25GbE network should not be saturated immediately either.

I've done some fio benchmarks of the NVMes and get ~2GB/sec write performance when run like this for NP=16 threads. I have attached a copy of the output.

```
# fio --filename=/dev/nvme0n1 --direct=1 --fsync=1 --rw=write --bs=4k --numjobs=$NP \
    --iodepth=1 --runtime=60 --time_based --group_reporting --name=4k-sync-write-$NP
```

I'm running Ceph 16.2.7 deployed via cephadm. Client and server run on Rocky Linux 8; clients currently run kernel 4.18.0-348.7.1.el8_5.x86_64 while the servers run 4.18.0-348.2.1.el8_5.x86_64.

I changed the following configuration from their defaults:

- osd_memory_target osd:16187498659
- mds_cache_memory_limit global:12884901888

I created three pools (3rep, 2rep, and ec(6,2)) and ran the fio benchmark with the rbd backend. Performance degraded with multiple threads; the results for one thread are attached. In short:

- 3-rep WRITE: bw=11.0MiB/s (12.5MB/s), 11.0MiB/s-11.0MiB/s (12.5MB/s-12.5MB/s), io=718MiB (753MB), run=60001-60001msec
- 2-rep WRITE: bw=11.4MiB/s (11.0MB/s), 11.4MiB/s-11.4MiB/s (11.0MB/s-11.0MB/s), io=686MiB (719MB), run=60001-60001msec
- ec(6,2) WRITE: bw=5825KiB/s (5965kB/s), 5825KiB/s-5825KiB/s (5965kB/s-5965kB/s), io=341MiB (358MB), run=60001-60001msec

I then attempted to run the IO500 benchmark with the easy tests only for now and using stonewalltime=30. I'm currently running 14 MDS with 2 standby and one OSD per NVMe.
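(As an aside for readers following along: with cephadm, the two overrides above would typically land in the centralized config store, roughly as below. The values are simply the ones quoted in the mail, not a recommendation.)

```
# Apply the overrides from the mail via the central config store.
ceph config set osd    osd_memory_target       16187498659
ceph config set global mds_cache_memory_limit  12884901888

# Confirm what the daemons will actually pick up.
ceph config get osd osd_memory_target
ceph config get mds mds_cache_memory_limit
```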
1 thread on one client node:
[RESULT] ior-easy-write 1.970705 GiB/s : time 36.122 seconds [INVALID]
[RESULT] ior-easy-read 1.604711 GiB/s : time 44.315 seconds

2 threads on 1 client node:
[RESULT] ior-easy-write 2.381101 GiB/s : time 57.877 seconds [INVALID]
[RESULT] ior-easy-read 2.863261 GiB/s : time 48.132 seconds

4 threads on 1 client node:
[RESULT] ior-easy-write 2.565519 GiB/s : time 74.096 seconds [INVALID]
[RESULT] ior-easy-read 4.076592 GiB/s : time 46.630 seconds

10 threads on 1 client node:
[RESULT] ior-easy-write 2.627818 GiB/s : time 75.594 seconds [INVALID]
[RESULT] ior-easy-read 4.628728 GiB/s : time 42.913 seconds

20 threads on 2 client nodes (10 threads each):
[RESULT] ior-easy-write 5.075989 GiB/s : time 80.186 seconds [INVALID]
[RESULT] ior-easy-read 8.605426 GiB/s : time 47.299 seconds

40 threads on 4 client nodes (10 threads each):
[RESULT] ior-easy-write 10.108431 GiB/s : time 81.134 seconds [INVALID]
[RESULT] ior-easy-read 10.845139 GiB/s : time 75.460 seconds

80 threads on 8 client nodes (10 threads each):
[RESULT] ior-easy-write 10.378105 GiB/s : time 128.358 seconds [INVALID]
[RESULT] ior-easy-read 10.898354 GiB/s : time 122.212 seconds

So it looks like we can saturate a single client node for writing with ~4 threads already right now. What could good next steps be? I'd try the following in the given order, but maybe I'm missing something basic in terms of settings (sysctl settings are also attached).

- I could try to increase the number of OSDs per physical disk and try out 2, 4, 8, or even 16 OSDs per NVMe, as my fio benchmarks indicate that I even get speedups when going from 8 to 16 threads in fio (a sketch of a cephadm OSD spec for this follows below).
- I could switch to using the 2x100GbE only, but 10GB/sec does not feel like the network is the bottleneck.
- I could try to increase the MDS count, but that should only affect the metadata operations at this point, I believe.
- Probably it is not worth looking into lazy I/O for the ior-easy-* benchmarks.

I could try to tinker around with the BIOS settings some more, in particular in the area of the AMD NUMA domains... but my gut feeling tells me that this influence should be <<10% and I'd like to see a much bigger improvement.

With 80 NVMe disks, I should be limited to 160GB/sec from the disks. The 3-rep pool will need to write everything three times, so the limit is at most ~53GB/sec (which is above the theoretical bandwidth of the network path).

Does anyone else have a similar build for CephFS and can tell me what I can hope for? Naively, I would expect my disks to saturate the network. Also, considering the IO500 entry for Ceph from 2020 [2], I would expect better performance. They got a write bandwidth of 36GB/sec and read bandwidth of 75GB/sec with 10 server nodes and 8 disks each, if I understand correctly. I assume they used 2-rep, though. They used 10 client nodes with 32 threads each...

Best wishes,
Manuel

[1] https://croit.io/blog/ceph-performance-test-and-optimization
[2] https://io500.org/submissions/view/82
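(As a rough illustration of the first idea in the list above, multiple OSDs per NVMe: a cephadm drive-group sketch. The service id and the choice of 4 OSDs per device are made up for the example, not taken from the mail.)

```
# Hypothetical OSD spec carving each non-rotational device into 4 OSDs.
cat > osd-split.yaml <<'EOF'
service_type: osd
service_id: nvme-split
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0
  osds_per_device: 4
EOF

# Preview what cephadm would do before actually applying it.
ceph orch apply -i osd-split.yaml --dry-run
```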
On Tue, Jul 27, 2021 at 7:21 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:

Hi Manuel,

I will reply inline below.

On 7/27/21 11:10 AM, Manuel Holtgrewe wrote:

> Hi, thank you very much for the detailed response. We are in the area of life sciences, so our workloads are a mixed bag of high-throughput computing. I'm seeing the following in our current storage system and have some questions connected to this:
>
> 1. All users like to install software with conda, which means I have hundreds/thousands of Linux File Hierarchy Structure directories with small files, but there are not too many files here and these directory trees are created once and then only read before they are deleted.

So this sounds like a pretty good candidate for ephemeral pinning. The idea is that when a new directory is created it is pseudorandomly assigned to an MDS. This works well if you have lots of directories, so work can be spread across MDSes without having to do a bunch of work figuring out how to split directories and distribute the work on the fly.

> 2. Users have many large files that are written sequentially and users then perform something like an 80/20% mix of sequential/random access reads on them.

This generally should be pretty good as long as it's not something like small unaligned sequential sync writes. We were seeing that kind of pattern with etcd's journal and we weren't nearly as fast as something like a local NVMe drive (but I don't think many other distributed file systems are likely to do that well either).

> 3. Other users like to have directory trees with many small files (the human genome has about 20k genes and they like to create one file per gene). I can probably teach them to create subdirectories with the first two characters of each file name to cut down on these huge directories.

So the gist of it is that you probably aren't going to see more than about 10-20K IOPS per MDS server. IE if you put those 20K files in one directory with ephemeral pinning, you aren't going to get more performance for that single dataset. If you use dynamic subtree partitioning you can theoretically do better because CephFS will try to distribute the work to other MDSes, but in practice this might not work well, especially if you have a really huge directory.

> 3.1 Is there any guidance on what number of entries per directory is still OK and when it starts being problematic?

So unfortunately it's really complicated. If you are using ephemeral pinning I suspect it's larger, but everything in the directory will basically always be served by a single MDS (the benefit of which is that there's no background work going on to try to move workload around). With dynamic subtree partitioning it seems to be a function of how many subtrees you have, how many MDSes, how likely it is to hit lock contention, how many files, and various other factors.

> 3.2 What's the problem with many writes to random files in one directory?

I think this was more about lots of writers all writing to files in the same directory, especially when using dynamic subtree partitioning and having many subtrees causing journaling to slow down on the auth MDS.

> 4. Still other users have large HDF5 files that they like to perform random reads within.
> 5. Still other users like to have many small files and randomly read some of them.
> 6. Almost nobody uses MPI and almost nobody writes to the same file at the same time.

Sounds like a pretty standard distribution of use cases.
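(As a throwaway illustration of the prefix-sharding idea in point 3 above; the paths and the two-character scheme are hypothetical and not from the original mail.)

```
# Re-layout a flat one-file-per-gene directory into two-character prefix buckets,
# e.g. /data/genes/BRCA1.txt -> /data/genes/BR/BRCA1.txt
for f in /data/genes/*; do
    [ -f "$f" ] || continue          # skip anything that is not a regular file
    b=$(basename "$f")
    d="/data/genes/${b:0:2}"         # assumes names are at least two characters
    mkdir -p "$d"
    mv "$f" "$d/"
done
```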
> We would not run the system in a "hyperconverged" way but rather build a separate Ceph cluster from them. If I understand correctly, it looks like CephFS would be a good fit, and except for problems with directories having (too) many files we would be within the comfort zone of CephFS. If I also understand correctly, the optimizations that people point to as experimental/fringe, such as lazy I/O, will not be of much help with our use case.

My personal take is that for standard large read/write workloads CephFS should do pretty well. I think I was getting something like 60-70GB/s reads across our test performance cluster (but at least for me I was limited to about 3GB/s per client when using kernel cephfs, 8GB/s per client when using libcephfs directly). If you can make good use of ephemeral pinning you can scale aggregate small random performance alright (~10-20K IOPS per MDS; I was getting something like ~100-200K aggregate in my test setup with ephemeral pinning). The cases where the entire workload is in one directory or where you have many clients all reading/writing to a single file are where we struggle imho. I saw much more volatile performance in those tests.

> What do you mean when you are referring to "power loss protection"? I assume that this "only" protects against data corruption and does not bring advantages in terms of performance. We're looking at buying enterprise-grade NVMes in any case. I just checked and the options we are following all have that feature.

A side benefit of many power loss protection implementations is that the drive has enough juice to basically ignore cache flushes and just sync writes to flash when it's optimal for the drive. IE if you do a sync write at the block level, the kernel will tell the drive to do something like an ATA_CMD_FLUSH (or whatever the equivalent NVMe command is). With proper PLP, it can just ignore that and keep it in cache, because if the drive loses power it can still sync it to disk after the fact. Without power loss protection, for the drive to be safe, it actually has to write the data out to the flash cell.

> Thank you again and best wishes,
> Manuel

On Wed, Jul 21, 2021 at 6:36 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:

Hi Manuel,

I was the one that did Red Hat's IO500 CephFS submission. Feel free to ask any questions you like. Generally speaking I could achieve 3GB/s pretty easily per kernel client and up to about 8GB/s per client with libcephfs directly (up to the aggregate cluster limits assuming enough concurrency).

Metadata is trickier. The fastest option is if you have files spread across directories that you can manually pin round-robin to MDSes, though you can do somewhat well with ephemeral pinning too as a more automatic option. If you have lots of clients dealing with lots of files in a single directory, that's where you revert to dynamic subtree partitioning, which tends to be quite a bit slower (though at least some of this is due to journaling overhead on the auth MDS). That's especially true if you have a significant number of active/active MDS servers (say 10-20+). We tended to consistently do very well with the "easy" IO500 tests and struggled more with the "hard" tests.

Otherwise most of the standard Ceph caveats apply. Replication eats into write performance, scrub/deep scrub can impact performance, choosing the right NVMe drive with power loss protection and low overhead is important, etc. Probably the most important questions you should be asking yourself are how you intend to use the storage, what you need out of it, and what you need to do to get there. Ceph has a lot of advantages regarding replication, self-healing, and consistency, and it's quite fast for some workloads given those advantages.
There are some workloads though (say unaligned small writes from hundreds of clients to random files in a single directory) that potentially could be pretty slow.

Mark

On 7/21/21 8:54 AM, Manuel Holtgrewe wrote:
> Dear all,
>
> we are looking towards setting up an all-NVMe CephFS instance in our
> high-performance compute system. Does anyone have any experience to share
> in an HPC setup or an NVMe setup mounted by dozens of nodes or more?
>
> I've followed the impressive work done at CERN on YouTube but otherwise
> there appear to be only few places using CephFS this way. There are a few
> CephFS-as-enterprise-storage vendors that sporadically advertise CephFS
> for HPC but it does not appear to be a strategic main target for them.
>
> I'd be happy to read about your experience/opinion on CephFS for HPC.
>
> Best wishes,
> Manuel

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx