Hi friends,

We've recently deployed a few all-flash OSD nodes to improve both bandwidth and IOPS for active data processing in CephFS. Before taking them into production we've been tuning them to see how far we can push the performance in practice. It would be interesting to hear about your experience, both regarding the bandwidth that's realistic to expect and any hints on further profiling we could do to identify the bottlenecks. Is it possible for RADOS (or even CephFS) on a single host to get anywhere close to line rate on 50Gb networking?

Our setup:

* Four dedicated SSD-only OSD nodes (Dell R7515, EPYC 7302), each with 16x Samsung PM883 7.68TB enterprise SSDs attached to an H740P RAID controller, with each disk configured as a single-drive RAID0. (We have tested HBA mode too; as expected, the battery-backed write-back cache significantly improves latency for small writes.)

* The client node is a slightly older Supermicro dual Xeon E5-2620v4. Both the OSD nodes and the client have 128GB RAM, and CPU throttling has been disabled.

* Mellanox 50Gb network cards. With iperf2 we get very close to line rate (~46 Gb/s) between all servers after the usual sysctl settings to increase network buffers and raising the card ring buffers to at least 4096 (see the settings appended below).

* All nodes run Ceph Pacific (16.2.0) installed through cephadm, on Linux kernel 5.8.0 from Ubuntu 20.04.2. All storage is BlueStore.

Starting with plain RADOS benchmarking (rados bench; exact commands below), write performance for 4M blocks is quite decent on a 3x replicated pool: 2.3GB/s at 16 threads, rising to roughly 2.8GB/s at 32 threads. Client load stays low during writes, and if we reduce the replicated pool size from 3 to 2, these numbers improve to ~3.5GB/s and ~4.2GB/s, so I assume the remaining overhead comes from the latency of the extra copies. Those numbers are good enough that we don't really worry about them :-)

However, when it comes to reading we seem stuck at around 2GB/s no matter what we try, and the load on the client is quite high, with the rados bench process using ~300% CPU. As a test we shut down one of the four OSD servers, which hardly affected write throughput and had no effect whatsoever on read throughput. In other words, the bottleneck appears to be somewhere on the client side?

Second, with CephFS on top we lose another chunk of performance. Copying a single large (5GB) file between CephFS and /dev/shm (dropping page caches between trials) gives roughly 1.8GB/s for writes but just 1GB/s for reads. (For CephFS clients we use the kernel client in Linux 5.8 with mount options noatime,nowsync,rsize=67108864,wsize=67108864,readdir_max_entries=8192,readdir_max_bytes=4194304,rasize=1073741824.)

While the absolute performance is quite OK, it seems a bit sad to only reach ~30% of line rate for writes and ~16% for reads, so we want to make sure we're not leaving anything on the table.
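For reference, here is roughly what we ran. The network buffer tuning was along these lines (values and interface name are illustrative; we iterated until iperf2 reached ~46 Gb/s):

    # Larger kernel network buffers (min/default/max for TCP):
    sysctl -w net.core.rmem_max=268435456
    sysctl -w net.core.wmem_max=268435456
    sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"

    # NIC ring buffers, raised to 4096 (interface name is an example):
    ethtool -G enp65s0f0 rx 4096 tx 4096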
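The rados bench runs were essentially the following (pool name is just an example):

    # 60s of 4M writes at 16 and 32 concurrent ops on the replicated
    # pool; --no-cleanup keeps the objects around for the read test:
    rados bench -p bench 60 write -b 4M -t 16 --no-cleanup
    rados bench -p bench 60 write -b 4M -t 32 --no-cleanup

    # Sequential reads of the objects written above:
    rados bench -p bench 60 seq -t 16
    rados bench -p bench 60 seq -t 32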
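And the CephFS test, with the mount options quoted above (monitor address, client name and paths are examples; auth options omitted):

    mount -t ceph mon1:/ /mnt/cephfs -o name=admin,noatime,nowsync,rsize=67108864,wsize=67108864,readdir_max_entries=8192,readdir_max_bytes=4194304,rasize=1073741824

    # Write: RAM disk -> CephFS, after dropping page caches
    sync && echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/shm/bigfile of=/mnt/cephfs/bigfile bs=4M

    # Read: CephFS -> RAM disk, dropping caches again first
    sync && echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/cephfs/bigfile of=/dev/shm/bigfile bs=4M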
Any suggestions for what we could do to identify the bottlenecks would be welcome; we'd be quite happy to invest in additional hardware if necessary, but right now we're not quite sure what would improve things :-)

All the best,

Erik

--
Erik Lindahl <erik.lindahl@xxxxxxxxx>
Science for Life Laboratory, Box 1031, 17121 Solna, Sweden