Hi friends,

We've recently deployed a few all-flash OSD nodes to improve both bandwidth and IOPS for active data processing in CephFS. Before taking them into production we've been tuning them to see how far we can push the performance in practice. It would be interesting to hear about your experience, both regarding the bandwidth that's realistic to expect and any hints on further profiling we could do to identify the bottlenecks. Is it possible for RADOS (or even CephFS) on a single host to get anywhere close to line rate on 50Gb networking?

Our setup:

* Four dedicated SSD-only OSD nodes (Dell R7515, EPYC 7302), each with 16x Samsung PM883 7.68TB enterprise SSDs attached to an H740P RAID controller, with each disk configured as a single-drive RAID0. (We have tested HBA mode too; as expected, the battery-backed write-back cache significantly improves latency for small writes.)

* The client node is a slightly older Supermicro dual Xeon E5-2620v4. Both the OSD nodes and the client have 128GB RAM, and CPU throttling has been disabled.

* Mellanox 50Gb network cards. With iperf2 we get very close to line rate (~46 Gb/s) between all servers after the usual sysctl settings to increase network buffers and raising the card ring buffers to at least 4096 (see the settings appended below).

* All nodes run Ceph Pacific (16.2.0) installed through cephadm, on Linux kernel 5.8.0 from Ubuntu 20.04.2. All storage is BlueStore.

Starting with plain RADOS benchmarking (rados bench; exact commands below), write performance for 4M blocks is quite decent on a 3x replicated pool: 2.3GB/s at 16 threads, rising to roughly 2.8GB/s at 32 threads. Client load stays low during writes, and if we reduce the replicated pool size from 3 to 2, these numbers improve to ~3.5GB/s and ~4.2GB/s, so I assume the remaining overhead comes from the latency of the extra copies. Those numbers are good enough that we don't really worry about them :-)

However, when it comes to reading we seem stuck at around 2GB/s no matter what we try, and the load on the client is quite high, with the rados bench process using ~300% CPU. As a test we shut down one of the four OSD servers, which hardly affected write throughput and had no effect whatsoever on read throughput. In other words, the bottleneck appears to be somewhere on the client side?

Second, with CephFS on top we lose another chunk of performance. Copying a single large (5GB) file between CephFS and /dev/shm (dropping page caches between trials) gives roughly 1.8GB/s for writes but just 1GB/s for reads. (For CephFS clients we use the kernel client in Linux 5.8 with mount options noatime,nowsync,rsize=67108864,wsize=67108864,readdir_max_entries=8192,readdir_max_bytes=4194304,rasize=1073741824.)

While the absolute performance is quite OK, it seems a bit sad to only reach ~30% of line rate for writes and ~16% for reads, so we want to make sure we're not leaving anything on the table.
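For reference, here is roughly what we ran. The network buffer tuning was along these lines (values and interface name are illustrative; we iterated until iperf2 reached ~46 Gb/s):

    # Larger kernel network buffers (min/default/max for TCP):
    sysctl -w net.core.rmem_max=268435456
    sysctl -w net.core.wmem_max=268435456
    sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"

    # NIC ring buffers, raised to 4096 (interface name is an example):
    ethtool -G enp65s0f0 rx 4096 tx 4096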
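The rados bench runs were essentially the following (pool name is just an example):

    # 60s of 4M writes at 16 and 32 concurrent ops on the replicated
    # pool; --no-cleanup keeps the objects around for the read test:
    rados bench -p bench 60 write -b 4M -t 16 --no-cleanup
    rados bench -p bench 60 write -b 4M -t 32 --no-cleanup

    # Sequential reads of the objects written above:
    rados bench -p bench 60 seq -t 16
    rados bench -p bench 60 seq -t 32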
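And the CephFS test, with the mount options quoted above (monitor address, client name and paths are examples; auth options omitted):

    mount -t ceph mon1:/ /mnt/cephfs -o name=admin,noatime,nowsync,rsize=67108864,wsize=67108864,readdir_max_entries=8192,readdir_max_bytes=4194304,rasize=1073741824

    # Write: RAM disk -> CephFS, after dropping page caches
    sync && echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/shm/bigfile of=/mnt/cephfs/bigfile bs=4M

    # Read: CephFS -> RAM disk, dropping caches again first
    sync && echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/cephfs/bigfile of=/dev/shm/bigfile bs=4M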
Any suggestions for what we could do to identify the bottlenecks would be welcome; we'd be quite happy to invest in additional hardware if necessary, but right now we're not quite sure what would improve things :-)

All the best,

Erik

--
Erik Lindahl <erik.lindahl@xxxxxxxxx>
Science for Life Laboratory, Box 1031, 17121 Solna, Sweden