Hi Dominik,

RADOS bench performs parallel IOs, which stresses the internal configuration, but it will not reflect the speed of an individual client. Ceph is inherently designed for fairness, due to the pseudo-random distribution of data and the sharded storage design.

Kernel mounts are going to be fastest, and you can play with caching parameters and things like readahead (see the mount sketch below).

When reading, you read from one OSD at a time; when writing, you write to all redundant OSDs (three if you use 3x replication, or however many with an erasure-coded setup). A write is acknowledged only when all participating OSDs have completed/hardened their writes, so you pay network latency + OS overhead + Ceph overhead + drive latency for every write (the queue-depth-1 bench below makes this visible).

We have used tricks such as parallelizing IO across multiple RBD images and increasing queue depth, but that is not CephFS, rather ZFS or XFS on top of RBD. With proper block alignment, we have seen reasonable performance from such setups (rough outline below).

For your network links: are you using 802.3ad aggregation, and is your MTU set correctly across the board - clients, OSD nodes, MONs? You will ideally want the same MTU (1500, 9000, 9216, etc.) across the entire cluster. Also check your hashing algorithm (we use layer 2+3 for most setups); a few quick checks are listed below.

I would also focus on benchmarking what you will actually use this cluster for (an fio sketch is below). If there is one thing I have learned from the storage industry, it is that there are lies, damn lies, and benchmarks. If your workload tops out at 5000 IOPS, you do not need a million IOPS. If you need good latency response, buy the best NVMe drives possible for your use case, because latency always goes all the way down to the drive itself.
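For readahead on the kernel client, a minimal sketch (the monitor names are placeholders and the auth/secret options are omitted): the rasize mount option sets the maximum readahead in bytes, e.g.

mount -t ceph mon1,mon2,mon3:/ /mnt/cephfs -o name=admin,rasize=67108864

would allow up to 64 MiB of readahead. For ceph-fuse, the client_readahead_max_bytes client option serves a similar purpose.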
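That per-write cost is also why your dd numbers look the way they do: dd with oflag=direct keeps only one IO in flight, so it runs at queue depth 1. You can measure the latency floor directly by rerunning the bench with a single concurrent op against your existing testbench pool:

rados bench -p testbench 10 write -t 1 --no-cleanup

Roughly, 1 divided by the average latency it reports is the IOPS ceiling for any single synchronous writer, no matter how fast the drives are.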
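A rough outline of the RBD approach, with made-up pool and image names and sizes you would of course tune:

for i in 1 2 3 4; do rbd create rbdpool/vol$i --size 1T; rbd map rbdpool/vol$i; done
zpool create -o ashift=12 tank /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3

ZFS then stripes across the four images, so even a single client keeps four RBD queues busy in parallel, and ashift=12 keeps writes 4K-aligned. (The /dev/rbdN names assume the images map in order; check rbd showmapped on your system.)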
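Quick checks for the link layer (interface and host names are placeholders):

ip link show bond0 | grep mtu      # run on clients, OSD nodes, and MONs; MTUs should match
ping -M do -s 8972 osd-node-1      # proves a 9000 MTU path end to end (8972 + 28 bytes of headers)
cat /proc/net/bonding/bond0        # shows the 802.3ad state and the xmit hash policy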
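For workload-shaped benchmarking, fio against the CephFS mount gets much closer to reality than dd; every number below is a placeholder to be shaped after your application's actual block size, queue depth, and read/write mix:

fio --name=worklike --directory=/mnt/cephfs --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=64k --iodepth=16 --numjobs=4 --size=4G --group_reporting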
Hope this helps, and others can likely address the CephFS-specific aspects for you.

--
Alex Gorbachev
https://alextelescope.blogspot.com


On Wed, Mar 22, 2023 at 10:16 AM Dominik Baack <dominik.baack@xxxxxxxxxxxxxxxxxx> wrote:

> Hi,
>
> we are currently testing out ways to increase Ceph performance, because
> what we experience so far is very close to unusable.
>
> For the test cluster we are utilizing 4 nodes with the following hardware:
>
> Dual 200GbE Mellanox Ethernet
> 2x EPYC Rome 7302
> 16x 32GB 3200MHz ECC
> 9x 15.36TB Micron 9300 Pro
>
> For production this will be extended to all 8 nodes if it shows
> promising results.
>
> - Ceph was installed with cephadm.
> - MDS and OSDs are located on the same nodes.
> - Mostly using stock config.
>
> - Network performance tested with iperf3 seems fine: 26 Gbits/s with -P4
>   on a single port (details below), and close to 200 Gbits/s with 10
>   parallel instances and servers.
>
> When testing a mounted CephFS on the working nodes in various
> configurations, I only got <50 MB/s for the fuse mount and <270 MB/s for
> kernel mounts (dd command and output attached below).
> In addition, the Ceph dashboard and our Grafana monitoring report packet
> loss on all relevant interfaces during load, which does not occur during
> the normal iperf load tests or rsync/scp file transfers.
>
> RADOS bench shows performance around 2000 MB/s, which is not the max
> performance of the SSDs but fine for us (details below).
>
> Why is the filesystem so slow compared to the individual components?
>
> Cheers
> Dominik
>
>
> Test details:
> ------------------------------------------------------------------------------------------------------
>
> Some tests done on working nodes:
>
> Ceph mounted with ceph-fuse:
>
> root@ml2ran10:/mnt/cephfs/backup# dd if=/dev/zero of=testfile bs=1M count=4096 oflag=direct
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes (4,3 GB, 4,0 GiB) copied, 88,2933 s, 48,6 MB/s
>
> Ceph mounted with kernel driver:
>
> root@ml2ran06:/mnt/cephfs/backup# dd if=/dev/zero of=testfile bs=1M count=4096 oflag=direct
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 16.0989 s, 267 MB/s
>
> Storage node:
>
> With fuse:
>
> root@ml2rsn05:/mnt/ml2r_storage/backup# dd if=/dev/zero of=testfile bs=1M count=4096 oflag=direct
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.9977 s, 79.5 MB/s
>
> Kernel mount:
>
> dd if=/dev/zero of=testfile bs=1M count=4096 oflag=direct
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 17.6726 s, 243 MB/s
>
> _______________________________________________________
>
> iperf3:
>
> iperf3 --zerocopy -n 10240M -P4 -c ml2ran08s0 -p 4701 -i 15 -b 200000000000
> Connecting to host ml2ran08s0, port 4701
> [  5] local 129.217.31.180 port 43958 connected to 129.217.31.218 port 4701
> [  7] local 129.217.31.180 port 43960 connected to 129.217.31.218 port 4701
> [  9] local 129.217.31.180 port 43962 connected to 129.217.31.218 port 4701
> [ 11] local 129.217.31.180 port 43964 connected to 129.217.31.218 port 4701
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-3.21  sec  2.50 GBytes  6.69 Gbits/sec    0    632 KBytes
> [  7]   0.00-3.21  sec  2.50 GBytes  6.70 Gbits/sec    0    522 KBytes
> [  9]   0.00-3.21  sec  2.50 GBytes  6.69 Gbits/sec    0    612 KBytes
> [ 11]   0.00-3.21  sec  2.50 GBytes  6.69 Gbits/sec    0    430 KBytes
> [SUM]   0.00-3.21  sec  10.0 GBytes  26.8 Gbits/sec    0
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-3.21  sec  2.50 GBytes  6.69 Gbits/sec    0   sender
> [  5]   0.00-3.21  sec  2.50 GBytes  6.67 Gbits/sec        receiver
> [  7]   0.00-3.21  sec  2.50 GBytes  6.70 Gbits/sec    0   sender
> [  7]   0.00-3.21  sec  2.50 GBytes  6.67 Gbits/sec        receiver
> [  9]   0.00-3.21  sec  2.50 GBytes  6.69 Gbits/sec    0   sender
> [  9]   0.00-3.21  sec  2.49 GBytes  6.67 Gbits/sec        receiver
> [ 11]   0.00-3.21  sec  2.50 GBytes  6.69 Gbits/sec    0   sender
> [ 11]   0.00-3.21  sec  2.50 GBytes  6.67 Gbits/sec        receiver
> [SUM]   0.00-3.21  sec  10.0 GBytes  26.8 Gbits/sec    0   sender
> [SUM]   0.00-3.21  sec  9.98 GBytes  26.7 Gbits/sec        receiver
>
> _________________________________________________________________
>
> RADOS bench on storage node:
>
> # rados bench -p testbench 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_ml2rsn05_2829244
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>     0       0         0         0         0         0           -           0
>     1      16       747       731   2923.84      2924   0.0178757   0.0216262
>     2      16      1506      1490   2979.71      3036   0.0308664   0.0213685
>     3      16      2267      2251   3000.99      3044   0.0259053   0.0212556
>     4      16      3058      3042   3041.62      3164   0.0227621   0.0209792
>     5      16      3850      3834    3066.8      3168   0.0130519   0.0208148
>     6      16      4625      4609   3072.26      3100    0.151371   0.0207904
>     7      16      5381      5365   3065.28      3024   0.0300368   0.0208345
>     8      16      6172      6156   3077.57      3164   0.0197728   0.0207714
>     9      16      6971      6955   3090.67      3196   0.0142751   0.0206786
>    10      14      7772      7758   3102.76      3212   0.0181034    0.020605
> Total time run:         10.0179
> Total writes made:      7772
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     3103.23
> Stddev Bandwidth:       93.3676
> Max bandwidth (MB/sec): 3212
> Min bandwidth (MB/sec): 2924
> Average IOPS:           775
> Stddev IOPS:            23.3419
> Max IOPS:               803
> Min IOPS:               731
> Average Latency(s):     0.020598
> Stddev Latency(s):      0.00731743
> Max latency(s):         0.151371
> Min latency(s):         0.00966991
>
>
> # rados bench -p testbench 10 seq
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>     0       0         0         0         0         0           -           0
>     1      15       657       642   2567.32      2568    0.011104    0.022631
>     2      15      1244      1229    2456.9      2348   0.0115248    0.019485
>     3      16      1499      1483   1976.64      1016  0.00722983   0.0177887
>     4      15      1922      1907   1906.16      1696   0.0142242   0.0327382
>     5      15      2593      2578    2061.6      2684    0.011758   0.0301774
>     6      16      3142      3126   2083.23      2192  0.00915926    0.027478
>     7      16      3276      3260   1862.23       536  0.00824714   0.0267449
>     8      16      3606      3590   1794.43      1320   0.0118938   0.0350541
>     9      16      4293      4277   1900.32      2748   0.0301886   0.0330604
>    10      14      5003      4989   1995.04      2848   0.0389717   0.0314977
> Total time run:       10.0227
> Total reads made:     5003
> Read size:            4194304
> Object size:          4194304
> Bandwidth (MB/sec):   1996.67
> Average IOPS:         499
> Stddev IOPS:          202.3
> Max IOPS:             712
> Min IOPS:             134
> Average Latency(s):   0.0314843
> Max latency(s):       3.04463
> Min latency(s):       0.00551523
>
>
> # rados bench -p testbench 10 rand
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>     0      15        15         0         0         0           -           0
>     1      15       680       665   2657.61      2660  0.00919807   0.0224833
>     2      15      1273      1258   2514.26      2372  0.00839656   0.0247125
>     3      16      1863      1847    2461.4      2356  0.00994467   0.0236565
>     4      16      2064      2048   2047.14       804  0.00809139   0.0223506
>     5      16      2064      2048   1637.79         0           -   0.0223506
>     6      16      2477      2461   1640.12       826   0.0286315   0.0383254
>     7      16      3102      3086   1762.89      2500   0.0267464   0.0349189
>     8      16      3513      3497      1748      1644  0.00890952    0.032269
>     9      16      3617      3601      1600       416  0.00626917   0.0316019
>    10      15      4014      3999   1599.18      1592   0.0461076   0.0393606
> Total time run:       10.0481
> Total reads made:     4014
> Read size:            4194304
> Object size:          4194304
> Bandwidth (MB/sec):   1597.91
> Average IOPS:         399
> Stddev IOPS:          239.089
> Max IOPS:             665
> Min IOPS:             0
> Average Latency(s):   0.0394035
> Max latency(s):       3.00962
> Min latency(s):       0.00449537

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx