Hi Dominik,

RADOS bench performs parallel IOs, which stresses the internal configuration, but it will not reflect the speed of an individual client. Ceph is inherently designed for fairness, due to the pseudo-random distribution of data and the sharded storage design.

Kernel mounts are going to be fastest, and you can play with caching parameters and things like readahead (see the mount sketch below).

When reading, you read from one OSD at a time; when writing, you write to all redundant OSDs (three if you use 3x replication, or however many with an erasure-coded setup). A write is acknowledged only when all participating OSDs have completed/hardened their writes, so you pay network latency + OS overhead + Ceph overhead + drive latency for every write (the queue-depth-1 bench below makes this visible).

We have used tricks such as parallelizing IO across multiple RBD images and increasing queue depth, but that is not CephFS, rather ZFS or XFS on top of RBD. With proper block alignment, we have seen reasonable performance from such setups (rough outline below).

For your network links: are you using 802.3ad aggregation, and is your MTU set correctly across the board - clients, OSD nodes, MONs? You will ideally want the same MTU (1500, 9000, 9216, etc.) across the entire cluster. Also check your hashing algorithm (we use layer 2+3 for most setups); a few quick checks are listed below.

I would also focus on benchmarking what you will actually use this cluster for (an fio sketch is below). If there is one thing I have learned from the storage industry, it is that there are lies, damn lies, and benchmarks. If your workload tops out at 5000 IOPS, you do not need a million IOPS. If you need good latency response, buy the best NVMe drives possible for your use case, because latency always goes all the way down to the drive itself.
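For readahead on the kernel client, a minimal sketch (the monitor names are placeholders and the auth/secret options are omitted): the rasize mount option sets the maximum readahead in bytes, e.g.

mount -t ceph mon1,mon2,mon3:/ /mnt/cephfs -o name=admin,rasize=67108864

would allow up to 64 MiB of readahead. For ceph-fuse, the client_readahead_max_bytes client option serves a similar purpose.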
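That per-write cost is also why your dd numbers look the way they do: dd with oflag=direct keeps only one IO in flight, so it runs at queue depth 1. You can measure the latency floor directly by rerunning the bench with a single concurrent op against your existing testbench pool:

rados bench -p testbench 10 write -t 1 --no-cleanup

Roughly, 1 divided by the average latency it reports is the IOPS ceiling for any single synchronous writer, no matter how fast the drives are.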
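A rough outline of the RBD approach, with made-up pool and image names and sizes you would of course tune:

for i in 1 2 3 4; do rbd create rbdpool/vol$i --size 1T; rbd map rbdpool/vol$i; done
zpool create -o ashift=12 tank /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3

ZFS then stripes across the four images, so even a single client keeps four RBD queues busy in parallel, and ashift=12 keeps writes 4K-aligned. (The /dev/rbdN names assume the images map in order; check rbd showmapped on your system.)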
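Quick checks for the link layer (interface and host names are placeholders):

ip link show bond0 | grep mtu      # run on clients, OSD nodes, and MONs; MTUs should match
ping -M do -s 8972 osd-node-1      # proves a 9000 MTU path end to end (8972 + 28 bytes of headers)
cat /proc/net/bonding/bond0        # shows the 802.3ad state and the xmit hash policy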
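For workload-shaped benchmarking, fio against the CephFS mount gets much closer to reality than dd; every number below is a placeholder to be shaped after your application's actual block size, queue depth, and read/write mix:

fio --name=worklike --directory=/mnt/cephfs --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=64k --iodepth=16 --numjobs=4 --size=4G --group_reporting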
Hope this helps, and others can likely address the CephFS-specific aspects for you.

--
Alex Gorbachev
https://alextelescope.blogspot.com


On Wed, Mar 22, 2023 at 10:16 AM Dominik Baack <dominik.baack@xxxxxxxxxxxxxxxxxx> wrote:

> Hi,
>
> we are currently testing out ways to increase Ceph performance, because
> what we experience so far is very close to unusable.
>
> For the test cluster we are utilizing 4 nodes with the following hardware:
>
> Dual 200GbE Mellanox Ethernet
> 2x EPYC Rome 7302
> 16x 32GB 3200MHz ECC
> 9x 15.36TB Micron 9300 Pro
>
> For production this will be extended to all 8 nodes if it shows
> promising results.
>
> - Ceph was installed with cephadm.
> - MDS and OSDs are located on the same nodes.
> - Mostly using stock config.
>
> - Network performance tested with iperf3 seems fine: 26 Gbits/s with -P4
>   on a single port (details below), and close to 200 Gbits/s with 10
>   parallel instances and servers.
>
> When testing a mounted CephFS on the working nodes in various
> configurations, I only got <50 MB/s for the fuse mount and <270 MB/s for
> kernel mounts (dd command and output attached below).
> In addition, the Ceph dashboard and our Grafana monitoring report packet
> loss on all relevant interfaces during load, which does not occur during
> the normal iperf load tests or rsync/scp file transfers.
>
> RADOS bench shows performance around 2000 MB/s, which is not the max
> performance of the SSDs but fine for us (details below).
>
> Why is the filesystem so slow compared to the individual components?
>
> Cheers
> Dominik
>
>
> Test details:
> ------------------------------------------------------------------------------------------------------
>
> Some tests done on working nodes:
>
> Ceph mounted with ceph-fuse:
>
> root@ml2ran10:/mnt/cephfs/backup# dd if=/dev/zero of=testfile bs=1M count=4096 oflag=direct
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes (4,3 GB, 4,0 GiB) copied, 88,2933 s, 48,6 MB/s
>
> Ceph mounted with kernel driver:
>
> root@ml2ran06:/mnt/cephfs/backup# dd if=/dev/zero of=testfile bs=1M count=4096 oflag=direct
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 16.0989 s, 267 MB/s
>
> Storage node:
>
> With fuse:
>
> root@ml2rsn05:/mnt/ml2r_storage/backup# dd if=/dev/zero of=testfile bs=1M count=4096 oflag=direct
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.9977 s, 79.5 MB/s
>
> Kernel mount:
>
> dd if=/dev/zero of=testfile bs=1M count=4096 oflag=direct
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 17.6726 s, 243 MB/s
>
> _______________________________________________________
>
> iperf3:
>
> iperf3 --zerocopy -n 10240M -P4 -c ml2ran08s0 -p 4701 -i 15 -b 200000000000
> Connecting to host ml2ran08s0, port 4701
> [  5] local 129.217.31.180 port 43958 connected to 129.217.31.218 port 4701
> [  7] local 129.217.31.180 port 43960 connected to 129.217.31.218 port 4701
> [  9] local 129.217.31.180 port 43962 connected to 129.217.31.218 port 4701
> [ 11] local 129.217.31.180 port 43964 connected to 129.217.31.218 port 4701
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-3.21  sec  2.50 GBytes  6.69 Gbits/sec    0    632 KBytes
> [  7]   0.00-3.21  sec  2.50 GBytes  6.70 Gbits/sec    0    522 KBytes
> [  9]   0.00-3.21  sec  2.50 GBytes  6.69 Gbits/sec    0    612 KBytes
> [ 11]   0.00-3.21  sec  2.50 GBytes  6.69 Gbits/sec    0    430 KBytes
> [SUM]   0.00-3.21  sec  10.0 GBytes  26.8 Gbits/sec    0
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-3.21  sec  2.50 GBytes  6.69 Gbits/sec    0   sender
> [  5]   0.00-3.21  sec  2.50 GBytes  6.67 Gbits/sec        receiver
> [  7]   0.00-3.21  sec  2.50 GBytes  6.70 Gbits/sec    0   sender
> [  7]   0.00-3.21  sec  2.50 GBytes  6.67 Gbits/sec        receiver
> [  9]   0.00-3.21  sec  2.50 GBytes  6.69 Gbits/sec    0   sender
> [  9]   0.00-3.21  sec  2.49 GBytes  6.67 Gbits/sec        receiver
> [ 11]   0.00-3.21  sec  2.50 GBytes  6.69 Gbits/sec    0   sender
> [ 11]   0.00-3.21  sec  2.50 GBytes  6.67 Gbits/sec        receiver
> [SUM]   0.00-3.21  sec  10.0 GBytes  26.8 Gbits/sec    0   sender
> [SUM]   0.00-3.21  sec  9.98 GBytes  26.7 Gbits/sec        receiver
>
> _________________________________________________________________
>
> RADOS bench on storage node:
>
> # rados bench -p testbench 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_ml2rsn05_2829244
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>     0       0         0         0         0         0           -           0
>     1      16       747       731   2923.84      2924   0.0178757   0.0216262
>     2      16      1506      1490   2979.71      3036   0.0308664   0.0213685
>     3      16      2267      2251   3000.99      3044   0.0259053   0.0212556
>     4      16      3058      3042   3041.62      3164   0.0227621   0.0209792
>     5      16      3850      3834    3066.8      3168   0.0130519   0.0208148
>     6      16      4625      4609   3072.26      3100    0.151371   0.0207904
>     7      16      5381      5365   3065.28      3024   0.0300368   0.0208345
>     8      16      6172      6156   3077.57      3164   0.0197728   0.0207714
>     9      16      6971      6955   3090.67      3196   0.0142751   0.0206786
>    10      14      7772      7758   3102.76      3212   0.0181034    0.020605
> Total time run:         10.0179
> Total writes made:      7772
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     3103.23
> Stddev Bandwidth:       93.3676
> Max bandwidth (MB/sec): 3212
> Min bandwidth (MB/sec): 2924
> Average IOPS:           775
> Stddev IOPS:            23.3419
> Max IOPS:               803
> Min IOPS:               731
> Average Latency(s):     0.020598
> Stddev Latency(s):      0.00731743
> Max latency(s):         0.151371
> Min latency(s):         0.00966991
>
>
> # rados bench -p testbench 10 seq
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>     0       0         0         0         0         0           -           0
>     1      15       657       642   2567.32      2568    0.011104    0.022631
>     2      15      1244      1229    2456.9      2348   0.0115248    0.019485
>     3      16      1499      1483   1976.64      1016  0.00722983   0.0177887
>     4      15      1922      1907   1906.16      1696   0.0142242   0.0327382
>     5      15      2593      2578    2061.6      2684    0.011758   0.0301774
>     6      16      3142      3126   2083.23      2192  0.00915926    0.027478
>     7      16      3276      3260   1862.23       536  0.00824714   0.0267449
>     8      16      3606      3590   1794.43      1320   0.0118938   0.0350541
>     9      16      4293      4277   1900.32      2748   0.0301886   0.0330604
>    10      14      5003      4989   1995.04      2848   0.0389717   0.0314977
> Total time run:       10.0227
> Total reads made:     5003
> Read size:            4194304
> Object size:          4194304
> Bandwidth (MB/sec):   1996.67
> Average IOPS:         499
> Stddev IOPS:          202.3
> Max IOPS:             712
> Min IOPS:             134
> Average Latency(s):   0.0314843
> Max latency(s):       3.04463
> Min latency(s):       0.00551523
>
>
> # rados bench -p testbench 10 rand
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>     0      15        15         0         0         0           -           0
>     1      15       680       665   2657.61      2660  0.00919807   0.0224833
>     2      15      1273      1258   2514.26      2372  0.00839656   0.0247125
>     3      16      1863      1847    2461.4      2356  0.00994467   0.0236565
>     4      16      2064      2048   2047.14       804  0.00809139   0.0223506
>     5      16      2064      2048   1637.79         0           -   0.0223506
>     6      16      2477      2461   1640.12       826   0.0286315   0.0383254
>     7      16      3102      3086   1762.89      2500   0.0267464   0.0349189
>     8      16      3513      3497      1748      1644  0.00890952    0.032269
>     9      16      3617      3601      1600       416  0.00626917   0.0316019
>    10      15      4014      3999   1599.18      1592   0.0461076   0.0393606
> Total time run:       10.0481
> Total reads made:     4014
> Read size:            4194304
> Object size:          4194304
> Bandwidth (MB/sec):   1597.91
> Average IOPS:         399
> Stddev IOPS:          239.089
> Max IOPS:             665
> Min IOPS:             0
> Average Latency(s):   0.0394035
> Max latency(s):       3.00962
> Min latency(s):       0.00449537

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx