Hi,

I built a CEPH 16.2.x cluster with relatively fast and modern hardware, and its performance is kind of disappointing. I would very much appreciate any advice and/or pointers :-)

The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:

2 x Intel(R) Xeon(R) Gold 5220R CPUs
384 GB RAM
2 x boot drives
2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
2 x Intel XL710 NICs connected to a pair of 40/100GE switches

All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel; apparmor is disabled and energy-saving features are disabled. The network between the CEPH nodes is 40G, the CEPH access network is 40G, and the average latencies are < 0.15 ms. I've personally tested the network for throughput, latency and loss, and can confirm that it's operating as expected and doesn't exhibit any issues at idle or under load.

The CEPH cluster is set up with 2 storage classes, NVME and HDD, with the 2 smaller NVME drives in each node used for DB/WAL and each HDD allocated a DB/WAL slice on them.

ceph osd tree output:

ID   CLASS  WEIGHT     TYPE NAME                STATUS  REWEIGHT  PRI-AFF
 -1         288.37488  root default
-13         288.37488      datacenter ste
-14         288.37488          rack rack01
 -7          96.12495              host ceph01
  0   hdd     9.38680                  osd.0        up   1.00000  1.00000
  1   hdd     9.38680                  osd.1        up   1.00000  1.00000
  2   hdd     9.38680                  osd.2        up   1.00000  1.00000
  3   hdd     9.38680                  osd.3        up   1.00000  1.00000
  4   hdd     9.38680                  osd.4        up   1.00000  1.00000
  5   hdd     9.38680                  osd.5        up   1.00000  1.00000
  6   hdd     9.38680                  osd.6        up   1.00000  1.00000
  7   hdd     9.38680                  osd.7        up   1.00000  1.00000
  8   hdd     9.38680                  osd.8        up   1.00000  1.00000
  9   nvme    5.82190                  osd.9        up   1.00000  1.00000
 10   nvme    5.82190                  osd.10       up   1.00000  1.00000
-10          96.12495              host ceph02
 11   hdd     9.38680                  osd.11       up   1.00000  1.00000
 12   hdd     9.38680                  osd.12       up   1.00000  1.00000
 13   hdd     9.38680                  osd.13       up   1.00000  1.00000
 14   hdd     9.38680                  osd.14       up   1.00000  1.00000
 15   hdd     9.38680                  osd.15       up   1.00000  1.00000
 16   hdd     9.38680                  osd.16       up   1.00000  1.00000
 17   hdd     9.38680                  osd.17       up   1.00000  1.00000
 18   hdd     9.38680                  osd.18       up   1.00000  1.00000
 19   hdd     9.38680                  osd.19       up   1.00000  1.00000
 20   nvme    5.82190                  osd.20       up   1.00000  1.00000
 21   nvme    5.82190                  osd.21       up   1.00000  1.00000
 -3          96.12495              host ceph03
 22   hdd     9.38680                  osd.22       up   1.00000  1.00000
 23   hdd     9.38680                  osd.23       up   1.00000  1.00000
 24   hdd     9.38680                  osd.24       up   1.00000  1.00000
 25   hdd     9.38680                  osd.25       up   1.00000  1.00000
 26   hdd     9.38680                  osd.26       up   1.00000  1.00000
 27   hdd     9.38680                  osd.27       up   1.00000  1.00000
 28   hdd     9.38680                  osd.28       up   1.00000  1.00000
 29   hdd     9.38680                  osd.29       up   1.00000  1.00000
 30   hdd     9.38680                  osd.30       up   1.00000  1.00000
 31   nvme    5.82190                  osd.31       up   1.00000  1.00000
 32   nvme    5.82190                  osd.32       up   1.00000  1.00000

ceph df:

--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    253 TiB  241 TiB  13 TiB    13 TiB       5.00
nvme    35 TiB   35 TiB  82 GiB    82 GiB       0.23
TOTAL  288 TiB  276 TiB  13 TiB    13 TiB       4.42

--- POOLS ---
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
images                 12  256   24 GiB    3.15k   73 GiB   0.03     76 TiB
volumes                13  256  839 GiB  232.16k  2.5 TiB   1.07     76 TiB
backups                14  256   31 GiB    8.56k   94 GiB   0.04     76 TiB
vms                    15  256  752 GiB  198.80k  2.2 TiB   0.96     76 TiB
device_health_metrics  16   32   35 MiB       39  106 MiB      0     76 TiB
volumes-nvme           17  256   28 GiB    7.21k   81 GiB   0.24     11 TiB
ec-volumes-meta        18  256   27 KiB        4   92 KiB      0     76 TiB
ec-volumes-data        19  256    8 KiB        1   12 KiB      0    152 TiB

Please disregard the ec-pools, as they're not currently in use. All other pools are configured with min_size=2, size=3. All pools are bound to the HDD storage class except for 'volumes-nvme', which is bound to NVME.
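For reference, the device-class binding is done with per-class replicated CRUSH rules, roughly along these lines (recreated from memory, so the exact rule names here are illustrative):

# Replicated rules restricted to one device class, host failure domain:
> ceph osd crush rule create-replicated replicated-hdd default host hdd
> ceph osd crush rule create-replicated replicated-nvme default host nvme

# Each pool is then pointed at the matching rule, e.g.:
> ceph osd pool set volumes crush_rule replicated-hdd
> ceph osd pool set volumes-nvme crush_rule replicated-nvme

'ceph osd pool get <pool> crush_rule' and 'ceph osd crush rule dump' confirm the assignments, in case anyone wants to double-check my layout.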
The number of PGs was increased recently, as the autoscaler was giving me a very uneven PG distribution across devices, and we're expecting to add 3 more nodes of exactly the same configuration in the coming weeks. I have to emphasize that I tested different PG numbers and they didn't have a noticeable impact on the cluster performance.

The main issue is that this beautiful cluster isn't very fast. When I test against the 'volumes' pool, residing on the HDD storage class (HDDs with DB/WAL on NVME), I get unexpectedly low throughput numbers:

> rados -p volumes bench 30 write --no-cleanup
...
Total time run:         30.3078
Total writes made:      3731
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     492.415
Stddev Bandwidth:       161.777
Max bandwidth (MB/sec): 820
Min bandwidth (MB/sec): 204
Average IOPS:           123
Stddev IOPS:            40.4442
Max IOPS:               205
Min IOPS:               51
Average Latency(s):     0.129115
Stddev Latency(s):      0.143881
Max latency(s):         1.35669
Min latency(s):         0.0228179

> rados -p volumes bench 30 seq --no-cleanup
...
Total time run:       14.7272
Total reads made:     3731
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1013.36
Average IOPS:         253
Stddev IOPS:          63.8709
Max IOPS:             323
Min IOPS:             91
Average Latency(s):   0.0625202
Max latency(s):       0.551629
Min latency(s):       0.010683

On average, I get around 550 MB/s writes and 800 MB/s reads with 16 threads and 4MB blocks. These numbers don't look fantastic for this hardware. For comparison, I can push over 8 GB/s of throughput with fio (16 threads, 4MB blocks) from an RBD client (a KVM Linux VM) connected over a low-latency 40G network, probably hitting some OSD caches there:

   READ: bw=8525MiB/s (8939MB/s), 58.8MiB/s-1009MiB/s (61.7MB/s-1058MB/s), io=501GiB (538GB), run=60001-60153msec

Disk stats (read/write):
  vdc: ios=48163/0, merge=6027/0, ticks=1400509/0, in_queue=1305092, util=99.48%

The issue manifests when the same client does something closer to real-life usage, like a single-threaded write or read with 4KB blocks, as you would get with, for example, an ext4 file system:

> fio --name=ttt --ioengine=posixaio --rw=write --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
...
Run status group 0 (all jobs):
  WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=7694MiB (8067MB), run=64079-64079msec

Disk stats (read/write):
  vdc: ios=0/6985, merge=0/406, ticks=0/3062535, in_queue=3048216, util=77.31%

> fio --name=ttt --ioengine=posixaio --rw=read --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
...
Run status group 0 (all jobs):
   READ: bw=54.0MiB/s (56.7MB/s), 54.0MiB/s-54.0MiB/s (56.7MB/s-56.7MB/s), io=3242MiB (3399MB), run=60001-60001msec

Disk stats (read/write):
  vdc: ios=12952/3, merge=0/1, ticks=81706/1, in_queue=56336, util=99.13%

And this is where it really falls apart: the IOPS look decent, but the bandwidth is unexpectedly low. I just don't understand why a single RBD client writes at only 120 MB/s (sometimes slower), and 50 MB/s reads look like a bad joke ¯\_(ツ)_/¯

When I run these benchmarks, nothing seems to be overloaded: CPU and network are barely utilized, and OSD latencies don't show anything unusual. I am puzzled by these results, as in my opinion SAS HDDs with DB/WAL on NVME drives should produce better I/O bandwidth, both for writes and reads. I can easily get much better performance from a single HDD shared over the network via NFS or iSCSI.

I am open to suggestions and would very much appreciate comments and/or advice on how to improve the cluster performance.
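In the meantime, to narrow down where the single-threaded slowness comes from, I plan to re-run the 4K test directly against librbd from one of the hosts, bypassing QEMU/virtio entirely. A rough sketch, assuming fio is built with RBD support (the image name and client name below are just placeholders):

# Scratch image in the HDD-backed pool (placeholder name, 10 GiB)
> rbd create volumes/fio-test --size 10240

# Single-threaded 4K write straight through librbd, no VM in the path
> fio --name=rbd-4k-write --ioengine=rbd --clientname=admin --pool=volumes \
      --rbdname=fio-test --rw=write --bs=4k --iodepth=1 --numjobs=1 \
      --size=4g --runtime=60 --time_based

If that turns out noticeably faster than the in-guest run, I'd suspect the VM block layer (virtio settings, QEMU cache mode); if it's just as slow, the per-operation latency on the Ceph side is what I need to chase.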
Best regards,
Zakhar