Are you aware of this? https://yourcmc.ru/wiki/Ceph_performance

I am getting the results below with SSDs, 2.2 GHz Xeons and no CPU C-state/frequency/governor optimization, so your results with HDDs look quite OK to me.

[@c01 ~]# rados -p rbd.ssd bench 30 write
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_c01_2752661
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16       162       146   583.839       584    0.0807733    0.106959
    2      16       347       331   661.868       740     0.052621   0.0943461
    3      16       525       509   678.552       712    0.0493101   0.0934826
    4      16       676       660   659.897       604     0.107205   0.0958496
...
Total time run:         30.0622
Total writes made:      4454
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     592.638
Stddev Bandwidth:       65.0681
Max bandwidth (MB/sec): 740
Min bandwidth (MB/sec): 440
Average IOPS:           148
Stddev IOPS:            16.267
Max IOPS:               185
Min IOPS:               110
Average Latency(s):     0.107988
Stddev Latency(s):      0.0610883
Max latency(s):         0.452039
Min latency(s):         0.0209312
Cleaning up (deleting benchmark objects)
Removed 4454 objects
Clean up completed and total clean up time :0.732456

> Subject: CEPH 16.2.x: disappointing I/O performance
>
> Hi,
>
> I built a CEPH 16.2.x cluster with relatively fast and modern hardware, and
> its performance is kind of disappointing. I would very much appreciate
> advice and/or pointers :-)
>
> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
>
> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> 384 GB RAM
> 2 x boot drives
> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
> 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
>
> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
> apparmor is disabled, and energy-saving features are disabled. The network
> between the CEPH nodes is 40G, the CEPH access network is 40G, and the
> average latencies are < 0.15 ms. I have personally tested the network for
> throughput, latency and loss, and can tell that it is operating as expected
> and doesn't exhibit any issues at idle or under load.
>
> The CEPH cluster is set up with 2 storage classes, NVME and HDD, with the 2
> smaller NVME drives in each node used as DB/WAL and each HDD allocated as a
> separate OSD.
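Since you mention that energy-saving features are disabled: it is still worth confirming at runtime that the cores actually sit in the performance governor and don't drop into deep C-states, because that alone makes a big difference in the tests on the wiki page I linked above. A rough sketch of what I check on our Ubuntu OSD nodes (cpupower comes from the linux-tools packages; treat these commands as an example to adapt, not a recipe):

  # show which governor each core is using
  cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
  # pin the governor to performance
  cpupower frequency-set -g performance
  # optionally disable deep C-states (not persistent across reboots)
  cpupower idle-set -D 0
  # confirm the cores actually boost while a benchmark is running
  grep MHz /proc/cpuinfo | sort -t: -k2 -n | tail -3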
> ceph osd tree output:
>
> ID   CLASS  WEIGHT     TYPE NAME            STATUS  REWEIGHT  PRI-AFF
> -1          288.37488  root default
> -13         288.37488      datacenter ste
> -14         288.37488          rack rack01
> -7           96.12495              host ceph01
>  0   hdd      9.38680                  osd.0    up   1.00000  1.00000
>  1   hdd      9.38680                  osd.1    up   1.00000  1.00000
>  2   hdd      9.38680                  osd.2    up   1.00000  1.00000
>  3   hdd      9.38680                  osd.3    up   1.00000  1.00000
>  4   hdd      9.38680                  osd.4    up   1.00000  1.00000
>  5   hdd      9.38680                  osd.5    up   1.00000  1.00000
>  6   hdd      9.38680                  osd.6    up   1.00000  1.00000
>  7   hdd      9.38680                  osd.7    up   1.00000  1.00000
>  8   hdd      9.38680                  osd.8    up   1.00000  1.00000
>  9   nvme     5.82190                  osd.9    up   1.00000  1.00000
> 10   nvme     5.82190                  osd.10   up   1.00000  1.00000
> -10          96.12495              host ceph02
> 11   hdd      9.38680                  osd.11   up   1.00000  1.00000
> 12   hdd      9.38680                  osd.12   up   1.00000  1.00000
> 13   hdd      9.38680                  osd.13   up   1.00000  1.00000
> 14   hdd      9.38680                  osd.14   up   1.00000  1.00000
> 15   hdd      9.38680                  osd.15   up   1.00000  1.00000
> 16   hdd      9.38680                  osd.16   up   1.00000  1.00000
> 17   hdd      9.38680                  osd.17   up   1.00000  1.00000
> 18   hdd      9.38680                  osd.18   up   1.00000  1.00000
> 19   hdd      9.38680                  osd.19   up   1.00000  1.00000
> 20   nvme     5.82190                  osd.20   up   1.00000  1.00000
> 21   nvme     5.82190                  osd.21   up   1.00000  1.00000
> -3           96.12495              host ceph03
> 22   hdd      9.38680                  osd.22   up   1.00000  1.00000
> 23   hdd      9.38680                  osd.23   up   1.00000  1.00000
> 24   hdd      9.38680                  osd.24   up   1.00000  1.00000
> 25   hdd      9.38680                  osd.25   up   1.00000  1.00000
> 26   hdd      9.38680                  osd.26   up   1.00000  1.00000
> 27   hdd      9.38680                  osd.27   up   1.00000  1.00000
> 28   hdd      9.38680                  osd.28   up   1.00000  1.00000
> 29   hdd      9.38680                  osd.29   up   1.00000  1.00000
> 30   hdd      9.38680                  osd.30   up   1.00000  1.00000
> 31   nvme     5.82190                  osd.31   up   1.00000  1.00000
> 32   nvme     5.82190                  osd.32   up   1.00000  1.00000
>
> ceph df:
>
> --- RAW STORAGE ---
> CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
> hdd    253 TiB  241 TiB  13 TiB    13 TiB       5.00
> nvme    35 TiB   35 TiB  82 GiB    82 GiB       0.23
> TOTAL  288 TiB  276 TiB  13 TiB    13 TiB       4.42
>
> --- POOLS ---
> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> images                 12  256   24 GiB    3.15k   73 GiB   0.03     76 TiB
> volumes                13  256  839 GiB  232.16k  2.5 TiB   1.07     76 TiB
> backups                14  256   31 GiB    8.56k   94 GiB   0.04     76 TiB
> vms                    15  256  752 GiB  198.80k  2.2 TiB   0.96     76 TiB
> device_health_metrics  16   32   35 MiB       39  106 MiB      0     76 TiB
> volumes-nvme           17  256   28 GiB    7.21k   81 GiB   0.24     11 TiB
> ec-volumes-meta        18  256   27 KiB        4   92 KiB      0     76 TiB
> ec-volumes-data        19  256    8 KiB        1   12 KiB      0    152 TiB
>
> Please disregard the ec-pools, as they're not currently in use. All other
> pools are configured with min_size=2, size=3. All pools are bound to HDD
> storage except for 'volumes-nvme', which is bound to NVME. The number of PGs
> was increased recently, as with the autoscaler I was getting a very uneven
> PG distribution on devices, and we're expecting to add 3 more nodes of
> exactly the same configuration in the coming weeks. I have to emphasize that
> I tested different PG numbers and they didn't have a noticeable impact on
> the cluster performance.
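Before tuning anything cluster-wide, I would also baseline a few individual OSDs, so you know what a single HDD OSD can do through BlueStore with no client or network in the path. A rough sketch (osd.0 and osd.9 are just example IDs from your tree; the 4 KiB run is deliberately tiny because the default osd_bench_* limits cap small-block benchmarks):

  # defaults: writes 1 GiB in 4 MiB blocks directly inside the OSD
  ceph tell osd.0 bench
  # ~12 MB in 4 KiB blocks, to get a per-OSD small-write IOPS figure
  ceph tell osd.0 bench 12000000 4096
  # one of the NVMe OSDs, for comparison
  ceph tell osd.9 bench

If a single HDD OSD only manages a few hundred 4 KiB writes per second, then a queue-depth-1 client can never exceed roughly 1/latency IOPS no matter how big the cluster is, which is worth keeping in mind for the single-thread fio numbers further down.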
> The main issue is that this beautiful cluster isn't very fast. When I test
> against the 'volumes' pool, residing on the HDD storage class (HDDs with
> DB/WAL on NVME), I get unexpectedly low throughput numbers:
>
> > rados -p volumes bench 30 write --no-cleanup
> ...
> Total time run:         30.3078
> Total writes made:      3731
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     492.415
> Stddev Bandwidth:       161.777
> Max bandwidth (MB/sec): 820
> Min bandwidth (MB/sec): 204
> Average IOPS:           123
> Stddev IOPS:            40.4442
> Max IOPS:               205
> Min IOPS:               51
> Average Latency(s):     0.129115
> Stddev Latency(s):      0.143881
> Max latency(s):         1.35669
> Min latency(s):         0.0228179
>
> > rados -p volumes bench 30 seq --no-cleanup
> ...
> Total time run:       14.7272
> Total reads made:     3731
> Read size:            4194304
> Object size:          4194304
> Bandwidth (MB/sec):   1013.36
> Average IOPS:         253
> Stddev IOPS:          63.8709
> Max IOPS:             323
> Min IOPS:             91
> Average Latency(s):   0.0625202
> Max latency(s):       0.551629
> Min latency(s):       0.010683
>
> On average, I get around 550 MB/s writes and 800 MB/s reads with 16 threads
> and 4 MB blocks. The numbers don't look fantastic for this hardware. I can
> actually push over 8 GB/s of throughput with fio, 16 threads and 4 MB
> blocks, from an RBD client (a KVM Linux VM) connected over a low-latency 40G
> network, probably hitting some OSD caches there:
>
>    READ: bw=8525MiB/s (8939MB/s), 58.8MiB/s-1009MiB/s (61.7MB/s-1058MB/s),
>          io=501GiB (538GB), run=60001-60153msec
> Disk stats (read/write):
>   vdc: ios=48163/0, merge=6027/0, ticks=1400509/0, in_queue=1305092,
>        util=99.48%
>
> The issue manifests when the same client does something closer to real-life
> usage, like a single-thread write or read with 4 KB blocks, as if using, for
> example, an ext4 file system:
>
> > fio --name=ttt --ioengine=posixaio --rw=write --bs=4k --numjobs=1
>       --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
> ...
> Run status group 0 (all jobs):
>   WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s),
>          io=7694MiB (8067MB), run=64079-64079msec
> Disk stats (read/write):
>   vdc: ios=0/6985, merge=0/406, ticks=0/3062535, in_queue=3048216,
>        util=77.31%
>
> > fio --name=ttt --ioengine=posixaio --rw=read --bs=4k --numjobs=1
>       --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
> ...
> Run status group 0 (all jobs):
>    READ: bw=54.0MiB/s (56.7MB/s), 54.0MiB/s-54.0MiB/s (56.7MB/s-56.7MB/s),
>          io=3242MiB (3399MB), run=60001-60001msec
> Disk stats (read/write):
>   vdc: ios=12952/3, merge=0/1, ticks=81706/1, in_queue=56336, util=99.13%
>
> And this is a total disaster: the IOPS look decent, but the bandwidth is
> unexpectedly low. I just don't understand why a single RBD client writes at
> only 120 MB/s (sometimes slower), and the ~50 MB/s reads look like a bad
> joke ¯\_(ツ)_/¯
>
> When I run these benchmarks, nothing seems to be overloaded: CPU and network
> are barely utilized, and OSD latencies don't show anything unusual. Thus I
> am puzzled by these results, as in my opinion SAS HDDs with DB/WAL on NVME
> drives should produce better I/O bandwidth, both for writes and reads. I
> mean, I can easily get much better performance from a single HDD shared over
> the network via NFS or iSCSI.
>
> I am open to suggestions and would very much appreciate comments and/or
> advice on how to improve the cluster performance.
>
> Best regards,
> Zakhar

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx