I'm not sure; fio might be showing some bogus values in the summary, so
I'll check the readings again tomorrow. Another thing I noticed is that
writes seem bandwidth-limited and don't scale well with block size and/or
number of threads, i.e. one client writes at about the same speed
regardless of the benchmark settings. A person on reddit, where I asked
this question as well, suggested that in a replicated pool writes and
reads are handled by the PG's primary OSD, which would explain this write
bandwidth limit.
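
For what it's worth, below is roughly the sweep I plan to run to confirm
this - just a sketch that reuses the same fio options as the benchmarks
quoted further down; the sizes and runtimes are arbitrary placeholders,
and it assumes it's started from a directory on the RBD-backed filesystem:

# Sketch: sweep block size and job count to see whether single-client
# write bandwidth really stays flat regardless of the settings.
for bs in 4k 64k 1M 4M; do
  for jobs in 1 4 16; do
    fio --name=sweep-${bs}-${jobs}j --ioengine=posixaio --rw=write \
        --bs=${bs} --numjobs=${jobs} --size=1g --iodepth=1 \
        --runtime=30 --time_based --end_fsync=1
  done
done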

/Z

On Tue, 5 Oct 2021, 22:31 Christian Wuerdig, <christian.wuerdig@xxxxxxxxx>
wrote:

> Maybe some info is missing, but 7k write IOPS at 4k block size seems
> fairly decent (as you also state) - the bandwidth automatically follows
> from that, so I'm not sure what you're expecting?
> I am a bit puzzled, though - by my math, 7k IOPS at 4k should only be
> about 27 MiB/sec, so I'm not sure how the 120 MiB/sec was achieved.
> The read benchmark seems in line: 13k IOPS at 4k makes around 52 MiB/sec
> of bandwidth, which again is as expected.
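>
> To spell the arithmetic out (back-of-the-envelope only, and it assumes
> the reported IOPS figures are all raw 4 KiB requests):
>
> # bandwidth = IOPS x block size
> echo '7000 * 4 / 1024'  | bc -l   # ~27.3 MiB/s for 7k write IOPS at 4k
> echo '13000 * 4 / 1024' | bc -l   # ~50.8 MiB/s for 13k read IOPS at 4k
> # conversely, 120 MiB/s of 4k writes would require roughly:
> echo '120 * 1024 / 4'   | bc      # 30720 IOPS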
>
> On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
>
>> Hi,
>>
>> I built a CEPH 16.2.x cluster with relatively fast and modern hardware,
>> and its performance is kind of disappointing. I would very much
>> appreciate advice and/or pointers :-)
>>
>> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
>>
>> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
>> 384 GB RAM
>> 2 x boot drives
>> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
>> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
>> 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
>> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
>>
>> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
>> apparmor is disabled, and energy-saving features are disabled. The
>> network between the CEPH nodes is 40G, the CEPH access network is 40G,
>> and the average latencies are < 0.15 ms. I've personally tested the
>> network for throughput, latency and loss, and can tell that it's
>> operating as expected and doesn't exhibit any issues at idle or under
>> load.
>>
>> The CEPH cluster is set up with 2 storage classes, NVME and HDD, with
>> the 2 smaller NVME drives in each node used as DB/WAL and each HDD
>> allocated a DB/WAL slice on one of them.
>>
>> ceph osd tree output:
>>
>> ID   CLASS  WEIGHT     TYPE NAME              STATUS  REWEIGHT  PRI-AFF
>> -1          288.37488  root default
>> -13         288.37488    datacenter ste
>> -14         288.37488      rack rack01
>> -7           96.12495        host ceph01
>>  0   hdd      9.38680          osd.0              up   1.00000  1.00000
>>  1   hdd      9.38680          osd.1              up   1.00000  1.00000
>>  2   hdd      9.38680          osd.2              up   1.00000  1.00000
>>  3   hdd      9.38680          osd.3              up   1.00000  1.00000
>>  4   hdd      9.38680          osd.4              up   1.00000  1.00000
>>  5   hdd      9.38680          osd.5              up   1.00000  1.00000
>>  6   hdd      9.38680          osd.6              up   1.00000  1.00000
>>  7   hdd      9.38680          osd.7              up   1.00000  1.00000
>>  8   hdd      9.38680          osd.8              up   1.00000  1.00000
>>  9   nvme     5.82190          osd.9              up   1.00000  1.00000
>> 10   nvme     5.82190          osd.10             up   1.00000  1.00000
>> -10          96.12495        host ceph02
>> 11   hdd      9.38680          osd.11             up   1.00000  1.00000
>> 12   hdd      9.38680          osd.12             up   1.00000  1.00000
>> 13   hdd      9.38680          osd.13             up   1.00000  1.00000
>> 14   hdd      9.38680          osd.14             up   1.00000  1.00000
>> 15   hdd      9.38680          osd.15             up   1.00000  1.00000
>> 16   hdd      9.38680          osd.16             up   1.00000  1.00000
>> 17   hdd      9.38680          osd.17             up   1.00000  1.00000
>> 18   hdd      9.38680          osd.18             up   1.00000  1.00000
>> 19   hdd      9.38680          osd.19             up   1.00000  1.00000
>> 20   nvme     5.82190          osd.20             up   1.00000  1.00000
>> 21   nvme     5.82190          osd.21             up   1.00000  1.00000
>> -3           96.12495        host ceph03
>> 22   hdd      9.38680          osd.22             up   1.00000  1.00000
>> 23   hdd      9.38680          osd.23             up   1.00000  1.00000
>> 24   hdd      9.38680          osd.24             up   1.00000  1.00000
>> 25   hdd      9.38680          osd.25             up   1.00000  1.00000
>> 26   hdd      9.38680          osd.26             up   1.00000  1.00000
>> 27   hdd      9.38680          osd.27             up   1.00000  1.00000
>> 28   hdd      9.38680          osd.28             up   1.00000  1.00000
>> 29   hdd      9.38680          osd.29             up   1.00000  1.00000
>> 30   hdd      9.38680          osd.30             up   1.00000  1.00000
>> 31   nvme     5.82190          osd.31             up   1.00000  1.00000
>> 32   nvme     5.82190          osd.32             up   1.00000  1.00000
>>
>> ceph df:
>>
>> --- RAW STORAGE ---
>> CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
>> hdd    253 TiB  241 TiB  13 TiB    13 TiB       5.00
>> nvme    35 TiB   35 TiB  82 GiB    82 GiB       0.23
>> TOTAL  288 TiB  276 TiB  13 TiB    13 TiB       4.42
>>
>> --- POOLS ---
>> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
>> images                 12  256   24 GiB    3.15k   73 GiB   0.03     76 TiB
>> volumes                13  256  839 GiB  232.16k  2.5 TiB   1.07     76 TiB
>> backups                14  256   31 GiB    8.56k   94 GiB   0.04     76 TiB
>> vms                    15  256  752 GiB  198.80k  2.2 TiB   0.96     76 TiB
>> device_health_metrics  16   32   35 MiB       39  106 MiB      0     76 TiB
>> volumes-nvme           17  256   28 GiB    7.21k   81 GiB   0.24     11 TiB
>> ec-volumes-meta        18  256   27 KiB        4   92 KiB      0     76 TiB
>> ec-volumes-data        19  256    8 KiB        1   12 KiB      0    152 TiB
>>
>> Please disregard the ec-pools, as they're not currently in use. All
>> other pools are configured with min_size=2, size=3. All pools are bound
>> to HDD storage except for 'volumes-nvme', which is bound to NVME. The
>> number of PGs was increased recently, as the autoscaler was giving me a
>> very uneven PG distribution across devices, and we're expecting to add
>> 3 more nodes of exactly the same configuration in the coming weeks. I
>> have to emphasize that I tested different PG numbers and they didn't
>> have a noticeable impact on the cluster performance.
>>
>> The main issue is that this beautiful cluster isn't very fast. When I
>> test against the 'volumes' pool, residing on the HDD storage class
>> (HDDs with DB/WAL on NVME), I get unexpectedly low throughput numbers:
>>
>> > rados -p volumes bench 30 write --no-cleanup
>> ...
>> Total time run:         30.3078
>> Total writes made:      3731
>> Write size:             4194304
>> Object size:            4194304
>> Bandwidth (MB/sec):     492.415
>> Stddev Bandwidth:       161.777
>> Max bandwidth (MB/sec): 820
>> Min bandwidth (MB/sec): 204
>> Average IOPS:           123
>> Stddev IOPS:            40.4442
>> Max IOPS:               205
>> Min IOPS:               51
>> Average Latency(s):     0.129115
>> Stddev Latency(s):      0.143881
>> Max latency(s):         1.35669
>> Min latency(s):         0.0228179
>>
>> > rados -p volumes bench 30 seq --no-cleanup
>> ...
>> Total time run:       14.7272
>> Total reads made:     3731
>> Read size:            4194304
>> Object size:          4194304
>> Bandwidth (MB/sec):   1013.36
>> Average IOPS:         253
>> Stddev IOPS:          63.8709
>> Max IOPS:             323
>> Min IOPS:             91
>> Average Latency(s):   0.0625202
>> Max latency(s):       0.551629
>> Min latency(s):       0.010683
>>
>> On average, I get around 550 MB/s writes and 800 MB/s reads with 16
>> threads and 4MB blocks. The numbers don't look fantastic for this
>> hardware. I can actually push over 8 GB/s of throughput with fio, 16
>> threads and 4MB blocks from an RBD client (a KVM Linux VM) connected
>> over a low-latency 40G network, probably hitting some OSD caches there:
>>
>> READ: bw=8525MiB/s (8939MB/s), 58.8MiB/s-1009MiB/s (61.7MB/s-1058MB/s),
>> io=501GiB (538GB), run=60001-60153msec
>> Disk stats (read/write):
>>   vdc: ios=48163/0, merge=6027/0, ticks=1400509/0, in_queue=1305092,
>> util=99.48%
>>
>> The issue manifests when the same client does something closer to
>> real-life usage, like a single-thread write or read with 4KB blocks, as
>> if using, for example, an ext4 file system:
>>
>> > fio --name=ttt --ioengine=posixaio --rw=write --bs=4k --numjobs=1
>> --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
>> ...
>> Run status group 0 (all jobs):
>>   WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s),
>> io=7694MiB (8067MB), run=64079-64079msec
>> Disk stats (read/write):
>>   vdc: ios=0/6985, merge=0/406, ticks=0/3062535, in_queue=3048216,
>> util=77.31%
>>
>> > fio --name=ttt --ioengine=posixaio --rw=read --bs=4k --numjobs=1
>> --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
>> ...
>> Run status group 0 (all jobs):
>>   READ: bw=54.0MiB/s (56.7MB/s), 54.0MiB/s-54.0MiB/s (56.7MB/s-56.7MB/s),
>> io=3242MiB (3399MB), run=60001-60001msec
>> Disk stats (read/write):
>>   vdc: ios=12952/3, merge=0/1, ticks=81706/1, in_queue=56336, util=99.13%
>>
>> And this is a total disaster: the IOPS look decent, but the bandwidth
>> is unexpectedly very, very low. I just don't understand why a single
>> RBD client writes at only 120 MB/s (sometimes slower), and the 50 MB/s
>> reads look like a bad joke ¯\_(ツ)_/¯
>>
>> When I run these benchmarks, nothing seems to be overloaded: things
>> like CPU and network are barely utilized, and OSD latencies don't show
>> anything unusual. Thus I am puzzled by these results, as in my opinion
>> SAS HDDs with DB/WAL on NVME drives should produce better I/O
>> bandwidth, both for writes and reads. I mean, I can easily get much
>> better performance from a single HDD shared over the network via NFS
>> or iSCSI.
>>
>> I am open to suggestions and would very much appreciate comments and/or
>> advice on how to improve the cluster performance.
>>
>> Best regards,
>> Zakhar
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx