Hi,

Thanks for this! As I mentioned in my original message, the latency is
rather low, under 0.15 ms RTT. HDD write caches are disabled (I disabled
them when setting the cluster up and verified just now with sdparm).

/Z

On Wed, Oct 6, 2021 at 9:18 AM Christian Wuerdig <
christian.wuerdig@xxxxxxxxx> wrote:

> Hm, generally Ceph is mostly latency-sensitive, which translates more
> into IOPS limits than into bandwidth limits. In a single-threaded write
> scenario your throughput is limited by the latency of the write path,
> which is generally network + OSD write path + disk. People have managed
> to get write latencies under 1 ms on all-flash setups, but around 0.8 ms
> seems to be the best you can achieve, which puts an upper limit of
> roughly 1,200 IOPS on a single-threaded client doing direct synchronized
> IO. But there shouldn't really be much in the path that artificially
> limits bandwidth.
>
> BlueStore does deferred writes only for small writes - those are the
> writes that hit the WAL; writes larger than the threshold go to the
> backing store (i.e. the HDD) directly. I think the default is 32KB but I
> could be wrong. Obviously, even for small writes the WAL eventually has
> to be flushed, so your longer-term performance is still bound by HDD
> speed. That might be why throughput suffers at larger block sizes - they
> hit the drives directly.
>
> It's been pointed out in the past that disabling the HDD write cache can
> actually improve latency quite substantially (e.g.
> https://ceph-users.ceph.narkive.com/UU9QMu9W/disabling-write-cache-on-sata-hdds-reduces-write-latency-7-times)
> - might be worth a try.
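>
> In case it helps, here's a rough sketch of how both of those things
> could be checked on one of the OSD nodes (with /dev/sdX standing in for
> one of the HDDs; the BlueStore option name below is the Pacific-era one,
> so double-check it against your version):
>
>     # volatile write-cache state of the HDD (WCE 0 means the cache is off)
>     sdparm --get=WCE /dev/sdX
>     # the same information via hdparm
>     hdparm -W /dev/sdX
>
>     # deferred-write threshold BlueStore applies to HDD-backed OSDs
>     ceph config get osd bluestore_prefer_deferred_size_hdd
>     # or, for a running daemon:
>     ceph config show osd.0 bluestore_prefer_deferred_size_hdd
>
> Writes at or below that threshold should land in the NVMe WAL first,
> while larger writes go more or less straight to the HDDs.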
>
> On Wed, 6 Oct 2021 at 10:07, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
>
>> I'm not sure, fio might be showing some bogus values in the summary,
>> I'll check the readings again tomorrow.
>>
>> Another thing I noticed is that writes seem bandwidth-limited and don't
>> scale well with block size and/or number of threads, i.e. one client
>> writes at about the same speed regardless of the benchmark settings. A
>> person on reddit, where I asked this question as well, suggested that
>> in a replicated pool writes and reads are handled by the primary PG,
>> which would explain this write bandwidth limit.
>>
>> /Z
>>
>> On Tue, 5 Oct 2021, 22:31 Christian Wuerdig, <christian.wuerdig@xxxxxxxxx>
>> wrote:
>>
>>> Maybe some info is missing, but 7k write IOPS at a 4k block size seems
>>> fairly decent (as you also state) - the bandwidth follows directly
>>> from that, so I'm not sure what you're expecting.
>>> I am a bit puzzled, though - by my math 7k IOPS at 4k should only be
>>> about 27 MiB/s, so I'm not sure how the 120 MiB/s was achieved.
>>> The read benchmark seems in line: 13k IOPS at 4k makes around 52 MiB/s
>>> of bandwidth, which again is expected.
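>>>
>>> To spell the arithmetic out: 7,000 IOPS x 4 KiB is roughly 27 MiB/s,
>>> and at a queue depth of 1 the achievable IOPS is roughly 1 / (per-op
>>> latency in seconds). If you want to measure that per-op latency from a
>>> client, something along these lines should do it (a sketch only -
>>> /dev/vdX and the runtime are placeholders, and it writes to the
>>> device, so point it at a scratch RBD volume):
>>>
>>>     fio --name=lat-test --filename=/dev/vdX --ioengine=libaio \
>>>         --direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 \
>>>         --iodepth=1 --runtime=60 --time_based
>>>
>>> The "clat" figures in the output are the interesting part: mean
>>> completion latency in milliseconds times the measured IOPS should come
>>> out near 1,000.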
>>>
>>> On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko <zakhar@xxxxxxxxx>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I built a CEPH 16.2.x cluster with relatively fast and modern
>>>> hardware, and its performance is kind of disappointing. I would very
>>>> much appreciate advice and/or pointers :-)
>>>>
>>>> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
>>>>
>>>> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
>>>> 384 GB RAM
>>>> 2 x boot drives
>>>> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
>>>> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
>>>> 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
>>>> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
>>>>
>>>> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel;
>>>> apparmor is disabled and energy-saving features are disabled. The
>>>> network between the CEPH nodes is 40G, the CEPH access network is
>>>> 40G, and the average latencies are < 0.15 ms. I've personally tested
>>>> the network for throughput, latency and loss, and can tell that it's
>>>> operating as expected and doesn't exhibit any issues at idle or under
>>>> load.
>>>>
>>>> The CEPH cluster is set up with 2 storage classes, NVME and HDD, with
>>>> the 2 smaller NVME drives in each node used as DB/WAL and each HDD
>>>> allocated a DB/WAL partition on one of them. ceph osd tree output:
>>>>
>>>>  ID  CLASS  WEIGHT     TYPE NAME              STATUS  REWEIGHT  PRI-AFF
>>>>  -1         288.37488  root default
>>>> -13         288.37488      datacenter ste
>>>> -14         288.37488          rack rack01
>>>>  -7          96.12495              host ceph01
>>>>   0    hdd    9.38680                  osd.0      up   1.00000  1.00000
>>>>   1    hdd    9.38680                  osd.1      up   1.00000  1.00000
>>>>   2    hdd    9.38680                  osd.2      up   1.00000  1.00000
>>>>   3    hdd    9.38680                  osd.3      up   1.00000  1.00000
>>>>   4    hdd    9.38680                  osd.4      up   1.00000  1.00000
>>>>   5    hdd    9.38680                  osd.5      up   1.00000  1.00000
>>>>   6    hdd    9.38680                  osd.6      up   1.00000  1.00000
>>>>   7    hdd    9.38680                  osd.7      up   1.00000  1.00000
>>>>   8    hdd    9.38680                  osd.8      up   1.00000  1.00000
>>>>   9   nvme    5.82190                  osd.9      up   1.00000  1.00000
>>>>  10   nvme    5.82190                  osd.10     up   1.00000  1.00000
>>>> -10          96.12495              host ceph02
>>>>  11    hdd    9.38680                  osd.11     up   1.00000  1.00000
>>>>  12    hdd    9.38680                  osd.12     up   1.00000  1.00000
>>>>  13    hdd    9.38680                  osd.13     up   1.00000  1.00000
>>>>  14    hdd    9.38680                  osd.14     up   1.00000  1.00000
>>>>  15    hdd    9.38680                  osd.15     up   1.00000  1.00000
>>>>  16    hdd    9.38680                  osd.16     up   1.00000  1.00000
>>>>  17    hdd    9.38680                  osd.17     up   1.00000  1.00000
>>>>  18    hdd    9.38680                  osd.18     up   1.00000  1.00000
>>>>  19    hdd    9.38680                  osd.19     up   1.00000  1.00000
>>>>  20   nvme    5.82190                  osd.20     up   1.00000  1.00000
>>>>  21   nvme    5.82190                  osd.21     up   1.00000  1.00000
>>>>  -3          96.12495              host ceph03
>>>>  22    hdd    9.38680                  osd.22     up   1.00000  1.00000
>>>>  23    hdd    9.38680                  osd.23     up   1.00000  1.00000
>>>>  24    hdd    9.38680                  osd.24     up   1.00000  1.00000
>>>>  25    hdd    9.38680                  osd.25     up   1.00000  1.00000
>>>>  26    hdd    9.38680                  osd.26     up   1.00000  1.00000
>>>>  27    hdd    9.38680                  osd.27     up   1.00000  1.00000
>>>>  28    hdd    9.38680                  osd.28     up   1.00000  1.00000
>>>>  29    hdd    9.38680                  osd.29     up   1.00000  1.00000
>>>>  30    hdd    9.38680                  osd.30     up   1.00000  1.00000
>>>>  31   nvme    5.82190                  osd.31     up   1.00000  1.00000
>>>>  32   nvme    5.82190                  osd.32     up   1.00000  1.00000
>>>>
>>>> ceph df:
>>>>
>>>> --- RAW STORAGE ---
>>>> CLASS    SIZE     AVAIL    USED    RAW USED  %RAW USED
>>>> hdd      253 TiB  241 TiB  13 TiB    13 TiB       5.00
>>>> nvme      35 TiB   35 TiB  82 GiB    82 GiB       0.23
>>>> TOTAL    288 TiB  276 TiB  13 TiB    13 TiB       4.42
>>>>
>>>> --- POOLS ---
>>>> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
>>>> images                 12  256   24 GiB    3.15k   73 GiB   0.03     76 TiB
>>>> volumes                13  256  839 GiB  232.16k  2.5 TiB   1.07     76 TiB
>>>> backups                14  256   31 GiB    8.56k   94 GiB   0.04     76 TiB
>>>> vms                    15  256  752 GiB  198.80k  2.2 TiB   0.96     76 TiB
>>>> device_health_metrics  16   32   35 MiB       39  106 MiB      0     76 TiB
>>>> volumes-nvme           17  256   28 GiB    7.21k   81 GiB   0.24     11 TiB
>>>> ec-volumes-meta        18  256   27 KiB        4   92 KiB      0     76 TiB
>>>> ec-volumes-data        19  256    8 KiB        1   12 KiB      0    152 TiB
>>>>
>>>> Please disregard the ec-pools, as they're not currently in use. All
>>>> other pools are configured with min_size=2, size=3. All pools are
>>>> bound to HDD storage except for 'volumes-nvme', which is bound to
>>>> NVME. The number of PGs was increased recently, as with the
>>>> autoscaler I was getting a very uneven PG distribution across
>>>> devices, and we're expecting to add 3 more nodes of exactly the same
>>>> configuration in the coming weeks. I have to emphasize that I tested
>>>> different PG numbers and they didn't have a noticeable impact on the
>>>> cluster performance.
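>>>>
>>>> For what it's worth, the per-OSD PG spread is easy to sanity-check
>>>> from the PGS column (and the %USE / VAR columns, for data balance) of:
>>>>
>>>>     ceph osd df tree
>>>>
>>>> If the PG counts are roughly even across the HDD OSDs, uneven PG
>>>> placement probably isn't the bottleneck here.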
>>>>
>>>> The main issue is that this beautiful cluster isn't very fast. When I
>>>> test against the 'volumes' pool, residing on the HDD storage class
>>>> (HDDs with DB/WAL on NVME), I get unexpectedly low throughput numbers:
>>>>
>>>> > rados -p volumes bench 30 write --no-cleanup
>>>> ...
>>>> Total time run:         30.3078
>>>> Total writes made:      3731
>>>> Write size:             4194304
>>>> Object size:            4194304
>>>> Bandwidth (MB/sec):     492.415
>>>> Stddev Bandwidth:       161.777
>>>> Max bandwidth (MB/sec): 820
>>>> Min bandwidth (MB/sec): 204
>>>> Average IOPS:           123
>>>> Stddev IOPS:            40.4442
>>>> Max IOPS:               205
>>>> Min IOPS:               51
>>>> Average Latency(s):     0.129115
>>>> Stddev Latency(s):      0.143881
>>>> Max latency(s):         1.35669
>>>> Min latency(s):         0.0228179
>>>>
>>>> > rados -p volumes bench 30 seq --no-cleanup
>>>> ...
>>>> Total time run:       14.7272
>>>> Total reads made:     3731
>>>> Read size:            4194304
>>>> Object size:          4194304
>>>> Bandwidth (MB/sec):   1013.36
>>>> Average IOPS:         253
>>>> Stddev IOPS:          63.8709
>>>> Max IOPS:             323
>>>> Min IOPS:             91
>>>> Average Latency(s):   0.0625202
>>>> Max latency(s):       0.551629
>>>> Min latency(s):       0.010683
>>>>
>>>> On average, I get around 550 MB/s writes and 800 MB/s reads with 16
>>>> threads and 4MB blocks. The numbers don't look fantastic for this
>>>> hardware. I can actually push over 8 GB/s of throughput with fio, 16
>>>> threads and 4MB blocks from an RBD client (a KVM Linux VM) connected
>>>> over a low-latency 40G network, probably hitting some OSD caches
>>>> there:
>>>>
>>>>   READ: bw=8525MiB/s (8939MB/s), 58.8MiB/s-1009MiB/s (61.7MB/s-1058MB/s),
>>>> io=501GiB (538GB), run=60001-60153msec
>>>> Disk stats (read/write):
>>>>   vdc: ios=48163/0, merge=6027/0, ticks=1400509/0, in_queue=1305092,
>>>> util=99.48%
>>>>
>>>> The issue manifests when the same client does something closer to
>>>> real-life usage, such as a single-threaded write or read with 4KB
>>>> blocks, as if using, for example, an ext4 file system:
>>>>
>>>> > fio --name=ttt --ioengine=posixaio --rw=write --bs=4k --numjobs=1
>>>> --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
>>>> ...
>>>> Run status group 0 (all jobs):
>>>>   WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s),
>>>> io=7694MiB (8067MB), run=64079-64079msec
>>>> Disk stats (read/write):
>>>>   vdc: ios=0/6985, merge=0/406, ticks=0/3062535, in_queue=3048216,
>>>> util=77.31%
>>>>
>>>> > fio --name=ttt --ioengine=posixaio --rw=read --bs=4k --numjobs=1
>>>> --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
>>>> ...
>>>> Run status group 0 (all jobs):
>>>>   READ: bw=54.0MiB/s (56.7MB/s), 54.0MiB/s-54.0MiB/s (56.7MB/s-56.7MB/s),
>>>> io=3242MiB (3399MB), run=60001-60001msec
>>>> Disk stats (read/write):
>>>>   vdc: ios=12952/3, merge=0/1, ticks=81706/1, in_queue=56336,
>>>> util=99.13%
>>>>
>>>> And this is a total disaster: the IOPS look decent, but the bandwidth
>>>> is unexpectedly very low. I just don't understand why a single RBD
>>>> client writes at 120 MB/s (sometimes slower), and the 50 MB/s reads
>>>> look like a bad joke ¯\_(ツ)_/¯
>>>>
>>>> When I run these benchmarks, nothing seems to be overloaded: CPU and
>>>> network are barely utilized, and OSD latencies don't show anything
>>>> unusual. Thus I am puzzled by these results, as in my opinion SAS
>>>> HDDs with DB/WAL on NVME drives should produce better I/O bandwidth,
>>>> both for writes and reads. I mean, I can easily get much better
>>>> performance from a single HDD shared over the network via NFS or
>>>> iSCSI.
>>>>
>>>> I am open to suggestions and would very much appreciate comments
>>>> and/or advice on how to improve the cluster performance.
>>>>
>>>> Best regards,
>>>> Zakhar
>>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx