Are you sure your ssdpool really contains only SSDs and not some HDDs as well? In past versions of Ceph you had to modify the CRUSH rules to separate the ssd and hdd device classes; it could be that this is no longer necessary in Pacific.
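To check, something along these lines should work on Pacific (the rule name "ssd-only" is just an example, and "default"/"host" assume the usual CRUSH root and failure domain; adjust them to your setup):

  # Show the device class Ceph assigned to each OSD (CLASS column);
  # the KC2500s should show up as ssd (or nvme), the Barracudas as hdd:
  ceph osd tree

  # Show which CRUSH rule the pool uses and whether it filters on a device class:
  ceph osd pool get ssdpool crush_rule
  ceph osd crush rule dump

  # If it does not, create a rule restricted to ssd-class OSDs and assign it:
  ceph osd crush rule create-replicated ssd-only default host ssd
  ceph osd pool set ssdpool crush_rule ssd-only

If the current rule is not class-restricted, PGs of the ssdpool can land on the Barracudas as well, and those 7200 rpm spinners alone could explain throughput in the 80 MB/s range.
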
> -----Original Message-----
> From: Schmid, Michael <m.schmid@xxxxxxxxxxxxxxxxxxx>
> Sent: 29 April 2021 15:52
> To: ceph-users@xxxxxxx
> Subject: Performance questions - 4 node (commodity) cluster - what to expect (and what not ;-)
>
> Hello folks,
>
> I am new to Ceph and at the moment I am doing some performance tests
> with a 4-node Ceph cluster (Pacific, 16.2.1).
>
> Node hardware (4 identical nodes):
>
> * DELL 3620 workstation
> * Intel Quad-Core i7-6700 @ 3.4 GHz
> * 8 GB RAM
> * Debian Buster (base system, installed on a dedicated Patriot Burst
>   120 GB SATA SSD)
> * HP 530SFP+ 10 GBit dual-port NIC (tested with iperf at 9.4 GBit/s
>   from node to node)
> * 1 x Kingston KC2500 M.2 NVMe PCIe SSD (500 GB, NO power loss
>   protection!)
> * 3 x Seagate Barracuda SATA disk drives (7200 rpm, 500 GB)
>
> After bootstrapping a containerized (Docker) Ceph cluster, I did some
> performance tests on the NVMe storage by creating a storage pool called
> „ssdpool“, consisting of 4 OSDs per NVMe device (one NVMe device per
> node). A first write-performance test yields:
>
> =============
> root@ceph1:~# rados bench -p ssdpool 10 write -b 4M -t 16 --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> 4194304 for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_ceph1_78
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -           0
>     1      16        30        14    55.997        56    0.0209977    0.493427
>     2      16        53        37   73.9903        92    0.0264305    0.692179
>     3      16        76        60   79.9871        92     0.559505    0.664204
>     4      16        99        83   82.9879        92     0.609332    0.721016
>     5      16       116       100   79.9889        68     0.686093    0.698084
>     6      16       132       116   77.3224        64      1.19715    0.731808
>     7      16       153       137   78.2741        84     0.622646    0.755812
>     8      16       171       155    77.486        72      0.25409    0.764022
>     9      16       192       176   78.2076        84     0.968321    0.775292
>    10      16       214       198   79.1856        88     0.401339    0.766764
>    11       1       214       213   77.4408        60     0.969693    0.784002
> Total time run:         11.0698
> Total writes made:      214
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     77.3272
> Stddev Bandwidth:       13.7722
> Max bandwidth (MB/sec): 92
> Min bandwidth (MB/sec): 56
> Average IOPS:           19
> Stddev IOPS:            3.44304
> Max IOPS:               23
> Min IOPS:               14
> Average Latency(s):     0.785372
> Stddev Latency(s):      0.49011
> Max latency(s):         2.16532
> Min latency(s):         0.0144995
> =============
>
> ... and I think that 80 MB/s throughput is a very poor result in
> conjunction with NVMe devices and 10 GBit NICs.
>
> A bare write test (with the fsync=0 option) of the NVMe drives yields a
> write throughput of roughly 800 MB/s per device; the second test
> (with fsync=1) drops performance to 200 MB/s.
>
> =============
> root@ceph1:/home/mschmid# fio --rw=randwrite --name=IOPS-write --bs=1024k --direct=1 --filename=/dev/nvme0n1 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --group_reporting --runtime=30 --time_based --fsync=0
> IOPS-write: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
> ...
> fio-3.12
> Starting 4 processes
> Jobs: 4 (f=4): [w(4)][100.0%][w=723MiB/s][w=722 IOPS][eta 00m:00s]
> IOPS-write: (groupid=0, jobs=4): err= 0: pid=31585: Thu Apr 29 15:15:03 2021
>   write: IOPS=740, BW=740MiB/s (776MB/s)(21.8GiB/30206msec); 0 zone resets
>     slat (usec): min=16, max=810, avg=106.48, stdev=30.48
>     clat (msec): min=7, max=1110, avg=172.09, stdev=120.18
>      lat (msec): min=7, max=1110, avg=172.19, stdev=120.18
>     clat percentiles (msec):
>      |  1.00th=[   32],  5.00th=[   48], 10.00th=[   53], 20.00th=[   63],
>      | 30.00th=[  115], 40.00th=[  161], 50.00th=[  169], 60.00th=[  178],
>      | 70.00th=[  190], 80.00th=[  220], 90.00th=[  264], 95.00th=[  368],
>      | 99.00th=[  667], 99.50th=[  751], 99.90th=[  894], 99.95th=[  986],
>      | 99.99th=[ 1036]
>    bw (  KiB/s): min=22528, max=639744, per=25.02%, avg=189649.94, stdev=113845.69, samples=240
>    iops        : min=   22, max=  624, avg=185.11, stdev=111.18, samples=240
>   lat (msec)   : 10=0.01%, 20=0.19%, 50=6.43%, 100=20.29%, 250=61.52%
>   lat (msec)   : 500=8.21%, 750=2.85%, 1000=0.47%
>   cpu          : usr=11.87%, sys=2.05%, ctx=13141, majf=0, minf=45
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%, 32=99.4%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,22359,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>   WRITE: bw=740MiB/s (776MB/s), 740MiB/s-740MiB/s (776MB/s-776MB/s), io=21.8GiB (23.4GB), run=30206-30206msec
>
> Disk stats (read/write):
>   nvme0n1: ios=0/89150, merge=0/0, ticks=0/15065724, in_queue=15118720, util=99.75%
> =============
>
> Furthermore, an IOPS test on the NVMe device with a 4k block size shows
> roughly 1000 IOPS with fsync=1 and 35000 IOPS with fsync=0.
>
> My question: since CPU and network load seem to be low during my tests,
> I would like to know which bottleneck can cause such a huge performance
> drop between the bare hardware performance of the NVMe drives and the
> write speeds in the rados benchmark. Could the missing power loss
> protection (fsync=1) be the problem, or what throughput should one
> expect to be normal in such a setup?
>
> Thanks for any advice!
>
> Best regards,
> Michael
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx