Can you collect the output of this command on all 4 servers while your test is running:

  iostat -mtxy 1

This should show how busy the CPUs are as well as how busy each drive is.

On Thu, Apr 29, 2021 at 7:52 AM Schmid, Michael <m.schmid@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Hello folks,
>
> I am new to Ceph and at the moment I am doing some performance tests with a 4-node Ceph cluster (Pacific, 16.2.1).
>
> Node hardware (4 identical nodes):
>
> * DELL 3620 workstation
> * Intel Quad-Core i7-6700 @ 3.4 GHz
> * 8 GB RAM
> * Debian Buster (base system, installed on a dedicated Patriot Burst 120 GB SATA SSD)
> * HP 530SFP+ 10 GBit dual-port NIC (tested with iperf at 9.4 GBit/s from node to node)
> * 1 x Kingston KC2500 M.2 NVMe PCIe SSD (500 GB, NO power loss protection!)
> * 3 x Seagate Barracuda SATA disk drives (7200 rpm, 500 GB)
>
> After bootstrapping a containerized (Docker) Ceph cluster, I did some performance tests on the NVMe storage by creating a storage pool called "ssdpool", consisting of 4 OSDs on the (single) NVMe device of each node. A first write-performance test yields:
>
> =============
> root@ceph1:~# rados bench -p ssdpool 10 write -b 4M -t 16 --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_ceph1_78
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -           0
>     1      16        30        14    55.997        56    0.0209977    0.493427
>     2      16        53        37   73.9903        92    0.0264305    0.692179
>     3      16        76        60   79.9871        92     0.559505    0.664204
>     4      16        99        83   82.9879        92     0.609332    0.721016
>     5      16       116       100   79.9889        68     0.686093    0.698084
>     6      16       132       116   77.3224        64      1.19715    0.731808
>     7      16       153       137   78.2741        84     0.622646    0.755812
>     8      16       171       155    77.486        72      0.25409    0.764022
>     9      16       192       176   78.2076        84     0.968321    0.775292
>    10      16       214       198   79.1856        88     0.401339    0.766764
>    11       1       214       213   77.4408        60     0.969693    0.784002
> Total time run:         11.0698
> Total writes made:      214
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     77.3272
> Stddev Bandwidth:       13.7722
> Max bandwidth (MB/sec): 92
> Min bandwidth (MB/sec): 56
> Average IOPS:           19
> Stddev IOPS:            3.44304
> Max IOPS:               23
> Min IOPS:               14
> Average Latency(s):     0.785372
> Stddev Latency(s):      0.49011
> Max latency(s):         2.16532
> Min latency(s):         0.0144995
> =============
>
> ... and I think that 80 MB/s of throughput is a very poor result for NVMe devices and 10 GBit NICs.
>
> A raw write test of the NVMe drives (with fsync=0) yields a write throughput of roughly 800 MB/s per device; a second test with fsync=1 drops that to about 200 MB/s.
>
> =============
> root@ceph1:/home/mschmid# fio --rw=randwrite --name=IOPS-write --bs=1024k --direct=1 --filename=/dev/nvme0n1 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --group_reporting --runtime=30 --time_based --fsync=0
> IOPS-write: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
> ...
> fio-3.12
> Starting 4 processes
> Jobs: 4 (f=4): [w(4)][100.0%][w=723MiB/s][w=722 IOPS][eta 00m:00s]
> IOPS-write: (groupid=0, jobs=4): err= 0: pid=31585: Thu Apr 29 15:15:03 2021
>   write: IOPS=740, BW=740MiB/s (776MB/s)(21.8GiB/30206msec); 0 zone resets
>     slat (usec): min=16, max=810, avg=106.48, stdev=30.48
>     clat (msec): min=7, max=1110, avg=172.09, stdev=120.18
>      lat (msec): min=7, max=1110, avg=172.19, stdev=120.18
>     clat percentiles (msec):
>      |  1.00th=[   32],  5.00th=[   48], 10.00th=[   53], 20.00th=[   63],
>      | 30.00th=[  115], 40.00th=[  161], 50.00th=[  169], 60.00th=[  178],
>      | 70.00th=[  190], 80.00th=[  220], 90.00th=[  264], 95.00th=[  368],
>      | 99.00th=[  667], 99.50th=[  751], 99.90th=[  894], 99.95th=[  986],
>      | 99.99th=[ 1036]
>    bw (  KiB/s): min=22528, max=639744, per=25.02%, avg=189649.94, stdev=113845.69, samples=240
>    iops        : min=   22, max=  624, avg=185.11, stdev=111.18, samples=240
>   lat (msec)   : 10=0.01%, 20=0.19%, 50=6.43%, 100=20.29%, 250=61.52%
>   lat (msec)   : 500=8.21%, 750=2.85%, 1000=0.47%
>   cpu          : usr=11.87%, sys=2.05%, ctx=13141, majf=0, minf=45
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%, 32=99.4%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,22359,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>   WRITE: bw=740MiB/s (776MB/s), 740MiB/s-740MiB/s (776MB/s-776MB/s), io=21.8GiB (23.4GB), run=30206-30206msec
>
> Disk stats (read/write):
>   nvme0n1: ios=0/89150, merge=0/0, ticks=0/15065724, in_queue=15118720, util=99.75%
> =============
>
> Furthermore, an IOPS test on the NVMe device with a 4k block size shows roughly 1000 IOPS with fsync=1 and 35000 IOPS with fsync=0.
>
> To my question: since CPU and network load seem to be low during my tests, I would like to know which bottleneck can cause such a huge performance drop between the raw hardware performance of the NVMe drives and the write speeds in the rados benchmark. Could the missing power loss protection (fsync=1) be the problem, or what throughput should one expect to be normal in such a setup?
>
> Thanks for any advice!
>
> Best regards,
> Michael
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
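For what it's worth, one way to collect that iostat output from all four nodes at the same time is a small ssh loop along these lines (a sketch only; it assumes passwordless ssh, the hostnames ceph1 through ceph4, and that the sysstat package is installed on every node):

  # start iostat on every node in the background while the rados bench runs,
  # writing one log file per host; stop after 60 one-second samples
  for h in ceph1 ceph2 ceph3 ceph4; do
      ssh "$h" "iostat -mtxy 1 60" > "iostat-$h.log" 2>&1 &
  done
  wait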
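On the fsync question: the numbers that tend to matter for the OSDs are from a queue-depth-1, 4k sync write test rather than the 32-deep 1M test quoted above, since that is closer to the pattern BlueStore produces when it commits its WAL. A minimal sketch of such a test (the device path and runtime are placeholders, and it writes destructively to the raw device):

  # single-threaded 4k writes with an fsync after every write; destructive on /dev/nvme0n1
  fio --name=sync-write --filename=/dev/nvme0n1 --rw=write --bs=4k \
      --ioengine=libaio --iodepth=1 --numjobs=1 --direct=1 --fsync=1 \
      --runtime=30 --time_based --group_reporting

Consumer NVMe drives without power loss protection typically manage only on the order of the ~1000 fsync IOPS mentioned above, because every write has to be flushed to the flash; drives with power loss protection can acknowledge sync writes from their protected cache and usually do far better here.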