Are you sure your ssdpool really contains only SSDs and not some HDDs as well? In past versions of Ceph you had to modify the CRUSH rules to separate the ssd and hdd device classes; it could be that this is no longer necessary in Pacific.
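To check, something along these lines should work on Pacific (the rule name "ssd-only" is just an example, and "default"/"host" assume the usual CRUSH root and failure domain; adjust them to your setup):

  # Show the device class Ceph assigned to each OSD (CLASS column);
  # the KC2500s should show up as ssd (or nvme), the Barracudas as hdd:
  ceph osd tree

  # Show which CRUSH rule the pool uses and whether it filters on a device class:
  ceph osd pool get ssdpool crush_rule
  ceph osd crush rule dump

  # If it does not, create a rule restricted to ssd-class OSDs and assign it:
  ceph osd crush rule create-replicated ssd-only default host ssd
  ceph osd pool set ssdpool crush_rule ssd-only

If the current rule is not class-restricted, PGs of the ssdpool can land on the Barracudas as well, and those 7200 rpm spinners alone could explain throughput in the 80 MB/s range.
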
> -----Original Message-----
> From: Schmid, Michael <m.schmid@xxxxxxxxxxxxxxxxxxx>
> Sent: 29 April 2021 15:52
> To: ceph-users@xxxxxxx
> Subject: Performance questions - 4 node (commodity) cluster - what to expect (and what not ;-)
>
> Hello folks,
>
> I am new to Ceph and at the moment I am doing some performance tests
> with a 4-node Ceph cluster (Pacific, 16.2.1).
>
> Node hardware (4 identical nodes):
>
> * DELL 3620 workstation
> * Intel Quad-Core i7-6700 @ 3.4 GHz
> * 8 GB RAM
> * Debian Buster (base system, installed on a dedicated Patriot Burst
>   120 GB SATA SSD)
> * HP 530SFP+ 10 GBit dual-port NIC (tested with iperf at 9.4 GBit/s
>   from node to node)
> * 1 x Kingston KC2500 M.2 NVMe PCIe SSD (500 GB, NO power loss
>   protection!)
> * 3 x Seagate Barracuda SATA disk drives (7200 rpm, 500 GB)
>
> After bootstrapping a containerized (Docker) Ceph cluster, I did some
> performance tests on the NVMe storage by creating a storage pool called
> „ssdpool“, consisting of 4 OSDs per NVMe device (one NVMe device per
> node). A first write-performance test yields:
>
> =============
> root@ceph1:~# rados bench -p ssdpool 10 write -b 4M -t 16 --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> 4194304 for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_ceph1_78
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -           0
>     1      16        30        14    55.997        56    0.0209977    0.493427
>     2      16        53        37   73.9903        92    0.0264305    0.692179
>     3      16        76        60   79.9871        92     0.559505    0.664204
>     4      16        99        83   82.9879        92     0.609332    0.721016
>     5      16       116       100   79.9889        68     0.686093    0.698084
>     6      16       132       116   77.3224        64      1.19715    0.731808
>     7      16       153       137   78.2741        84     0.622646    0.755812
>     8      16       171       155    77.486        72      0.25409    0.764022
>     9      16       192       176   78.2076        84     0.968321    0.775292
>    10      16       214       198   79.1856        88     0.401339    0.766764
>    11       1       214       213   77.4408        60     0.969693    0.784002
> Total time run:         11.0698
> Total writes made:      214
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     77.3272
> Stddev Bandwidth:       13.7722
> Max bandwidth (MB/sec): 92
> Min bandwidth (MB/sec): 56
> Average IOPS:           19
> Stddev IOPS:            3.44304
> Max IOPS:               23
> Min IOPS:               14
> Average Latency(s):     0.785372
> Stddev Latency(s):      0.49011
> Max latency(s):         2.16532
> Min latency(s):         0.0144995
> =============
>
> ... and I think that 80 MB/s throughput is a very poor result in
> conjunction with NVMe devices and 10 GBit NICs.
>
> A bare write test (with the fsync=0 option) of the NVMe drives yields a
> write throughput of roughly 800 MB/s per device; the second test
> (with fsync=1) drops performance to 200 MB/s.
>
> =============
> root@ceph1:/home/mschmid# fio --rw=randwrite --name=IOPS-write --bs=1024k --direct=1 --filename=/dev/nvme0n1 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --group_reporting --runtime=30 --time_based --fsync=0
> IOPS-write: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
> ...
> fio-3.12
> Starting 4 processes
> Jobs: 4 (f=4): [w(4)][100.0%][w=723MiB/s][w=722 IOPS][eta 00m:00s]
> IOPS-write: (groupid=0, jobs=4): err= 0: pid=31585: Thu Apr 29 15:15:03 2021
>   write: IOPS=740, BW=740MiB/s (776MB/s)(21.8GiB/30206msec); 0 zone resets
>     slat (usec): min=16, max=810, avg=106.48, stdev=30.48
>     clat (msec): min=7, max=1110, avg=172.09, stdev=120.18
>      lat (msec): min=7, max=1110, avg=172.19, stdev=120.18
>     clat percentiles (msec):
>      |  1.00th=[   32],  5.00th=[   48], 10.00th=[   53], 20.00th=[   63],
>      | 30.00th=[  115], 40.00th=[  161], 50.00th=[  169], 60.00th=[  178],
>      | 70.00th=[  190], 80.00th=[  220], 90.00th=[  264], 95.00th=[  368],
>      | 99.00th=[  667], 99.50th=[  751], 99.90th=[  894], 99.95th=[  986],
>      | 99.99th=[ 1036]
>    bw (  KiB/s): min=22528, max=639744, per=25.02%, avg=189649.94, stdev=113845.69, samples=240
>    iops        : min=   22, max=  624, avg=185.11, stdev=111.18, samples=240
>   lat (msec)   : 10=0.01%, 20=0.19%, 50=6.43%, 100=20.29%, 250=61.52%
>   lat (msec)   : 500=8.21%, 750=2.85%, 1000=0.47%
>   cpu          : usr=11.87%, sys=2.05%, ctx=13141, majf=0, minf=45
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%, 32=99.4%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,22359,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>   WRITE: bw=740MiB/s (776MB/s), 740MiB/s-740MiB/s (776MB/s-776MB/s), io=21.8GiB (23.4GB), run=30206-30206msec
>
> Disk stats (read/write):
>   nvme0n1: ios=0/89150, merge=0/0, ticks=0/15065724, in_queue=15118720, util=99.75%
> =============
>
> Furthermore, an IOPS test on the NVMe device with a 4k block size shows
> roughly 1000 IOPS with fsync=1 and 35000 IOPS with fsync=0.
>
> My question: since CPU and network load seem to be low during my tests,
> I would like to know which bottleneck can cause such a huge performance
> drop between the bare hardware performance of the NVMe drives and the
> write speeds in the rados benchmark. Could the missing power loss
> protection (fsync=1) be the problem, or what throughput should one
> expect to be normal in such a setup?
>
> Thanks for any advice!
>
> Best regards,
> Michael
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx