Re: CEPH 16.2.x: disappointing I/O performance

Hi Marc,

Many thanks for your comment! As I mentioned, the rados bench results are more
or less acceptable and explainable. What is not acceptable is RBD clients
writing at ~120 MB/s tops (regardless of the number of threads or block size,
btw) and reading ~50 MB/s in a single thread (I managed to read over 500 MB/s
using 16 threads). Literally every storage device in my setup can read and
write at least 200 MB/s sequentially, so I'm trying to find an explanation for
this behavior.
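
To be concrete, the comparison I'm describing can be reproduced with something
along these lines (the parameters here are illustrative, not my literal
invocations):

# single job, queue depth 1 -- this is where I see ~50 MB/s reads
fio --name=read1 --ioengine=posixaio --rw=read --bs=4k --numjobs=1 --iodepth=1 \
    --size=4g --runtime=60 --time_based

# 16 parallel jobs -- this is where reads go past 500 MB/s
fio --name=read16 --ioengine=posixaio --rw=read --bs=4k --numjobs=16 --iodepth=16 \
    --size=4g --runtime=60 --time_based --group_reporting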

Zakhar

On Tue, 5 Oct 2021, 18:44 Marc, <Marc@xxxxxxxxxxxxxxxxx> wrote:

> You are aware of this:
> https://yourcmc.ru/wiki/Ceph_performance
>
> I am getting these results with SSDs and a 2.2 GHz Xeon, with no CPU
> state/frequency/governor optimization, so your results with HDDs look quite
> OK to me.
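>
> (For completeness, the sort of host tuning I have not applied would be
> something like the following; purely illustrative:)
>
> cpupower frequency-set -g performance   # pin the performance governor
> cpupower idle-set -D 0                  # disable deep C-states
> tuned-adm profile latency-performance   # or use a tuned profile instead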
>
>
> [@c01 ~]# rados -p rbd.ssd bench 30 write
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> 4194304 for up to 30 seconds or 0 objects
> Object prefix: benchmark_data_c01_2752661
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>     0       0         0         0         0         0           -           0
>     1      16       162       146   583.839       584   0.0807733    0.106959
>     2      16       347       331   661.868       740    0.052621   0.0943461
>     3      16       525       509   678.552       712   0.0493101   0.0934826
>     4      16       676       660   659.897       604    0.107205   0.0958496
> ...
>
> Total time run:         30.0622
> Total writes made:      4454
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     592.638
> Stddev Bandwidth:       65.0681
> Max bandwidth (MB/sec): 740
> Min bandwidth (MB/sec): 440
> Average IOPS:           148
> Stddev IOPS:            16.267
> Max IOPS:               185
> Min IOPS:               110
> Average Latency(s):     0.107988
> Stddev Latency(s):      0.0610883
> Max latency(s):         0.452039
> Min latency(s):         0.0209312
> Cleaning up (deleting benchmark objects)
> Removed 4454 objects
> Clean up completed and total clean up time :0.732456
>
> > Subject:  CEPH 16.2.x: disappointing I/O performance
> >
> > Hi,
> >
> > I built a CEPH 16.2.x cluster with relatively fast and modern hardware,
> > and
> > its performance is kind of disappointing. I would very much appreciate any
> > advice and/or pointers :-)
> >
> > The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
> >
> > 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> > 384 GB RAM
> > 2 x boot drives
> > 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
> > 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
> > 9 x Toshiba MG06SCA10TE 10TB HDDs, write cache off (storage tier)
> > 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
> >
> > All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel; AppArmor
> > and energy-saving features are disabled. The network between the CEPH nodes
> > is 40G, the CEPH access network is 40G, and the average latencies are
> > < 0.15 ms. I've personally tested the network for throughput, latency and
> > loss, and can tell that it operates as expected and doesn't exhibit any
> > issues at idle or under load.
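> >
> > (The network checks were done with standard tools, roughly along these lines;
> > the exact options are from memory:)
> >
> > iperf3 -c ceph02 -P 4 -t 30     # inter-node throughput
> > ping -c 1000 -i 0.01 ceph02     # latency and packet loss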
> >
> > The CEPH cluster is set up with 2 storage classes, NVME and HDD. The 2
> > smaller NVME drives in each node are used for DB/WAL, with each HDD OSD
> > allocated a portion of them.
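> >
> > (For reference, the OSDs were created with their DB/WAL on the NVME devices,
> > conceptually like the ceph-volume batch call below; this is not my literal
> > command and the device paths are illustrative:)
> >
> > ceph-volume lvm batch --bluestore \
> >     /dev/sd{c,d,e,f,g,h,i,j,k} \
> >     --db-devices /dev/nvme0n1 /dev/nvme1n1
> >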
> > ceph osd tree output:
> >
> > ID   CLASS  WEIGHT     TYPE NAME                STATUS  REWEIGHT  PRI-AFF
> >  -1         288.37488  root default
> > -13         288.37488      datacenter ste
> > -14         288.37488          rack rack01
> >  -7          96.12495              host ceph01
> >   0    hdd    9.38680                  osd.0        up   1.00000  1.00000
> >   1    hdd    9.38680                  osd.1        up   1.00000  1.00000
> >   2    hdd    9.38680                  osd.2        up   1.00000  1.00000
> >   3    hdd    9.38680                  osd.3        up   1.00000  1.00000
> >   4    hdd    9.38680                  osd.4        up   1.00000  1.00000
> >   5    hdd    9.38680                  osd.5        up   1.00000  1.00000
> >   6    hdd    9.38680                  osd.6        up   1.00000  1.00000
> >   7    hdd    9.38680                  osd.7        up   1.00000  1.00000
> >   8    hdd    9.38680                  osd.8        up   1.00000  1.00000
> >   9   nvme    5.82190                  osd.9        up   1.00000  1.00000
> >  10   nvme    5.82190                  osd.10       up   1.00000  1.00000
> > -10          96.12495              host ceph02
> >  11    hdd    9.38680                  osd.11       up   1.00000  1.00000
> >  12    hdd    9.38680                  osd.12       up   1.00000  1.00000
> >  13    hdd    9.38680                  osd.13       up   1.00000  1.00000
> >  14    hdd    9.38680                  osd.14       up   1.00000  1.00000
> >  15    hdd    9.38680                  osd.15       up   1.00000  1.00000
> >  16    hdd    9.38680                  osd.16       up   1.00000  1.00000
> >  17    hdd    9.38680                  osd.17       up   1.00000  1.00000
> >  18    hdd    9.38680                  osd.18       up   1.00000  1.00000
> >  19    hdd    9.38680                  osd.19       up   1.00000  1.00000
> >  20   nvme    5.82190                  osd.20       up   1.00000  1.00000
> >  21   nvme    5.82190                  osd.21       up   1.00000  1.00000
> >  -3          96.12495              host ceph03
> >  22    hdd    9.38680                  osd.22       up   1.00000  1.00000
> >  23    hdd    9.38680                  osd.23       up   1.00000  1.00000
> >  24    hdd    9.38680                  osd.24       up   1.00000  1.00000
> >  25    hdd    9.38680                  osd.25       up   1.00000  1.00000
> >  26    hdd    9.38680                  osd.26       up   1.00000  1.00000
> >  27    hdd    9.38680                  osd.27       up   1.00000  1.00000
> >  28    hdd    9.38680                  osd.28       up   1.00000  1.00000
> >  29    hdd    9.38680                  osd.29       up   1.00000  1.00000
> >  30    hdd    9.38680                  osd.30       up   1.00000  1.00000
> >  31   nvme    5.82190                  osd.31       up   1.00000  1.00000
> >  32   nvme    5.82190                  osd.32       up   1.00000  1.00000
> >
> > ceph df:
> >
> > --- RAW STORAGE ---
> > CLASS     SIZE    AVAIL    USED  RAW USED  %RAW USED
> > hdd    253 TiB  241 TiB  13 TiB    13 TiB       5.00
> > nvme    35 TiB   35 TiB  82 GiB    82 GiB       0.23
> > TOTAL  288 TiB  276 TiB  13 TiB    13 TiB       4.42
> >
> > --- POOLS ---
> > POOL                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
> > images                 12  256   24 GiB    3.15k   73 GiB   0.03     76 TiB
> > volumes                13  256  839 GiB  232.16k  2.5 TiB   1.07     76 TiB
> > backups                14  256   31 GiB    8.56k   94 GiB   0.04     76 TiB
> > vms                    15  256  752 GiB  198.80k  2.2 TiB   0.96     76 TiB
> > device_health_metrics  16   32   35 MiB       39  106 MiB      0     76 TiB
> > volumes-nvme           17  256   28 GiB    7.21k   81 GiB   0.24     11 TiB
> > ec-volumes-meta        18  256   27 KiB        4   92 KiB      0     76 TiB
> > ec-volumes-data        19  256    8 KiB        1   12 KiB      0    152 TiB
> >
> > Please disregard the ec-pools, as they're not currently in use. All other
> > pools are configured with min_size=2, size=3. All pools are bound to HDD
> > storage except for 'volumes-nvme', which is bound to NVME. The number of PGs
> > was increased recently, both because the autoscaler was giving me a very
> > uneven PG distribution across devices and because we're expecting to add 3
> > more nodes of exactly the same configuration in the coming weeks. I have to
> > emphasize that I tested different PG counts and they didn't have a noticeable
> > impact on cluster performance.
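> >
> > (The device-class binding mentioned above was done with the usual CRUSH
> > device-class rules, roughly as below; the rule names are illustrative:)
> >
> > ceph osd crush rule create-replicated replicated_hdd  default host hdd
> > ceph osd crush rule create-replicated replicated_nvme default host nvme
> > ceph osd pool set volumes crush_rule replicated_hdd
> > ceph osd pool set volumes-nvme crush_rule replicated_nvme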
> >
> > The main issue is that this beautiful cluster isn't very fast. When I
> > test
> > against the 'volumes' pool, residing on HDD storage class (HDDs with
> > DB/WAL
> > on NVME), I get unexpectedly low throughput numbers:
> >
> > > rados -p volumes bench 30 write --no-cleanup
> > ...
> > Total time run:         30.3078
> > Total writes made:      3731
> > Write size:             4194304
> > Object size:            4194304
> > Bandwidth (MB/sec):     492.415
> > Stddev Bandwidth:       161.777
> > Max bandwidth (MB/sec): 820
> > Min bandwidth (MB/sec): 204
> > Average IOPS:           123
> > Stddev IOPS:            40.4442
> > Max IOPS:               205
> > Min IOPS:               51
> > Average Latency(s):     0.129115
> > Stddev Latency(s):      0.143881
> > Max latency(s):         1.35669
> > Min latency(s):         0.0228179
> >
> > > rados -p volumes bench 30 seq --no-cleanup
> > ...
> > Total time run:       14.7272
> > Total reads made:     3731
> > Read size:            4194304
> > Object size:          4194304
> > Bandwidth (MB/sec):   1013.36
> > Average IOPS:         253
> > Stddev IOPS:          63.8709
> > Max IOPS:             323
> > Min IOPS:             91
> > Average Latency(s):   0.0625202
> > Max latency(s):       0.551629
> > Min latency(s):       0.010683
> >
> > On average, I get around 550 MB/s writes and 800 MB/s reads with 16 threads
> > and 4MB blocks. These numbers don't look fantastic for this hardware. With
> > fio, 16 threads and 4MB blocks from an RBD client (a KVM Linux VM) connected
> > over a low-latency 40G network, I can actually push over 8 GB/s of
> > throughput, probably hitting some OSD caches there:
> >
> >    READ: bw=8525MiB/s (8939MB/s), 58.8MiB/s-1009MiB/s (61.7MB/s-1058MB/s), io=501GiB (538GB), run=60001-60153msec
> > Disk stats (read/write):
> >   vdc: ios=48163/0, merge=6027/0, ticks=1400509/0, in_queue=1305092, util=99.48%
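> >
> > (That figure came from a multi-threaded large-block run, something along the
> > lines of the command below; the exact options are from memory:)
> >
> > fio --name=bigread --ioengine=posixaio --rw=read --bs=4M --numjobs=16 \
> >     --iodepth=16 --size=4g --runtime=60 --time_based --group_reporting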
> >
> > The issue manifests when the same client does something closer to real-life
> > usage, such as a single-threaded write or read with 4KB blocks, the way an
> > ext4 file system would, for example:
> >
> > > fio --name=ttt --ioengine=posixaio --rw=write --bs=4k --numjobs=1
> > --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
> > ...
> > Run status group 0 (all jobs):
> >   WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=7694MiB (8067MB), run=64079-64079msec
> > Disk stats (read/write):
> >   vdc: ios=0/6985, merge=0/406, ticks=0/3062535, in_queue=3048216, util=77.31%
> >
> > > fio --name=ttt --ioengine=posixaio --rw=read --bs=4k --numjobs=1
> > --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
> > ...
> > Run status group 0 (all jobs):
> >    READ: bw=54.0MiB/s (56.7MB/s), 54.0MiB/s-54.0MiB/s (56.7MB/s-56.7MB/s), io=3242MiB (3399MB), run=60001-60001msec
> > Disk stats (read/write):
> >   vdc: ios=12952/3, merge=0/1, ticks=81706/1, in_queue=56336, util=99.13%
> >
> > And this is a total disaster: the IOPS look decent, but the bandwidth is
> > unexpectedly low. I just don't understand why a single RBD client writes at
> > only 120 MB/s (sometimes slower), and 50 MB/s reads look like a bad joke
> > ¯\_(ツ)_/¯
> >
> > When I run these benchmarks, nothing seems to be overloaded: CPU and network
> > are barely utilized, and OSD latencies don't show anything unusual. Thus I am
> > puzzled by these results, as in my opinion SAS HDDs with DB/WAL on NVME
> > drives should produce better I/O bandwidth for both writes and reads. I mean,
> > I can easily get much better performance from a single HDD shared over the
> > network via NFS or iSCSI.
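> >
> > (To narrow this down I can also run more targeted tests against the OSDs and
> > RADOS directly, e.g. the following; the image name is just a placeholder:)
> >
> > ceph tell osd.0 bench                                   # raw OSD write bench
> > rbd bench --io-type write --io-size 4K --io-threads 1 volumes/test-image
> > rbd bench --io-type read  --io-size 4K --io-threads 1 volumes/test-image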
> >
> > I am open to suggestions and would very much appreciate comments and/or
> > advice on how to improve the cluster's performance.
> >
> > Best regards,
> > Zakhar
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



