Got it. I don't have any specific throttling set up for RBD-backed storage. I also previously tested several different backends and found that virtio consistently produced better performance than virtio-scsi in different scenarios, thus my VMs run virtio.

/Z
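For reference, the kind of QEMU/libvirt throttling Anthony suspects below is configured per disk via an <iotune> element in the libvirt domain XML, so it can be ruled out from the hypervisor side. A quick check might look like this (the domain name 'vm01' and target 'vdc' are placeholders):

# list the block devices attached to the guest
virsh domblklist vm01

# any per-disk limits would show up in an <iotune> block under the <disk> element
virsh dumpxml vm01 | grep -A 10 '<iotune>'

# print the live throttle settings for one target; all zeroes means no limits
virsh blkdeviotune vm01 vdc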
On Wed, Oct 6, 2021 at 7:10 AM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:

> To be clear, I’m suspecting explicit throttling as described here:
>
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-blockio-techniques
>
> not impact from virtualization as such, though depending on the versions of software involved, the device emulation chosen can make a big difference, e.g. virtio-scsi vs virtio-blk vs IDE.
>
> If one has Prometheus / Grafana set up to track throughput and IOPS per volume / attachment / VM, or enables the client-side admin socket, that sort of throttling can be visually very apparent.
>
> > On Oct 5, 2021, at 8:35 PM, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >
> > Hi!
> >
> > The clients are KVM VMs, so there's QEMU/libvirt impact for sure. I will test with a bare-metal client and see whether it performs much better.
> >
> > /Z
> >
> > On Wed, 6 Oct 2021, 01:29 Anthony D'Atri, <anthony.datri@xxxxxxxxx> wrote:
> >
> >> The lead PG handling ops isn’t a factor; with RBD your volumes touch dozens / hundreds of PGs. But QD=1 and small block sizes are going to limit your throughput.
> >>
> >> What are your clients? Are they bare metal? Are they VMs? If they’re VMs, do you have QEMU/libvirt throttling in play? I see that a lot.
> >>
> >>> On Oct 5, 2021, at 2:06 PM, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >>>
> >>> I'm not sure; fio might be showing some bogus values in the summary. I'll check the readings again tomorrow.
> >>>
> >>> Another thing I noticed is that writes seem bandwidth-limited and don't scale well with block size and/or number of threads, i.e. one client writes at about the same speed regardless of the benchmark settings. A person on reddit, where I asked this question as well, suggested that in a replicated pool writes and reads are handled by the primary PG, which would explain this write bandwidth limit.
> >>>
> >>> /Z
> >>>
> >>> On Tue, 5 Oct 2021, 22:31 Christian Wuerdig, <christian.wuerdig@xxxxxxxxx> wrote:
> >>>
> >>>> Maybe some info is missing, but 7k write IOPS at a 4k block size seems fairly decent (as you also state), and the bandwidth automatically follows from that, so I'm not sure what you're expecting.
> >>>> I am a bit puzzled, though: by my math, 7k IOPS at 4k should only be about 27 MiB/sec, so I'm not sure how the 120 MiB/sec was achieved.
> >>>> The read benchmark seems in line: 13k IOPS at 4k makes around 52 MiB/sec of bandwidth, which again is expected.
> >>>>
> >>>> On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I built a Ceph 16.2.x cluster with relatively fast and modern hardware, and its performance is kind of disappointing. I would very much appreciate any advice and/or pointers :-)
> >>>>>
> >>>>> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
> >>>>>
> >>>>> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> >>>>> 384 GB RAM
> >>>>> 2 x boot drives
> >>>>> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
> >>>>> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
> >>>>> 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
> >>>>> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
> >>>>>
> >>>>> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel; AppArmor and energy-saving features are disabled. The network between the Ceph nodes is 40G, the Ceph access network is 40G, and average latencies are < 0.15 ms. I've personally tested the network for throughput, latency and loss, and can tell that it's operating as expected and doesn't exhibit any issues at idle or under load.
> >>>>>
> >>>>> The Ceph cluster is set up with 2 storage classes, NVMe and HDD, with the 2 smaller NVMe drives in each node used as DB/WAL and each HDD allocated a DB/WAL slice on one of them. ceph osd tree output:
> >>>>>
> >>>>> ID   CLASS  WEIGHT     TYPE NAME               STATUS  REWEIGHT  PRI-AFF
> >>>>>  -1         288.37488  root default
> >>>>> -13         288.37488      datacenter ste
> >>>>> -14         288.37488          rack rack01
> >>>>>  -7          96.12495              host ceph01
> >>>>>   0    hdd    9.38680                  osd.0       up   1.00000  1.00000
> >>>>>   1    hdd    9.38680                  osd.1       up   1.00000  1.00000
> >>>>>   2    hdd    9.38680                  osd.2       up   1.00000  1.00000
> >>>>>   3    hdd    9.38680                  osd.3       up   1.00000  1.00000
> >>>>>   4    hdd    9.38680                  osd.4       up   1.00000  1.00000
> >>>>>   5    hdd    9.38680                  osd.5       up   1.00000  1.00000
> >>>>>   6    hdd    9.38680                  osd.6       up   1.00000  1.00000
> >>>>>   7    hdd    9.38680                  osd.7       up   1.00000  1.00000
> >>>>>   8    hdd    9.38680                  osd.8       up   1.00000  1.00000
> >>>>>   9    nvme   5.82190                  osd.9       up   1.00000  1.00000
> >>>>>  10    nvme   5.82190                  osd.10      up   1.00000  1.00000
> >>>>> -10          96.12495              host ceph02
> >>>>>  11    hdd    9.38680                  osd.11      up   1.00000  1.00000
> >>>>>  12    hdd    9.38680                  osd.12      up   1.00000  1.00000
> >>>>>  13    hdd    9.38680                  osd.13      up   1.00000  1.00000
> >>>>>  14    hdd    9.38680                  osd.14      up   1.00000  1.00000
> >>>>>  15    hdd    9.38680                  osd.15      up   1.00000  1.00000
> >>>>>  16    hdd    9.38680                  osd.16      up   1.00000  1.00000
> >>>>>  17    hdd    9.38680                  osd.17      up   1.00000  1.00000
> >>>>>  18    hdd    9.38680                  osd.18      up   1.00000  1.00000
> >>>>>  19    hdd    9.38680                  osd.19      up   1.00000  1.00000
> >>>>>  20    nvme   5.82190                  osd.20      up   1.00000  1.00000
> >>>>>  21    nvme   5.82190                  osd.21      up   1.00000  1.00000
> >>>>>  -3          96.12495              host ceph03
> >>>>>  22    hdd    9.38680                  osd.22      up   1.00000  1.00000
> >>>>>  23    hdd    9.38680                  osd.23      up   1.00000  1.00000
> >>>>>  24    hdd    9.38680                  osd.24      up   1.00000  1.00000
> >>>>>  25    hdd    9.38680                  osd.25      up   1.00000  1.00000
> >>>>>  26    hdd    9.38680                  osd.26      up   1.00000  1.00000
> >>>>>  27    hdd    9.38680                  osd.27      up   1.00000  1.00000
> >>>>>  28    hdd    9.38680                  osd.28      up   1.00000  1.00000
> >>>>>  29    hdd    9.38680                  osd.29      up   1.00000  1.00000
> >>>>>  30    hdd    9.38680                  osd.30      up   1.00000  1.00000
> >>>>>  31    nvme   5.82190                  osd.31      up   1.00000  1.00000
> >>>>>  32    nvme   5.82190                  osd.32      up   1.00000  1.00000
> >>>>>
> >>>>> ceph df:
> >>>>>
> >>>>> --- RAW STORAGE ---
> >>>>> CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
> >>>>> hdd    253 TiB  241 TiB  13 TiB    13 TiB       5.00
> >>>>> nvme    35 TiB   35 TiB  82 GiB    82 GiB       0.23
> >>>>> TOTAL  288 TiB  276 TiB  13 TiB    13 TiB       4.42
> >>>>>
> >>>>> --- POOLS ---
> >>>>> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> >>>>> images                 12  256   24 GiB    3.15k   73 GiB   0.03     76 TiB
> >>>>> volumes                13  256  839 GiB  232.16k  2.5 TiB   1.07     76 TiB
> >>>>> backups                14  256   31 GiB    8.56k   94 GiB   0.04     76 TiB
> >>>>> vms                    15  256  752 GiB  198.80k  2.2 TiB   0.96     76 TiB
> >>>>> device_health_metrics  16   32   35 MiB       39  106 MiB      0     76 TiB
> >>>>> volumes-nvme           17  256   28 GiB    7.21k   81 GiB   0.24     11 TiB
> >>>>> ec-volumes-meta        18  256   27 KiB        4   92 KiB      0     76 TiB
> >>>>> ec-volumes-data        19  256    8 KiB        1   12 KiB      0    152 TiB
> >>>>>
> >>>>> Please disregard the ec-* pools, as they're not currently in use. All other pools are configured with min_size=2, size=3. All pools are bound to HDD storage except for 'volumes-nvme', which is bound to NVMe. The number of PGs was increased recently, as with the autoscaler I was getting a very uneven PG distribution across devices, and we're expecting to add 3 more nodes of exactly the same configuration in the coming weeks. I have to emphasize that I tested different PG numbers and they didn't have a noticeable impact on cluster performance.
> >>>>>
> >>>>> The main issue is that this beautiful cluster isn't very fast. When I test against the 'volumes' pool, residing on the HDD storage class (HDDs with DB/WAL on NVMe), I get unexpectedly low throughput numbers:
> >>>>>
> >>>>>> rados -p volumes bench 30 write --no-cleanup
> >>>>> ...
> >>>>> Total time run:         30.3078
> >>>>> Total writes made:      3731
> >>>>> Write size:             4194304
> >>>>> Object size:            4194304
> >>>>> Bandwidth (MB/sec):     492.415
> >>>>> Stddev Bandwidth:       161.777
> >>>>> Max bandwidth (MB/sec): 820
> >>>>> Min bandwidth (MB/sec): 204
> >>>>> Average IOPS:           123
> >>>>> Stddev IOPS:            40.4442
> >>>>> Max IOPS:               205
> >>>>> Min IOPS:               51
> >>>>> Average Latency(s):     0.129115
> >>>>> Stddev Latency(s):      0.143881
> >>>>> Max latency(s):         1.35669
> >>>>> Min latency(s):         0.0228179
> >>>>>
> >>>>>> rados -p volumes bench 30 seq --no-cleanup
> >>>>> ...
> >>>>> Total time run:       14.7272
> >>>>> Total reads made:     3731
> >>>>> Read size:            4194304
> >>>>> Object size:          4194304
> >>>>> Bandwidth (MB/sec):   1013.36
> >>>>> Average IOPS:         253
> >>>>> Stddev IOPS:          63.8709
> >>>>> Max IOPS:             323
> >>>>> Min IOPS:             91
> >>>>> Average Latency(s):   0.0625202
> >>>>> Max latency(s):       0.551629
> >>>>> Min latency(s):       0.010683
> >>>>>
> >>>>> On average, I get around 550 MB/s writes and 800 MB/s reads with 16 threads and 4MB blocks. The numbers don't look fantastic for this hardware. I can actually push over 8 GB/s of throughput with fio, 16 threads and 4MB blocks from an RBD client (a KVM Linux VM) connected over a low-latency 40G network, probably hitting some OSD caches there:
> >>>>>
> >>>>>   READ: bw=8525MiB/s (8939MB/s), 58.8MiB/s-1009MiB/s (61.7MB/s-1058MB/s), io=501GiB (538GB), run=60001-60153msec
> >>>>> Disk stats (read/write):
> >>>>>   vdc: ios=48163/0, merge=6027/0, ticks=1400509/0, in_queue=1305092, util=99.48%
> >>>>>
> >>>>> The issue manifests when the same client does something closer to real-life usage, like a single-threaded write or read with 4KB blocks, as if using, for example, an ext4 file system:
> >>>>>
> >>>>>> fio --name=ttt --ioengine=posixaio --rw=write --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
> >>>>> ...
> >>>>> Run status group 0 (all jobs):
> >>>>>   WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=7694MiB (8067MB), run=64079-64079msec
> >>>>> Disk stats (read/write):
> >>>>>   vdc: ios=0/6985, merge=0/406, ticks=0/3062535, in_queue=3048216, util=77.31%
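Anthony's QD=1 point above is worth quantifying before drawing conclusions from this test: at a queue depth of 1, every 4k write waits for the previous one to complete, so throughput is bounded by per-request latency rather than by what the cluster can stream. A minimal comparison run, switching to libaio with direct I/O so the deeper queue actually reaches the virtual disk (job name and iodepth are arbitrary choices, everything else mirrors the test above):

fio --name=ttt-qd16 --ioengine=libaio --direct=1 --rw=write --bs=4k --numjobs=1 --size=4g --iodepth=16 --runtime=60 --time_based --end_fsync=1

If bandwidth scales roughly with the deeper queue, the QD=1 number above reflects round-trip latency (client, QEMU/libvirt, network, OSD) rather than a cluster bandwidth limit.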
> >>>>>> fio --name=ttt --ioengine=posixaio --rw=read --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
> >>>>> ...
> >>>>> Run status group 0 (all jobs):
> >>>>>   READ: bw=54.0MiB/s (56.7MB/s), 54.0MiB/s-54.0MiB/s (56.7MB/s-56.7MB/s), io=3242MiB (3399MB), run=60001-60001msec
> >>>>> Disk stats (read/write):
> >>>>>   vdc: ios=12952/3, merge=0/1, ticks=81706/1, in_queue=56336, util=99.13%
> >>>>>
> >>>>> And this is a total disaster: the IOPS look decent, but the bandwidth is unexpectedly low. I just don't understand why a single RBD client writes at 120 MB/s (sometimes slower), and 50 MB/s reads look like a bad joke ¯\_(ツ)_/¯
> >>>>>
> >>>>> When I run these benchmarks, nothing seems to be overloaded: CPU and network are barely utilized, and OSD latencies don't show anything unusual. Thus I am puzzled by these results, as in my opinion SAS HDDs with DB/WAL on NVMe drives should produce better I/O bandwidth, both for writes and reads. I mean, I can easily get much better performance from a single HDD shared over the network via NFS or iSCSI.
> >>>>>
> >>>>> I am open to suggestions and would very much appreciate comments and/or advice on how to improve the cluster performance.
> >>>>>
> >>>>> Best regards,
> >>>>> Zakhar

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
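As a follow-up to the plan of testing from a bare-metal client: fio can also drive RBD directly through librbd (assuming an fio build with RBD support), which takes QEMU/libvirt and the guest block layer out of the picture entirely. A rough sketch against the 'volumes' pool from the thread; the image name 'fio-test' and client name 'admin' are placeholders, and the test image has to be created first:

# throwaway test image, size chosen arbitrarily
rbd create volumes/fio-test --size 10G

fio --name=rbd-4k-qd1-write --ioengine=rbd --clientname=admin --pool=volumes --rbdname=fio-test --rw=write --bs=4k --iodepth=1 --runtime=60 --time_based

Comparing this against the same 4k QD=1 run inside a VM should show how much of the single-client limit comes from the virtualization layer and how much from the cluster itself.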