Got it. I don't have any specific throttling set up for RBD-backed storage. I also previously tested several different backends and found that virtio consistently produced better performance than virtio-scsi in different scenarios, thus my VMs run virtio.

/Z
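For reference, the kind of QEMU/libvirt throttling Anthony suspects below is configured per disk via an <iotune> element in the libvirt domain XML, so it can be ruled out from the hypervisor side. A quick check might look like this (the domain name 'vm01' and target 'vdc' are placeholders):

# list the block devices attached to the guest
virsh domblklist vm01

# any per-disk limits would show up in an <iotune> block under the <disk> element
virsh dumpxml vm01 | grep -A 10 '<iotune>'

# print the live throttle settings for one target; all zeroes means no limits
virsh blkdeviotune vm01 vdc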
On Wed, Oct 6, 2021 at 7:10 AM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:

> To be clear, I’m suspecting explicit throttling as described here:
>
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-blockio-techniques
>
> not impact from virtualization as such, though depending on the versions of software involved, the device emulation chosen can make a big difference, e.g. virtio-scsi vs virtio-blk vs IDE.
>
> If one has Prometheus / Grafana set up to track throughput and IOPS per volume / attachment / VM, or enables the client-side admin socket, that sort of throttling can be visually very apparent.
>
> > On Oct 5, 2021, at 8:35 PM, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >
> > Hi!
> >
> > The clients are KVM VMs, so there's QEMU/libvirt impact for sure. I will test with a bare-metal client and see whether it performs much better.
> >
> > /Z
> >
> > On Wed, 6 Oct 2021, 01:29 Anthony D'Atri, <anthony.datri@xxxxxxxxx> wrote:
> >
> >> The lead PG handling ops isn’t a factor; with RBD your volumes touch dozens / hundreds of PGs. But QD=1 and small block sizes are going to limit your throughput.
> >>
> >> What are your clients? Are they bare metal? Are they VMs? If they’re VMs, do you have QEMU/libvirt throttling in play? I see that a lot.
> >>
> >>> On Oct 5, 2021, at 2:06 PM, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >>>
> >>> I'm not sure; fio might be showing some bogus values in the summary. I'll check the readings again tomorrow.
> >>>
> >>> Another thing I noticed is that writes seem bandwidth-limited and don't scale well with block size and/or number of threads, i.e. one client writes at about the same speed regardless of the benchmark settings. A person on reddit, where I asked this question as well, suggested that in a replicated pool writes and reads are handled by the primary PG, which would explain this write bandwidth limit.
> >>>
> >>> /Z
> >>>
> >>> On Tue, 5 Oct 2021, 22:31 Christian Wuerdig, <christian.wuerdig@xxxxxxxxx> wrote:
> >>>
> >>>> Maybe some info is missing, but 7k write IOPS at a 4k block size seems fairly decent (as you also state), and the bandwidth automatically follows from that, so I'm not sure what you're expecting.
> >>>> I am a bit puzzled, though: by my math, 7k IOPS at 4k should only be about 27 MiB/sec, so I'm not sure how the 120 MiB/sec was achieved.
> >>>> The read benchmark seems in line: 13k IOPS at 4k makes around 52 MiB/sec of bandwidth, which again is expected.
> >>>>
> >>>> On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I built a Ceph 16.2.x cluster with relatively fast and modern hardware, and its performance is kind of disappointing. I would very much appreciate any advice and/or pointers :-)
> >>>>>
> >>>>> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
> >>>>>
> >>>>> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
> >>>>> 384 GB RAM
> >>>>> 2 x boot drives
> >>>>> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
> >>>>> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
> >>>>> 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
> >>>>> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
> >>>>>
> >>>>> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel; AppArmor and energy-saving features are disabled. The network between the Ceph nodes is 40G, the Ceph access network is 40G, and average latencies are < 0.15 ms. I've personally tested the network for throughput, latency and loss, and can tell that it's operating as expected and doesn't exhibit any issues at idle or under load.
> >>>>>
> >>>>> The Ceph cluster is set up with 2 storage classes, NVMe and HDD, with the 2 smaller NVMe drives in each node used as DB/WAL and each HDD allocated a DB/WAL slice on one of them. ceph osd tree output:
> >>>>>
> >>>>> ID   CLASS  WEIGHT     TYPE NAME               STATUS  REWEIGHT  PRI-AFF
> >>>>>  -1         288.37488  root default
> >>>>> -13         288.37488      datacenter ste
> >>>>> -14         288.37488          rack rack01
> >>>>>  -7          96.12495              host ceph01
> >>>>>   0    hdd    9.38680                  osd.0       up   1.00000  1.00000
> >>>>>   1    hdd    9.38680                  osd.1       up   1.00000  1.00000
> >>>>>   2    hdd    9.38680                  osd.2       up   1.00000  1.00000
> >>>>>   3    hdd    9.38680                  osd.3       up   1.00000  1.00000
> >>>>>   4    hdd    9.38680                  osd.4       up   1.00000  1.00000
> >>>>>   5    hdd    9.38680                  osd.5       up   1.00000  1.00000
> >>>>>   6    hdd    9.38680                  osd.6       up   1.00000  1.00000
> >>>>>   7    hdd    9.38680                  osd.7       up   1.00000  1.00000
> >>>>>   8    hdd    9.38680                  osd.8       up   1.00000  1.00000
> >>>>>   9    nvme   5.82190                  osd.9       up   1.00000  1.00000
> >>>>>  10    nvme   5.82190                  osd.10      up   1.00000  1.00000
> >>>>> -10          96.12495              host ceph02
> >>>>>  11    hdd    9.38680                  osd.11      up   1.00000  1.00000
> >>>>>  12    hdd    9.38680                  osd.12      up   1.00000  1.00000
> >>>>>  13    hdd    9.38680                  osd.13      up   1.00000  1.00000
> >>>>>  14    hdd    9.38680                  osd.14      up   1.00000  1.00000
> >>>>>  15    hdd    9.38680                  osd.15      up   1.00000  1.00000
> >>>>>  16    hdd    9.38680                  osd.16      up   1.00000  1.00000
> >>>>>  17    hdd    9.38680                  osd.17      up   1.00000  1.00000
> >>>>>  18    hdd    9.38680                  osd.18      up   1.00000  1.00000
> >>>>>  19    hdd    9.38680                  osd.19      up   1.00000  1.00000
> >>>>>  20    nvme   5.82190                  osd.20      up   1.00000  1.00000
> >>>>>  21    nvme   5.82190                  osd.21      up   1.00000  1.00000
> >>>>>  -3          96.12495              host ceph03
> >>>>>  22    hdd    9.38680                  osd.22      up   1.00000  1.00000
> >>>>>  23    hdd    9.38680                  osd.23      up   1.00000  1.00000
> >>>>>  24    hdd    9.38680                  osd.24      up   1.00000  1.00000
> >>>>>  25    hdd    9.38680                  osd.25      up   1.00000  1.00000
> >>>>>  26    hdd    9.38680                  osd.26      up   1.00000  1.00000
> >>>>>  27    hdd    9.38680                  osd.27      up   1.00000  1.00000
> >>>>>  28    hdd    9.38680                  osd.28      up   1.00000  1.00000
> >>>>>  29    hdd    9.38680                  osd.29      up   1.00000  1.00000
> >>>>>  30    hdd    9.38680                  osd.30      up   1.00000  1.00000
> >>>>>  31    nvme   5.82190                  osd.31      up   1.00000  1.00000
> >>>>>  32    nvme   5.82190                  osd.32      up   1.00000  1.00000
> >>>>>
> >>>>> ceph df:
> >>>>>
> >>>>> --- RAW STORAGE ---
> >>>>> CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
> >>>>> hdd    253 TiB  241 TiB  13 TiB    13 TiB       5.00
> >>>>> nvme    35 TiB   35 TiB  82 GiB    82 GiB       0.23
> >>>>> TOTAL  288 TiB  276 TiB  13 TiB    13 TiB       4.42
> >>>>>
> >>>>> --- POOLS ---
> >>>>> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> >>>>> images                 12  256   24 GiB    3.15k   73 GiB   0.03     76 TiB
> >>>>> volumes                13  256  839 GiB  232.16k  2.5 TiB   1.07     76 TiB
> >>>>> backups                14  256   31 GiB    8.56k   94 GiB   0.04     76 TiB
> >>>>> vms                    15  256  752 GiB  198.80k  2.2 TiB   0.96     76 TiB
> >>>>> device_health_metrics  16   32   35 MiB       39  106 MiB      0     76 TiB
> >>>>> volumes-nvme           17  256   28 GiB    7.21k   81 GiB   0.24     11 TiB
> >>>>> ec-volumes-meta        18  256   27 KiB        4   92 KiB      0     76 TiB
> >>>>> ec-volumes-data        19  256    8 KiB        1   12 KiB      0    152 TiB
> >>>>>
> >>>>> Please disregard the ec-* pools, as they're not currently in use. All other pools are configured with min_size=2, size=3. All pools are bound to HDD storage except for 'volumes-nvme', which is bound to NVMe. The number of PGs was increased recently, as with the autoscaler I was getting a very uneven PG distribution across devices, and we're expecting to add 3 more nodes of exactly the same configuration in the coming weeks. I have to emphasize that I tested different PG numbers and they didn't have a noticeable impact on cluster performance.
> >>>>>
> >>>>> The main issue is that this beautiful cluster isn't very fast. When I test against the 'volumes' pool, residing on the HDD storage class (HDDs with DB/WAL on NVMe), I get unexpectedly low throughput numbers:
> >>>>>
> >>>>>> rados -p volumes bench 30 write --no-cleanup
> >>>>> ...
> >>>>> Total time run:         30.3078
> >>>>> Total writes made:      3731
> >>>>> Write size:             4194304
> >>>>> Object size:            4194304
> >>>>> Bandwidth (MB/sec):     492.415
> >>>>> Stddev Bandwidth:       161.777
> >>>>> Max bandwidth (MB/sec): 820
> >>>>> Min bandwidth (MB/sec): 204
> >>>>> Average IOPS:           123
> >>>>> Stddev IOPS:            40.4442
> >>>>> Max IOPS:               205
> >>>>> Min IOPS:               51
> >>>>> Average Latency(s):     0.129115
> >>>>> Stddev Latency(s):      0.143881
> >>>>> Max latency(s):         1.35669
> >>>>> Min latency(s):         0.0228179
> >>>>>
> >>>>>> rados -p volumes bench 30 seq --no-cleanup
> >>>>> ...
> >>>>> Total time run:       14.7272
> >>>>> Total reads made:     3731
> >>>>> Read size:            4194304
> >>>>> Object size:          4194304
> >>>>> Bandwidth (MB/sec):   1013.36
> >>>>> Average IOPS:         253
> >>>>> Stddev IOPS:          63.8709
> >>>>> Max IOPS:             323
> >>>>> Min IOPS:             91
> >>>>> Average Latency(s):   0.0625202
> >>>>> Max latency(s):       0.551629
> >>>>> Min latency(s):       0.010683
> >>>>>
> >>>>> On average, I get around 550 MB/s writes and 800 MB/s reads with 16 threads and 4MB blocks. The numbers don't look fantastic for this hardware. I can actually push over 8 GB/s of throughput with fio, 16 threads and 4MB blocks from an RBD client (a KVM Linux VM) connected over a low-latency 40G network, probably hitting some OSD caches there:
> >>>>>
> >>>>>   READ: bw=8525MiB/s (8939MB/s), 58.8MiB/s-1009MiB/s (61.7MB/s-1058MB/s), io=501GiB (538GB), run=60001-60153msec
> >>>>> Disk stats (read/write):
> >>>>>   vdc: ios=48163/0, merge=6027/0, ticks=1400509/0, in_queue=1305092, util=99.48%
> >>>>>
> >>>>> The issue manifests when the same client does something closer to real-life usage, like a single-threaded write or read with 4KB blocks, as if using, for example, an ext4 file system:
> >>>>>
> >>>>>> fio --name=ttt --ioengine=posixaio --rw=write --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
> >>>>> ...
> >>>>> Run status group 0 (all jobs):
> >>>>>   WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=7694MiB (8067MB), run=64079-64079msec
> >>>>> Disk stats (read/write):
> >>>>>   vdc: ios=0/6985, merge=0/406, ticks=0/3062535, in_queue=3048216, util=77.31%
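Anthony's QD=1 point above is worth quantifying before drawing conclusions from this test: at a queue depth of 1, every 4k write waits for the previous one to complete, so throughput is bounded by per-request latency rather than by what the cluster can stream. A minimal comparison run, switching to libaio with direct I/O so the deeper queue actually reaches the virtual disk (job name and iodepth are arbitrary choices, everything else mirrors the test above):

fio --name=ttt-qd16 --ioengine=libaio --direct=1 --rw=write --bs=4k --numjobs=1 --size=4g --iodepth=16 --runtime=60 --time_based --end_fsync=1

If bandwidth scales roughly with the deeper queue, the QD=1 number above reflects round-trip latency (client, QEMU/libvirt, network, OSD) rather than a cluster bandwidth limit.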
> >>>>>> fio --name=ttt --ioengine=posixaio --rw=read --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
> >>>>> ...
> >>>>> Run status group 0 (all jobs):
> >>>>>   READ: bw=54.0MiB/s (56.7MB/s), 54.0MiB/s-54.0MiB/s (56.7MB/s-56.7MB/s), io=3242MiB (3399MB), run=60001-60001msec
> >>>>> Disk stats (read/write):
> >>>>>   vdc: ios=12952/3, merge=0/1, ticks=81706/1, in_queue=56336, util=99.13%
> >>>>>
> >>>>> And this is a total disaster: the IOPS look decent, but the bandwidth is unexpectedly low. I just don't understand why a single RBD client writes at 120 MB/s (sometimes slower), and 50 MB/s reads look like a bad joke ¯\_(ツ)_/¯
> >>>>>
> >>>>> When I run these benchmarks, nothing seems to be overloaded: CPU and network are barely utilized, and OSD latencies don't show anything unusual. Thus I am puzzled by these results, as in my opinion SAS HDDs with DB/WAL on NVMe drives should produce better I/O bandwidth, both for writes and reads. I mean, I can easily get much better performance from a single HDD shared over the network via NFS or iSCSI.
> >>>>>
> >>>>> I am open to suggestions and would very much appreciate comments and/or advice on how to improve the cluster performance.
> >>>>>
> >>>>> Best regards,
> >>>>> Zakhar

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
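As a follow-up to the plan of testing from a bare-metal client: fio can also drive RBD directly through librbd (assuming an fio build with RBD support), which takes QEMU/libvirt and the guest block layer out of the picture entirely. A rough sketch against the 'volumes' pool from the thread; the image name 'fio-test' and client name 'admin' are placeholders, and the test image has to be created first:

# throwaway test image, size chosen arbitrarily
rbd create volumes/fio-test --size 10G

fio --name=rbd-4k-qd1-write --ioengine=rbd --clientname=admin --pool=volumes --rbdname=fio-test --rw=write --bs=4k --iodepth=1 --runtime=60 --time_based

Comparing this against the same 4k QD=1 run inside a VM should show how much of the single-client limit comes from the virtualization layer and how much from the cluster itself.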