Re: ceph and openstack throttling experience

We have a similar setup, way smaller though (~120 OSDs right now) :)

Our VMs have different caps, but most are limited to 500 write / 1000 read IOPS; you can see the caps in effect here:
https://cloud-ceph-performance-tests.toolforge.org/
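In case it's useful, one common way to apply that kind of cap (not necessarily exactly how our deployment does it) is through Nova flavor extra specs, which libvirt turns into per-disk iotune limits; the flavor name here is just an example:

    openstack flavor set \
      --property quota:disk_write_iops_sec=500 \
      --property quota:disk_read_iops_sec=1000 \
      example-capped-flavor

For Cinder volumes the equivalent is a volume QoS spec (read_iops_sec / write_iops_sec) associated with the volume type, so the limit follows the volume rather than the flavor.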

We are currently running Octopus v15.2.11.

It's a very bare UI (still under construction), but check for example the 'after_ceph_upgrade_v2' run, the 'vm_disk' suite, and the
'RunConfig(rw=randread, bs=4096, ioengine=libaio, iodepth=1)' and
'RunConfig(rw=randwrite, bs=4096, ioengine=libaio, iodepth=1)' tests, which hit the cap.

From there you can also see the numbers of the tests running uncapped (in the 'rbd_from_hypervisor' or 'rbd_from_osd'
suites).
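
Those RunConfig strings map one-to-one to fio options, so if you want to run a comparable test inside one of your own VMs, something along these lines should be close (the filename, size and runtime are just placeholders; --direct=1 is added so the page cache doesn't hide the cap):

    fio --name=randwrite --rw=randwrite --bs=4096 \
        --ioengine=libaio --iodepth=1 --direct=1 \
        --filename=/srv/fio-test.img --size=1G \
        --runtime=60 --time_based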

You can see the current IOPS of our Ceph cluster here:
https://grafana.wikimedia.org/d/7TjJENEWz/wmcs-ceph-eqiad-cluster-overview?orgId=1
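
If you just want the raw numbers without Grafana, the same client op/s show up on the CLI as well:

    ceph -s                # cluster-wide client IO, including op/s
    ceph osd pool stats    # per-pool client IO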

And of our OpenStack setup:
https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?orgId=1&refresh=15m

And some details on the traffic OpenStack puts on each Ceph OSD host here:
https://grafana.wikimedia.org/d/wsoKtElZk/wmcs-ceph-eqiad-network-utilization?orgId=1&refresh=5m

We are working on revamping those graphs right now, so it might become easier to see numbers in a few weeks.


We don't usually see slow ops with the current load, though we recommend not using Ceph for very latency-sensitive VMs
(like etcd), as there are some hardware limits at the network layer that we can't remove right now.

Hope that helps.

On 06/10 10:54, Marcel Kuiper wrote:
> Hi
> 
> We're running Ceph Nautilus 14.2.21 (going to latest Octopus in a few weeks)
> as the volume and instance backend for our OpenStack VMs. Our clusters run
> somewhere between 500 and 1000 OSDs on SAS HDDs, with NVMes as journal and DB
> devices.
> 
> Currently we do not have our VMs capped on IOPS or throughput. We
> regularly get slow ops warnings (once or twice per day) and wonder whether
> there are other users with roughly the same setup who do throttle their
> OpenStack VMs.
> 
> - What kind of numbers are used in the field for IOPS and throughput
> limiting?
> 
> - As a side question, is there an easy way to get rid of the slow ops warning
> besides restarting the involved OSD? Otherwise the warning seems to stay
> there forever.
> 
> Regards
> 
> Marcel
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

-- 
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
