Re: ceph and openstack throttling experience

Hi David,

That is very helpful, thank you. When looking at the graphs I noticed that the bandwidth used appears to be very low. Or am I misinterpreting the bandwidth graphs?

Regards

Marcel

David Caro wrote on 2021-06-10 11:49:
We have a similar setup, way smaller though (~120 OSDs right now) :)

We have VMs capped at different levels, but most have a 500 write / 1000
read IOPS cap; you can see it in effect here:
https://cloud-ceph-performance-tests.toolforge.org/
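
In case it's useful: a common way to apply caps like that in OpenStack is a
front-end Cinder QoS spec attached to a volume type, plus flavor extra specs
for flavor-defined disks. This is just a generic sketch, not necessarily how
we set ours up; the 'limited-iops', 'standard' and 'myflavor' names are
placeholders:

  # Front-end (hypervisor-enforced) QoS spec: 1000 read / 500 write IOPS
  openstack volume qos create limited-iops \
      --consumer front-end \
      --property read_iops_sec=1000 \
      --property write_iops_sec=500
  # Attach it to a volume type so new volumes of that type get the cap
  openstack volume qos associate limited-iops standard

  # Equivalent caps for disks defined by the flavor
  openstack flavor set myflavor \
      --property quota:disk_read_iops_sec=1000 \
      --property quota:disk_write_iops_sec=500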

We are currently running Octopus v15.2.11.

It's a very 'bare' UI (under construction), but check
'after_ceph_upgrade_v2' for example: in the 'vm_disk' suite, the
'RunConfig(rw=randread, bs=4096, ioengine=libaio, iodepth=1)' and
'RunConfig(rw=randwrite, bs=4096, ioengine=libaio, iodepth=1)' tests
hit the cap.
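
Those configs map roughly onto a plain fio invocation, if anyone wants to
reproduce them from inside a VM (the test file path and size here are
arbitrary):

  # 4k random read, libaio, queue depth 1 - similar to the capped test above
  fio --name=randread-qd1 --rw=randread --bs=4096 \
      --ioengine=libaio --iodepth=1 --direct=1 \
      --size=1G --runtime=60 --time_based \
      --filename=/var/tmp/fio-testfile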

From there you can also see the numbers of the tests running uncapped
(in the 'rbd_from_hypervisor' or 'rbd_from_osd'
suites).

You can see the current IOPS of our Ceph cluster here:
https://grafana.wikimedia.org/d/7TjJENEWz/wmcs-ceph-eqiad-cluster-overview?orgId=1

And of our OpenStack setup:
https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?orgId=1&refresh=15m

And some details on the traffic OpenStack puts on each Ceph OSD host here:
https://grafana.wikimedia.org/d/wsoKtElZk/wmcs-ceph-eqiad-network-utilization?orgId=1&refresh=5m
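
If you don't have dashboards handy, roughly the same numbers can be pulled
straight from the Ceph CLI:

  # Cluster-wide client throughput and IOPS (the 'io:' line)
  ceph -s
  # Per-pool client and recovery I/O rates
  ceph osd pool stats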

We are working on revamping those graphs right now, so it might become
easier to see numbers in a few weeks.


We don't usually see slow ops with the current load, though we
recommend not using Ceph for very latency-sensitive VMs (like etcd),
as there are some hardware limits on the network layer that we can't
remove right now.
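
If you want to gauge whether a backend is fast enough for etcd-like,
fsync-heavy workloads, the usual check is a small fdatasync latency test
along these lines (the sizes are the commonly quoted ones and the path is a
placeholder; look at the fsync/fdatasync percentiles in the output):

  # Sequential small writes with an fdatasync after each one
  fio --name=etcd-fsync --rw=write --ioengine=sync --fdatasync=1 \
      --bs=2300 --size=22m --filename=/var/lib/etcd/fio-testfile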

Hope that helps.

On 06/10 10:54, Marcel Kuiper wrote:
Hi

We're running Ceph Nautilus 14.2.21 (going to the latest Octopus in a few
weeks) as the volume and instance backend for our OpenStack VMs. Our clusters
run somewhere between 500 and 1000 OSDs on SAS HDDs, with NVMes as journal
and DB devices.
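
For reference, an OSD with that kind of data-on-HDD / DB-on-NVMe split is
typically created with something like the following (device paths are
placeholders):

  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1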

Currently we do not have our VMs capped on IOPS and throughput. We
regularly get slow ops warnings (once or twice per day) and wonder whether
there are other users with roughly the same setup who do throttle their
OpenStack VMs.

- What kind of numbers are used in the field for IOPS and throughput
limiting?

- As a side question, is there an easy way to get rid of the slow ops
warning besides restarting the involved OSD? Otherwise the warning seems to
stay around forever.

Regards

Marcel
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


