Hey folks,
I have a Ceph cluster backing about 500 VMs via RBD. I am seeing around 10-12k IOPS cluster-wide, and IO wait time is creeping up inside the VMs.
My suspicion is that I am pushing the cluster to its limit in terms of overall throughput. I am curious whether there are metrics that can be passively collected, either inside the VMs or on the Ceph nodes, that reveal when the cluster is nearing its peak. IO wait time inside the VMs might be a good one, but I am interested in monitoring the Ceph nodes directly as well. Ideally I want to track those metrics, do some trend analysis, and provision capacity (not space, but throughput) before VM performance is impacted.
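
To make the question concrete, here is a rough sketch of the kind of collector I have in mind: it polls "ceph osd perf" once a minute and emits timestamped per-OSD commit/apply latencies that a trending tool (graphite, rrdtool, etc.) could ingest. This is only a sketch, and I know the JSON layout of that command differs between Ceph releases, so the key lookups would need adjusting for a given version:

#!/usr/bin/env python
# Sketch: sample per-OSD commit/apply latency from "ceph osd perf"
# and print timestamped lines for a trending tool to ingest.
# Assumes the ceph CLI is on PATH with working admin credentials.
import json
import subprocess
import time

def sample_osd_latencies():
    out = subprocess.check_output(["ceph", "osd", "perf", "--format", "json"])
    data = json.loads(out)
    # Some releases return the list at the top level, others nest it
    # under "osdstats"; handle both.
    infos = (data.get("osd_perf_infos")
             or data.get("osdstats", {}).get("osd_perf_infos", []))
    for info in infos:
        stats = info["perf_stats"]
        yield info["id"], stats["commit_latency_ms"], stats["apply_latency_ms"]

while True:
    ts = time.strftime("%Y-%m-%dT%H:%M:%S")
    for osd_id, commit_ms, apply_ms in sample_osd_latencies():
        print("%s osd.%d commit_latency_ms=%d apply_latency_ms=%d"
              % (ts, osd_id, commit_ms, apply_ms))
    time.sleep(60)

The idea would be to graph those latencies over time alongside in-VM iowait and watch for sustained climbs, but I am open to better signals if others have found them.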
Any thoughts or experience on this matter?
Thanks.
-Simon