RBD images Prometheus metrics : not all pools/images reported

Gilles Mocellin <gilles.mocellin@xxxxxxxxxxxxxx> · Tue, 16 Aug 2022 16:04:11 +0200

Hello Cephers,

I'm trying to diagnose who's doing what on our cluster, which has 
suffered from SLOW_OPS and periods of high latency since Pacific.

However, I can't see all pools / images in the RBD stats.
I had activated RBD image stats while running Octopus; now it seems we 
only need to define mgr/prometheus/rbd_stats_pools.
I have set it to '*' to catch all pools.

First question: even when I explicitly specify an EC data pool, it 
doesn't seem to have any stats.
I would understand if image stats were collected on the metadata pool.
Is that correct?

Second question: I can only see 3 pools in Prometheus metrics such as 
ceph_rbd_read_ops (out of ~20; I use OpenStack with all its pools).

So, either in the Dashboard graphs or in my Grafana, I can only see 
metrics concerning these pools.

Hmm, I've just noticed one thing... I have no images in the other pools. 
Gnocchi does not store images, my cinder-backup pool is empty, and so is 
my second cinder pool.
And finally, none of the radosgw pools store RBD images either.

So I think I have my answer to that second question.
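
A minimal sketch of how this could be double-checked, assuming the JSON 
returned by the Prometheus HTTP API (/api/v1/query) for the query 
ceph_rbd_read_ops is available as text. The function name and the sample 
pool/image names below are hypothetical; only the "pool" label comes 
from the queries in this mail.

```python
import json


def pools_with_rbd_series(response_text):
    """Return the sorted, distinct 'pool' label values found in a
    Prometheus instant-query response (resultType "vector")."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    series = payload["data"]["result"]
    return sorted({s["metric"]["pool"] for s in series if "pool" in s["metric"]})


# Hypothetical response: two images in "volumes", one in "images".
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pool": "volumes", "image": "vol-a"}, "value": [0, "10"]},
            {"metric": {"pool": "volumes", "image": "vol-b"}, "value": [0, "20"]},
            {"metric": {"pool": "images", "image": "img-a"}, "value": [0, "5"]},
        ],
    },
})

print(pools_with_rbd_series(sample))
```

Pools that hold no RBD images would simply not appear in the returned 
list, which matches what I see.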

Anyway, it's strange that I don't get the same value when comparing the 
pool statistics with the sum over the RBD images in that pool:

sum(irate(ceph_rbd_write_bytes{cluster="mycluster",pool="myvolumepool"}[1m]))
irate(ceph_pool_wr_bytes{cluster="mycluster",pool_id="myvolumedatapoolid"}[1m])

ceph_pool_wr_bytes on the data pool is more than 10 times the sum of all 
ceph_rbd_write_bytes on the metadata pool.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx