Hi folks,

After much poking around with rules etc., I think I've realised how one is meant to do this kind of thing.

Prometheus queries have a powerful join-like capability that can be used for things like selecting instances running a particular version of software, without putting the version as a label on everything you want to filter:
https://www.robustperception.io/exposing-the-software-version-to-prometheus/

We can use that to establish a connection between disks and OSDs by emitting special metrics like this:

    ceph_disk_occupation{instance="mynode456", device="sdc", osd="123"}

Then someone writing a dashboard can get the metrics for the disk backing a known OSD ID with a query like this:

    node_disk_bytes_written and on (device, instance) ceph_disk_occupation{osd="123"}

One can also conveniently get a metric for all the disks that are in use by any Ceph OSD (but not every disk in the system) by leaving off the osd="123" part. I think we can also add rack= labels (or arbitrary crush ancestry) to these metrics, so that a UI can query disk/network stats by crush hierarchy.

There is a constraint: we need the instance= labels to match up between the metrics describing the OSD disk/nic usage and the metrics that come from node_exporter. By default the instance labels are the IP/host:port of an exporter, so one has to override them in the prometheus config to be just the hostname. That seems to be something some people do anyway to get cleaner instance labels in UIs.

I'm going to extend the ceph-mgr prometheus module to emit these metrics. To do the same trick for network cards as for disks, we need the OSDs to tell us their interface names, so I have a PR for that:
https://github.com/ceph/ceph/pull/16941

Cheers,
John
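P.S. To make the join concrete, here's roughly what the "all disks in use by any OSD" query could look like once the ceph_disk_occupation metrics exist. The second query assumes the hypothetical rack= label mentioned above, which nothing emits yet:

    # write throughput for every disk that backs some Ceph OSD
    rate(node_disk_bytes_written[5m]) and on (device, instance) ceph_disk_occupation

    # the same, restricted to one (hypothetical) crush rack
    rate(node_disk_bytes_written[5m]) and on (device, instance) ceph_disk_occupation{rack="r1"}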
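Also, for the instance-label override: a minimal sketch of the kind of relabelling that would do it, assuming a node_exporter scrape job -- the job name, target and regex here are just examples, not anything shipped:

    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets: ['mynode456:9100']
        relabel_configs:
          # rewrite instance from "mynode456:9100" to just "mynode456"
          - source_labels: [__address__]
            regex: '([^:]+).*'
            target_label: instance
            replacement: '${1}'

The ceph-mgr module would then emit instance labels that are bare hostnames as well, so the two sides of the join agree.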