Hi all,

I've had a chat with Sage & Dan Mick about the current state of telemetry, and I'd like to propose a few ideas to hopefully improve it and make the data collected more relevant.

The current data is quite limited. I was able to take a look at, say, how many pools out there (well, of the ~300ish clusters that ever reported) have a non-2^n pg_num, but seeing whether this affects performance or data distribution was not possible.

My goal is to have telemetry data that allows us to make more informed decisions about what matters to the user base; the comments below are not necessarily ordered by relevance, since they grew out of a thread on looking at the currently reported data.

Curious about your thoughts - is this too much detail? Anything you'd like to see included? What would help you in your area?

- The crash section currently exposes actual hostnames ("entity_name"). If we want to preserve the ability to see whether it's the same entity crashing or a different one, I'd propose that, similar to report_id, we generate a report_secret_salt in the plugin that we don't share with the server - we can then use it to hash any potentially identifying strings consistently. (This will change with Sage's pending PR to point this at a different channel.) A rough sketch of what I mean follows below the list.

- The pool reporting should include:
  - EC policy (plugin, parameters) - I can tell whether a pool is EC, and k+m, but not even k or m individually ...
  - Pool application association (and it'd be lovely if we could tell data/metadata pools apart for CephFS/RBD)
  - Possibly per-pool usage?

- The report should include the enabled plugins.

- Plugins should have a standard API call to report their own telemetry - e.g., balancer/pg_autoscaler settings come to mind.

- The way the ceph version/os/distro/kernel/description/cpu/arch fields are currently aggregated individually makes them very difficult to analyze. In case you're not familiar, it looks something like this (trimmed):

  "kernel_version": {
    "4%15%0-54-generic": 6,
    "4%15%0-50-generic": 20,
    "4%18%0-25-generic": 3
  },
  "ceph_version": {
    "ceph version 14%2%1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)": 29
  },
  "kernel_description": {
    "#58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019": 6,
    "#54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019": 20,
    "#26~18%04%1-Ubuntu SMP Thu Jun 27 07:28:31 UTC 2019": 3
  },
  "cpu": {
    "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3%50GHz": 20,
    "Intel(R) Core(TM) i7-7700 CPU @ 3%60GHz": 9
  }

  I'd rather see it aggregated at the tuple level (there's a second sketch below the list of how this could be produced):

  environment: [
    {
      kernel_version: "4%15%0-54-generic",
      arch: "x86_64",
      distro: "ubuntu",
      cpu: "Intel(R) Core(TM) i7-7700 CPU @ 3%60GHz",
      kernel_description: "...",
      ceph_version: "...",
      count: 6
    },
    ...
  ]

- The OSD section could be revamped to expose more details; right now it is overly simplified. Is BlueStore used with rotational media? NVMe? SSD? Is FileStore? On which media, and, possibly, how big are the WAL/RocksDB/data partitions? Is encryption used? Were these deployed via ceph-volume, ceph-disk, ...? Which file system is used with FileStore? Have we enabled io_uring? Etc. (In short, "ceph osd metadata" should probably grow to encompass this, and telemetry would scrape a subset of it.)

- While we're on hardware, I'd like to know whether there's a separate cluster/public network, and whether we can deduce the hardware associated with it (10 GbE? 25 GbE? VLAN? bond? etc.)

- Are there any msgr features we'd want to know about? v2? Encryption?

- Anything on the MDS?

- RFC: include "ceph features"?

- There's no actual performance data (commit latency or anything else). Could we grab a histogram or at least min/max/avg/stddev/sum(?) of some high-level metrics since the last report from the Prometheus instance that most recent environments would likely have? (I'd like to see whether we can deduce that a certain update made the clusters in the field slower or faster.)

- I'd love to see data on OSD utilization/variance as well. (I could have used that this morning to check how this varied for clusters with non-2^n pg_num, but it'd also help us show the improvement over time as folks roll out the new automations etc.) We can either grab this from the OSD daemons, or again ask Prometheus. (Third sketch below.)

- Do we know anything about the client versions talking to us, beyond require_min_compat_client?

- We may want to get more details on the services/gateways (iSCSI, NFS, CIFS). Even just whether they're used would be good.

- I'd pull contact/organization/description into a separate section and channel. We'll need to also document what this information is used for.
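To make the hashed-hostname idea a bit more concrete, here is a minimal sketch in plain Python; report_secret_salt and hash_identifier are just names I made up for illustration, not anything that exists in the telemetry module today:

  import hashlib
  import hmac
  import uuid

  # Generated once per cluster, stored next to report_id in the module
  # config, and never sent to the telemetry server.
  report_secret_salt = uuid.uuid4().hex

  def hash_identifier(value, salt=report_secret_salt):
      # Keyed hash: the same hostname always maps to the same digest
      # within one cluster, so the server can still tell "the same
      # entity crashed again", but the actual name never leaves the
      # cluster.
      return hmac.new(salt.encode(), value.encode(),
                      hashlib.sha256).hexdigest()

  # e.g.: crash['entity_name'] = hash_identifier(crash['entity_name'])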
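Similarly, the tuple-level aggregation could be quite simple on the reporting side. Again just an illustrative sketch, where hosts stands for whatever per-host metadata we already collect:

  from collections import Counter

  def aggregate_environment(hosts):
      # hosts: iterable of per-host metadata dicts; returns the proposed
      # list of distinct (kernel, arch, distro, cpu, ...) combinations
      # with a count per combination.
      fields = ('kernel_version', 'arch', 'distro', 'cpu',
                'kernel_description', 'ceph_version')
      counts = Counter(tuple(h.get(f) for f in fields) for h in hosts)
      return [dict(zip(fields, combo), count=n)
              for combo, n in counts.most_common()]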
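And for the OSD utilization/variance wish, even just shipping aggregates would go a long way; something along these lines, with osd_utilizations being a list of per-OSD usage percentages, however we end up collecting them:

  import statistics

  def utilization_summary(osd_utilizations):
      # Only aggregates leave the cluster, not per-OSD figures.
      return {
          'min': min(osd_utilizations),
          'max': max(osd_utilizations),
          'avg': statistics.mean(osd_utilizations),
          'stddev': statistics.pstdev(osd_utilizations),
          'num_osds': len(osd_utilizations),
      }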
Basically, this is a long laundry list of wishes for more detail. ;-) I'm wondering what the best way is to track all of these wishes and then decide which ones to fulfil.

Regards,
    Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli Zbinden)