RFC: Telemetry revamp

Hi all,

I've had a chat with Sage & Dan Mick about the current state of
telemetry, and I'd like to propose a few ideas to hopefully improve it
and make the data collected more relevant.

The current data is quite limited. I was able to take a look at, say,
how many pools out there (well, across the ~300-ish clusters that have
ever reported) have a non-2^n pg_num, but seeing whether this affects
performance or data distribution was not possible.

My goal is to have telemetry data that allows us to make more informed
decisions about what matters to the user base; the comments below are
not necessarily ordered by relevance, since they grew out of a thread on
looking at the current data reported.

Curious about your thoughts - is this too much detail? Anything you'd
like to see included? What would help you in your area?

- The crash section does expose actual hostnames ("entity_name"). If we
  want to preserve the ability to see whether it's the same entity
  crashing or a different one, I'd propose that, similar to report_id,
  we generate a report_secret_salt in the plugin that we don't share
  with the server - we can then use it to hash any potentially
  identifying strings consistently (see the sketch below).

  (This will change with Sage's pending PR to point this at a different
  channel.)
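
  A minimal sketch of that hashing; the report_secret_salt handling and
  all names here are illustrative, not existing plugin code:

      import hashlib
      import hmac
      import uuid

      # Generated once by the plugin, stored alongside report_id,
      # and never included in the report itself.
      report_secret_salt = uuid.uuid4().hex

      def anonymize(value, salt=report_secret_salt):
          # Same input -> same digest, so the server can still tell
          # "same entity crashed again" from "a different entity
          # crashed" without ever seeing the real entity_name.
          return hmac.new(salt.encode(), value.encode(),
                          hashlib.sha256).hexdigest()

      # e.g.: crash['entity_name'] = anonymize(crash['entity_name'])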

- The pool reporting should include:
  - EC policy (plugin, parameters)
    - I can tell whether a pool is EC, and the total k+m, but not even
      k or m individually ...
  - Pool application association (and it'd be lovely if we could tell
    data/metadata pools apart for CephFS/RBD)
  - Possibly per-pool usage? (A sketch of such a record follows.)
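
  To make that concrete, a per-pool record could look roughly like this
  (the field names are illustrative, not an existing schema):

      pools: [
        {
          pool: 7,
          type: "erasure",
          ec_profile: { plugin: "jerasure", k: 4, m: 2 },
          application: "cephfs",
          role: "data",          (vs. "metadata")
          pg_num: 128,
          stored_bytes: ...      (if per-pool usage is included)
        },
        ...
      ]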

- The report should include the list of enabled plugins
  - Plugins should have a standard API call to report their own
    telemetry (see the sketch below)
  - e.g., balancer/pg_autoscaler settings come to mind
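
  A sketch of what such a standard call could look like; the hook name
  and its shape are hypothetical, not an existing mgr interface:

      from mgr_module import MgrModule

      class Module(MgrModule):
          # Hypothetical hook: the telemetry module would call this on
          # every enabled module and merge the results under the
          # module's name, e.g. {"balancer": {"mode": "upmap"}, ...}
          def get_telemetry(self):
              return {
                  'mode': self.get_module_option('mode'),
              }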

- The way the ceph version/os/distro/kernel/description/cpu/arch fields
  are currently aggregated individually makes them very difficult to
  analyze. In case you're not familiar, it looks something like this
  (trimmed):

        "kernel_version": {
          "4%15%0-54-generic": 6,
          "4%15%0-50-generic": 20,
          "4%18%0-25-generic": 3
        },
        "ceph_version": {
          "ceph version 14%2%1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)": 29
        },
        "kernel_description": {
          "#58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019": 6,
          "#54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019": 20,
          "#26~18%04%1-Ubuntu SMP Thu Jun 27 07:28:31 UTC 2019": 3
        },
        "cpu": {
          "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3%50GHz": 20,
          "Intel(R) Core(TM) i7-7700 CPU @ 3%60GHz": 9
        }
      }

  I'd rather see it aggregated at the tuple level (a small sketch of how
  to compute this follows the example):

  environment: [
    {
      kernel_version: "4%15%0-54-generic",
      arch: "x86_64",
      distro: "ubuntu",
      cpu: "Intel(R) Core(TM) i7-7700 CPU @ 3%60GHz",
      kernel_description: "...",
      ceph_version: "...",
      count: 6
    },
    ...
  ]
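
  One way the plugin could compute this, assuming the per-host metadata
  is already available as a list of dicts:

      from collections import Counter

      KEYS = ('kernel_version', 'arch', 'distro', 'cpu',
              'kernel_description', 'ceph_version')

      def aggregate_environment(hosts):
          # Count identical tuples instead of aggregating each field on
          # its own, so correlations between the fields are preserved.
          counts = Counter(tuple(h.get(k) for k in KEYS) for h in hosts)
          return [dict(zip(KEYS, combo), count=n)
                  for combo, n in counts.items()]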

- The OSD section could be revamped to expose more details; it is
  currently overly simplified. Is BlueStore used with rotational media?
  NVMe? SSD? Is FileStore used? On which media, and, possibly, how big
  are the WAL/RocksDB/data partitions?

  Is encryption used? Were the OSDs deployed with ceph-volume,
  ceph-disk, ...? Which file system is used with FileStore? Is uring
  enabled? Etc.

  (In short, ceph osd metadata should probably grow to encompass this,
  and telemetry would scrape a subset of it - see the sketch below.)
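
  For illustration, scraping a whitelisted subset of what ceph osd
  metadata already reports; which keys are present varies by
  objectstore and release, so treat the field list as an assumption:

      import json
      import subprocess

      # Illustrative subset; availability of individual keys depends
      # on the objectstore and the release.
      FIELDS = ('osd_objectstore', 'rotational', 'default_device_class',
                'bluestore_bdev_type', 'bluestore_bdev_rotational')

      def scrape_osd_metadata():
          out = subprocess.check_output(
              ['ceph', 'osd', 'metadata', '--format=json'])
          return [{k: md.get(k) for k in FIELDS}
                  for md in json.loads(out)]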

- While we're on hardware, I'd like to know if there's a separate
  cluster/public network, and whether we can deduce the hardware
  associated with it (10 GbE? 25 GbE? VLANs? bonding? etc.)

- Are there any msgr features we'd want to know about? v2? Encryption?

- Anything on the MDS?

- RFC: include "ceph features"?

- There's no actual performance data (commit latency or anything else).
  Could we grab a histogram, or at least min/max/avg/stddev/sum(?), of
  some high-level metrics since the last report from the Prometheus
  instance that most recent environments are likely to have?
  
  (I'd like to see if we can deduce that a certain update made the
  clusters in the field slower or faster; a sketch of pulling such
  numbers from Prometheus follows.)
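
  Here the metric name (ceph_osd_commit_latency_ms) and the local
  endpoint are assumptions about what the mgr prometheus module exposes
  and where it is scraped:

      import json
      import urllib.parse
      import urllib.request

      PROM = 'http://localhost:9090/api/v1/query'

      def summarize(metric='ceph_osd_commit_latency_ms', window='24h'):
          # Aggregate the per-OSD averages over the window across the
          # whole cluster; returns e.g. {'min': ..., 'stddev': ...}.
          results = {}
          for agg in ('min', 'max', 'avg', 'stddev'):
              expr = '%s(avg_over_time(%s[%s]))' % (agg, metric, window)
              url = PROM + '?' + urllib.parse.urlencode({'query': expr})
              with urllib.request.urlopen(url) as resp:
                  data = json.load(resp)
              results[agg] = float(data['data']['result'][0]['value'][1])
          return results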

- I'd love to see data on OSD utilization/variance as well. (I could
  have used that this morning to check how this varied for clusters with
  non-2^n pg_num, but it'd also help us show the improvement over time
  as folks roll out the new automations etc.)

  We can either grab this from the OSD daemons, or again ask Prometheus
  (a sketch of one option is below).
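
  Going through the cluster itself rather than Prometheus: ceph osd df
  already reports per-OSD utilization, so something like this could
  compute the spread (treat the exact field names as assumptions):

      import json
      import statistics
      import subprocess

      def osd_utilization_spread():
          out = subprocess.check_output(
              ['ceph', 'osd', 'df', '--format=json'])
          # Per-OSD utilization in percent, as reported by osd df.
          util = [n['utilization'] for n in json.loads(out)['nodes']]
          return {
              'min': min(util),
              'max': max(util),
              'mean': statistics.mean(util),
              'stddev': statistics.pstdev(util),
          }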

- Do we know anything about the client versions talking to us, beyond
  require_min_compat_client?

- We may want to get more details on the services/gateways (iSCSI, NFS,
  CIFS). Even just knowing whether they're used would be good.

- I'd pull contact/organization/description into a separate section and
  channel. We'll also need to document what this information is used
  for.

Basically, this is a long laundry list of wishes for more detail. ;-)

I'm wondering what the best way is to track all these wishes and then
decide which to fulfil.


Regards,
    Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli Zbinden)