Re: Stability of prometheus/perf counter names

Gregory Farnum <gfarnum@xxxxxxxxxx> · Wed, 28 Feb 2018 11:32:46 -0800

On Wed, Feb 28, 2018 at 4:26 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> Inevitably, we're starting to hit cases where we have to think about
> compatibility when making changes to the prometheus output (same
> issues will apply to changes to perf counters that are passed
> through):
> https://github.com/ceph/ceph/pull/20506#issuecomment-368806208
>
> For the moment, we don't have any policy around this, so I anticipate
> things changing at will until the point that we make a policy, which
> might naturally coincide with introducing the in-tree grafana
> dashboards (because at that point the prometheus output will be
> demonstrably reasonably complete).
>
> Thinking about what kind of policy we want, the extremes would be:
> - Do nothing: all counters can change at will (even though most of the
> time they won't)
> - Match Ceph protocol interop: all changes would be backwards
> compatible through two versions
>
> We will soon have some official grafana dashboards in the Ceph tree,
> which I anticipate most people using, but there will certainly be
> people who craft their own dashboards too.  I'm hoping that the
> vendors shipping Ceph-based products will all be working with the
> in-tree dashboards, so this is probably more a topic of concern to
> large scale users.
>
> I think it's reasonable for people with custom dashboards to expect
> that we not knowingly break them with updates to our stable branches:
> that's a pretty easy thing for us to accomplish.
>
> The part that's probably more debatable is: should someone with an
> external dashboard built for luminous expect it to work seamlessly on
> mimic?  I would say probably not.  Because we make major internal
> changes between major releases, it's expected that various performance
> counters would go away or change in meaningful ways.
>
> Any thoughts?

This sounds reasonable to me, but....

Perfcounters are, by their nature, intimately tied to the specific
implementation details of any particular thing they're looking at. If
we expose the size of a waiting queue, we're telling people that queue
exists. If we tell them that it's a good thing to graph, we're telling
them the size of that queue is a good warning sign for some statistic
they actually care about.

So, if we demand that perfcounters remain stable (across any time
period), we're actually demanding that whatever we're exposing not
change meaningfully. Generally speaking, that *should* be true within
a stable release (we don't take stuff away), but sometimes things do
change. For instance, we changed the implementation of scrub and
snapshot sleeps so that they go on a waitlist rather than eating up an
op thread, and we did that in (several?) stable releases. It would be
reasonable to have perfcounters for both of those states[1], and you
could coerce them into having the same name, but the meaning and
interpretation would be radically different.

So, I'm not sure a strict policy is actually a good idea here, despite
my expectation that it wouldn't come up as an issue often.

Looking to your example though, we could certainly say that we don't
change names or structures unless the meaning changes. That seems like
a reasonable balance between making work for dashboard developers and
restricting our ability to mutate the system meaningfully.
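
To make that rule concrete, here is a rough sketch of how it could be
checked between two releases. Everything in it is an assumption for
illustration: the Counter type, the "meaning" tag, and the counter names
in the usage note are hypothetical, not anything that exists in the tree
today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counter:
    name: str     # exported counter name (hypothetical)
    meaning: str  # short tag for what the value measures (hypothetical)


def policy_violations(old, new):
    """Return (kind, old_name, new_name) tuples that break the rule
    "names change if and only if the meaning changes".

    Counters that simply disappear are not reported: removing
    something is allowed under this rule.
    """
    old_by_name = {c.name: c for c in old}
    new_by_name = {c.name: c for c in new}
    old_by_meaning = {c.meaning: c for c in old}
    problems = []

    # Same name, different meaning: dashboards keep working but
    # silently graph something else.
    for name, cur in new_by_name.items():
        prev = old_by_name.get(name)
        if prev is not None and prev.meaning != cur.meaning:
            problems.append(("meaning_changed_name_kept", name, name))

    # Same meaning, different name: needless work for dashboard authors.
    for cur in new:
        prev = old_by_meaning.get(cur.meaning)
        if (prev is not None and prev.name != cur.name
                and prev.name not in new_by_name):
            problems.append(
                ("renamed_without_meaning_change", prev.name, cur.name))

    return problems
```

For example, with made-up names, keeping "scrub_sleep" while it moves
from blocking an op thread to sitting on a waitlist would be flagged as
meaning_changed_name_kept, and renaming "op_queue_len" to "opq_len"
with no change in what it measures would be flagged as
renamed_without_meaning_change.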
-Greg
[1]: I'm not sure if we actually did or do. State-based counters would
be a good idea though, assuming we continue exposing stuff at the
developer/code-structure level.