On Thu, Nov 8, 2018 at 2:04 AM Erwan Velu <evelu@xxxxxxxxxx> wrote:
>
> Thanks Noah & Zack for your answers.
>
> In fact what I need here is being able to anticipate, for a given
> version, what the JSON structure of a given command will be.
>
> Let's consider "ceph osd metadata -f".
>
> When I'm doing this
> (https://github.com/ErwanAliasr1/skydive/blob/1fa8c596823bcc53dd7fcecec8c9a529514a2a88/topology/probes/ceph/osd.go#L91),
> on a 12.2.5, I get
> https://github.com/ErwanAliasr1/skydive/blob/1fa8c596823bcc53dd7fcecec8c9a529514a2a88/topology/probes/ceph/osd.go#L39
>
> For OSDs, that's pretty trivial, but when you consider running "ceph -s
> -f json", I end up with
> https://github.com/ErwanAliasr1/skydive/blob/1fa8c596823bcc53dd7fcecec8c9a529514a2a88/topology/probes/ceph/cluster.go#L37
> and that's still not complete, as I didn't run every Ceph component on
> this cluster.
>
> For a 3rd party tool like mine, but surely also for the manager or some
> other ones, we should be able to anticipate what the expected schema
> for a release will be.
>
> If I understood your PR correctly, I'm not sure it covers the output
> of all the commands a user/3rd party can call.

You're right that the PR I referenced doesn't exactly solve your issue,
but I do think it may be a significant step towards what you are looking
for. Indeed, I have a similar (if not identical) challenge with the
ongoing work of integrating Ceph with Red Hat Insights.

Nearly all of the output produced by Ceph CLI commands using `-f json`
is driven by a JSON serialization of an internal data structure (or a
combination of data structures). The PR I posted takes a bottom-up
approach to tackling the higher-level goal that you and I share: it
associates a fully defined JSON schema with each data structure, and
adds a unit test, run as part of `make test`, that automates
verification of the schema.
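To make that verification idea concrete outside the Ceph tree: the check
boils down to "serialize a structure, then validate the result against a
fixed schema". Here is a toy sketch in Go (the language of Erwan's tool),
with a hand-rolled required-fields/type check standing in for a real JSON
Schema validator; the field names (`id`, `hostname`, `ceph_version`) are
illustrative assumptions, not the authoritative OSD metadata schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// A toy stand-in for a JSON schema: required field names mapped to the
// JSON type expected for each. These fields are assumptions for the
// sake of the example, not taken from Ceph's actual output.
var osdMetadataSchema = map[string]string{
	"id":           "number",
	"hostname":     "string",
	"ceph_version": "string",
}

// validate returns the names of required fields that are missing from
// doc or carry the wrong JSON type.
func validate(doc map[string]interface{}, schema map[string]string) []string {
	var bad []string
	for field, want := range schema {
		v, ok := doc[field]
		if !ok {
			bad = append(bad, field+" (missing)")
			continue
		}
		got := "unknown"
		switch v.(type) {
		case float64: // encoding/json decodes all JSON numbers as float64
			got = "number"
		case string:
			got = "string"
		case bool:
			got = "boolean"
		}
		if got != want {
			bad = append(bad, field+" (got "+got+")")
		}
	}
	return bad
}

func main() {
	// A hand-written sample document playing the role of one element
	// of `ceph osd metadata -f json` output.
	sample := []byte(`{"id": 0, "hostname": "osd-node-1", "ceph_version": "ceph version 12.2.5"}`)
	var doc map[string]interface{}
	if err := json.Unmarshal(sample, &doc); err != nil {
		panic(err)
	}
	fmt.Println(len(validate(doc, osdMetadataSchema))) // 0: the sample conforms
}
```

A real unit test would do the same thing with the structure's own
serialization method on one side and the published schema on the other.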
It is bottom-up in the sense that these structures are found embedded in
the final output of a Ceph CLI `-f json` request. This turns out to be
super convenient because the JSON Schema standard allows schema
composition: if the serialization of a structure is nested in the output
of other structures or CLI outputs, then the nested schemas can be
re-used.

When a CLI command's output corresponds exactly to one of these
low-level structures (as is sometimes the case now), things are easy. As
John points out, some of the CLI output is built up programmatically. In
those cases the output often contains many instances of structures
_with_ schemas, but the top-level schema may have an ad-hoc structure.
However, all is not lost! I think there is a fairly simple, albeit
tedious, way forward:

1) CLI schemas

Following the bottom-up approach, the next step may be to add schemas
for the CLI methods that reuse the low-level structure schemas where
possible.

1.a) The ideal option: instead of using a free-form / ad-hoc
construction of JSON output for CLI commands, define new internal data
structures that are built up to contain the final output. Such a
structure is then easy to associate with a serialization method and a
schema. This is ideal because we can define a covering set of structure
instantiations directly, which would otherwise be much harder to produce
by driving a cluster into each particular state that yields a given
output.

1.b) The quicker option (opinions may differ): in principle there is
still a deterministic covering set of possible outputs, which means that
there is a schema; it just needs to be teased out by reading through the
possible code paths for a particular CLI command.

2) Versioning

In the general case of a mixed-version cluster it would seem that either
(1) a data source must expose its version, or (2) data must be tagged
with a version. Based on what John mentioned "the OSDs themselves are
passing up a map of strings to strings.
Similarly, the servicemap is basically freeform." This level of
indirection seems to suggest that (2) is really the only option.
However, this too should be easy: nearly all of the structures that are
serialized to JSON also have associated serialization methods for
dumping out binary encodings, which themselves have access to version
information, so the data is self-identifying (at least that's my
understanding).

3) Schema publishing

This probably depends heavily on the user of the schema. The simplest
use case is verifying with unit tests, for a specific version, whether
CLI output matches the associated schema. For other programmatic tools,
I think that in order to handle mixed-version clusters easily, a new
meta-level manager (monitor?) command should be created that exposes the
covering set of schemas found in the cluster. For other users, schemas
could be published along with the docs or a *-dev[el] package.

Erwan, I realize that's a lot of brain dump. I think this is a really
important topic as Ceph is integrated into more and more places that
need machine-readable output!

- Noah

> So I wonder if I'm using the right interface or the right way to collect
> information about a Ceph cluster. If I want to make a structured
> representation of various Ceph releases (and containers will generate
> this situation), I wonder how to handle that :/
>
> On 07/11/2018 at 21:33, Noah Watkins wrote:
> > Hey Erwan,
> >
> > This sounds similar to something I started recently, but haven't been
> > able to finish completely. Although, it's actually probably pretty
> > close to being able to merge. Let me know if it seems like it might
> > help out and we can work out what's needed to handle your case.
> >
> > https://github.com/ceph/ceph/pull/23716
> >
> > - Noah
> >
> > On Wed, Nov 7, 2018 at 7:11 AM Erwan Velu <evelu@xxxxxxxxxx> wrote:
> >> Hi list,
> >>
> >> I'm working on a tool that reads the JSON output of several ceph
> >> commands.
> >>
> >> To ease the parsing, I've chosen the JSON format, which guarantees
> >> parseable output.
> >>
> >> I'm using the Unmarshal feature of golang to map this output onto an
> >> internal data structure, so every member of this JSON output is
> >> easily reachable from the code.
> >>
> >> That works pretty well, except that I have to "anticipate" what the
> >> members and their types will be.
> >>
> >> To do that, I've been transforming the sample output of my Ceph
> >> cluster (luminous) into a data struct with
> >> https://mholt.github.io/json-to-go/
> >>
> >> That works fine, except that I would need a complete output to get
> >> every possible combination of the JSON output.
> >>
> >> So instead of reverse engineering the JSON output, and doing that per
> >> version of Ceph since versions can change the format, how can I
> >> extract the complete JSON schema of each command?
> >>
> >> Erwan,
> >>
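P.S. Until published schemas exist, one way a Go tool can at least
*notice* when a release changes the output shape is to decode strictly
and treat unknown fields as a signal, rather than letting `Unmarshal`
silently drop them. A minimal sketch using the standard library's
`DisallowUnknownFields`; the struct fields and the sample inputs are
illustrative assumptions, not the authoritative `ceph osd metadata`
schema:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// An illustrative subset of OSD metadata fields; the real output has
// many more, and these names are assumptions for the example.
type osdMetadata struct {
	ID          int    `json:"id"`
	Hostname    string `json:"hostname"`
	CephVersion string `json:"ceph_version"`
}

// strictDecode fails if the input contains fields the target struct
// does not declare -- a cheap way to detect that a newer release has
// grown the output beyond what the tool anticipated.
func strictDecode(data []byte, v interface{}) error {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	return dec.Decode(v)
}

func main() {
	var m osdMetadata

	// All fields declared by the struct: strict decoding succeeds.
	known := []byte(`{"id": 2, "hostname": "node-a", "ceph_version": "12.2.5"}`)
	fmt.Println(strictDecode(known, &m) == nil) // true

	// A hypothetical field added by some newer release would be dropped
	// silently by plain json.Unmarshal, but strictDecode surfaces it.
	newer := []byte(`{"id": 2, "hostname": "node-a", "ceph_version": "13.2.0", "some_new_field": "x"}`)
	fmt.Println(strictDecode(newer, &m) == nil) // false
}
```

This doesn't replace a real schema, of course; it only turns "fields I
didn't anticipate" from silent data loss into an explicit error the tool
can log per Ceph version.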