Re: exported data schema validation

Gregory Farnum <gfarnum@xxxxxxxxxx> · Thu, 23 Aug 2018 15:31:06 -0700

On Thu, Aug 23, 2018 at 3:11 PM, Noah Watkins <noahwatkins@xxxxxxxxx> wrote:
> David and I had a productive conversation yesterday about validating
> the schema of data that is exported through various cli tools and
> apis. Here is a summary / proposal based on that conversation.
>
> One goal of the insights manager module is to export a consistent set
> of data to the insights engine (think data like `ceph osd dump`).
> Currently the insights module exports several json representations of
> data structure (e.g. MonMap), but this means the schema of the
> exported data is generally codified in methods like MonMap::dump,
> making it difficult for consumers of this data to monitor for schema
> changes (and likewise for developers to notice unintended schema
> changes). This isn't just an insights issue, there is a _lot_ of
> Python written that assumes a particular schema.
>
> One way to handle this is to create a centralized location containing
> schema specifications of data exported by ceph, and running tests to
> validate data sources against these expected specifications as code
> changes are made.
>
> I posted a PR (https://github.com/ceph/ceph/pull/23716) that does this
> for the generated test instances embedded in `ceph-dencoder` (e.g.
> MonMap), so that `make check` fails if a change to MonMap::dump is not
> synchronized with changes to the schema
> (https://github.com/ceph/ceph/pull/23716/commits/5fa58d30e03ad67a64feb2046dee32d753db6f20).
>
> Building schemas first for the lowest level structures is convenient
> because... the schemas for CLI commands or other APIs that embed
> nested structures may take advantage of a feature in the JSON schema
> spec that allows schema references so we can compose schemas. This
> works well to DRY on things like "utime_t".
>
> Expanding the approach taken in this PR a bit will cover most of what
> is needed for the insights manager module. Does this seem like a
> decent approach?

Unless I'm misunderstanding, we already do exactly this for the
human-readable CLI interfaces with the cram tool. (And it can be quite
annoying if you don't update the test standard on removing a CLI
option!) See the contents of ceph.git:src/test/cli and the
ceph.git:src/test/run-cli-tests script

Generally, it seems like we ought to be able to extend those to use
the json formatting rather than just human-readable output. Or are you
trying to do something more powerful?
-Greg