On Wed, Oct 13, 2021 at 08:29:53PM -0500, Rob Herring wrote:
> On Wed, Oct 13, 2021 at 1:26 AM David Gibson
> <david@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Mon, Oct 11, 2021 at 08:22:54AM -0500, Rob Herring wrote:
> > > On Mon, Oct 11, 2021 at 2:19 AM David Gibson
> > > <david@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jul 27, 2021 at 12:30:22PM -0600, Rob Herring wrote:
> > > > > YAML output was restricted to dts input as there are some dependencies
> > > > > on source annotations which get lost with other input formats. With the
> > > > > addition of markers by the checks, YAML output from dtb format becomes
> > > > > more useful.
> > > > >
> > > > > Signed-off-by: Rob Herring <robh@xxxxxxxxxx>
> > > >
> > > > Urgh.  There's not really anything wrong with this patch of itself,
> > > > but it really underlines my feeling that the whole yaml output thing
> > > > is a bunch of hacks in pursuit of a bogus goal.
> > >
> > > Validating DTs is a bogus goal?
> >
> > Goal probably wasn't the right word.  Validating DTs is fine.  The
> > bogosity comes from doing the conversion to YAML essentially without
> > reference to the bindings / schemas.  Bindings describe how to
> > interpret the DT's bytestrings into meaningful numbers or whatever, so
> > using the bindings is the only reliable way of converting those
> > bytestrings into some more semantically useful representation.
>
> That is exactly the direction I'm going.

Ok, that's good to hear.

> The YAML format can change if
> we need it to (remember the plugin interface?).

See, I find that worrying, not reassuring.  It feels like dtc is
chasing a fuzzy moving target with the yaml output.  I can see no
clear line between what parts of the decoding should be done by dtc
(in making the yaml type choices) and what parts should be done by
whatever consumes it.

Even if we could define a line, AFAICT it would necessarily require
dtc to know about *every* binding.  Not every part of every binding,
but at least part of every binding (enough to make those type
choices).  Encoding even part of every binding is an unbounded amount
of work, and not something that was ever really intended to be in
dtc's scope.

Now, I realize I kind of started that fuzziness by introducing the
checks.  But there's a real difference between having some checks for
the most common errors and *requiring* annotation from the checks in
order to consume the output.  I don't see any sensible place to stop
with incorporating this stuff into the checks, short of absorbing the
entire validation effort, which I don't think either of us wants.

In the meantime the only real spec for what dtc needs to output in
yaml mode is "what the current validation tools want", which means you
have to watch for version synchronization between dtc and the
validation tools, which sounds like a real pain.

On top of that, even if we had a clear boundary between "first stage"
and "second stage" validation, I think YAML has some pretty serious
drawbacks as the format for the first to communicate to the second.
The main one being that we can't safely communicate 64-bit ints across
it (YAML tooling tends to treat numbers the way JSON does, i.e. as
floats, which can't safely carry integers above ~2^53).  It also can't
naturally represent the "blobs" that sometimes appear in dtbs, if
they're not valid Unicode.  Then there's the "Norway problem"[0].  I'm
pretty sure we quote all our strings so we won't hit that one, but it
definitely gives me the heebie-jeebies about trusting YAML parsers
with anything requiring precision.
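To make that concrete, here's what I mean with PyYAML and plain Python
floats -- just an illustrative snippet, nothing to do with dtc or
dtschema:

    import yaml

    # YAML 1.1 implicit typing: an unquoted "no" becomes a boolean.
    print(yaml.safe_load("country: no"))     # {'country': False}
    print(yaml.safe_load("country: 'no'"))   # {'country': 'no'}

    # Precision: any consumer that stores numbers as IEEE-754 doubles
    # (as JSON-style tooling commonly does) can't round-trip full
    # 64-bit values.
    big = (1 << 53) + 1
    print(float(big) == float(big - 1))      # True -- the +1 is lost

Quoting the strings avoids the first problem, but the second is
inherent to any pipeline that lets the values decay to doubles.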
> > > > Yaml output wants to include information that simply isn't present in
> > > > the flattened tree format (without considering bindings), so it relies
> > > > on formatting conventions in the dts, hence this test in the first
> > > > place.  This alleges it removes a restriction, but it only works if a
> > > > bunch of extra heuristics are able to guess the types correctly.
> > >
> > > The goal here is to validate dtb files which I'd think you'd be in
> > > favor of given your above opinions. For that to work, we have to
> > > transform the data into the right types somewhere.
> >
> > Yes - and that should be done with reference to specific bindings, not
> > using fragile heuristics.
> >
> > > We don't need any
> > > heuristics for that. For the most part, it is done using the
> > > definitive type information from the schemas themselves to format the
> > > data.
> >
> > Exactly.  That type information should come *from the schemas*.  Not
> > from separately maintained and fragile approximations to parts of the
> > schemas embedded into dtc.
>
> The same can be said for every client program, too. But we're so far
> away from all knowledge about a binding flowing from a single source.
> I'd love it if we could just generate the parsing code out of the
> schemas to populate typed C structs for the OS to consume. The reality
> is that knowledge about bindings resides in multiple places and dtc is
> one of them.

That's really not true on the dtb client side.  No, we don't have
automated tooling translating a machine readable binding into code.
However, generally all the knowledge *is* in the (human readable)
binding, and the client will have a (manual) translation of all that
into code for the properties it cares about.  Automated tooling would
be great, but even absent that, dtb clients read and decode
*bytestrings*, not structured data, and dtc generates bytestrings just
fine.

> > > The exception is #*-cells patterns which need to parse the tree
> > > to construct the type information. Given dtc already has all that
> > > knowledge in checks, it's easier to do it there rather than
> > > reimplement the same parsing in python.
> >
> > dtc only has parts of that knowledge in checks.  The checks have been
> > written with the assumption that in ambiguous cases we can just punt
> > and not run the check.  For the goal of truly parsing everything, the
> > current design of the checks subsystem really isn't adequate.
>
> Yes, but handling 'foos' plus '#foo-cells' is a limited problem space

Everything like this is a limited problem space, but there's an
unbounded number of possible things.  Like I say, there's no clear
boundary to what dtc should be doing and what it shouldn't.  Given
what can be done with YAML, we're pretty much being deliberately
incomplete if dtc does anything short of reliably and correctly typing
*every* property, which in turn means knowing (part of) *every*
binding.  I'm not really willing for that to be in scope for dtc.

> compared to all bindings and not one that fits well with binding
> schemas.

Yeah.. from what I've seen of json etc. schemas, the way they work
doesn't really mesh well with the sorts of constraints we have.  But I
don't think a messy split between "first stage" and "second stage"
validation particularly helps with that.
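For reference, the 'foos' plus '#foo-cells' handling in question boils
down to something like the following -- a rough Python sketch of the
idea, not code from dtc or dtschema, with 'cells_for_phandle' as a
made-up helper:

    # Split a flat cell list (e.g. a 'clocks' property) into
    # (phandle, argument cells) groups.  The argument count for each
    # group comes from the *provider* node's '#clock-cells' (or
    # similar), so this can't be done without the rest of the tree.
    def group_specifiers(cells, cells_for_phandle):
        out = []
        i = 0
        while i < len(cells):
            phandle = cells[i]
            nargs = cells_for_phandle(phandle)   # tree-wide lookup
            out.append((phandle, cells[i + 1:i + 1 + nargs]))
            i += 1 + nargs
        return out

    # Two clock specifiers; providers with #clock-cells = 1 and 0:
    print(group_specifiers([0x10, 3, 0x11], {0x10: 1, 0x11: 0}.get))
    # -> [(16, [3]), (17, [])]

The loop itself is trivial; the awkward part is exactly the
cells_for_phandle lookup, i.e. the tree-wide knowledge we're arguing
about.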
> dtc already knows how to parse these properties and we don't
> get new ones frequently. I'm just trying to use the knowledge that's
> already in dtc.

Again, there's a real difference between knowing about some of them in
order to catch the most common mistakes, and *having* to know about
all of them in order to produce correct output.

> I'm a bit worried about doing more in python too, because running
> validation on 1000+ DT files is already ~2 hours. And we're only a
> little over halfway converting bindings to schemas (though that's
> probably a long tail of older and less used bindings).

Heh.  Ok, but there's no reason you couldn't bundle a dtb->yaml
preprocessor written in C (or Rust, or Go) with the rest of the
validation tools.  Then it would be colocated with the rest of the
binding information and could be updated in lockstep.

Or better yet, write a preprocessor that goes direct from dtb to
Python native data types, avoiding the problems with YAML.
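Something along these lines, say -- a rough sketch using pylibfdt (the
Python bindings that ship with dtc), leaving property values as raw
bytes for the schema side to type:

    import libfdt

    def node_to_dict(fdt, offset):
        """Recursively convert one dtb node into nested Python dicts.

        Property values stay as raw bytes; interpreting them is the
        schema tooling's job, not this preprocessor's.  (Name clashes
        between properties and subnodes are ignored for brevity.)
        """
        node = {}
        poff = fdt.first_property_offset(offset, (libfdt.NOTFOUND,))
        while poff >= 0:
            prop = fdt.get_property_by_offset(poff)
            node[prop.name] = bytes(prop)
            poff = fdt.next_property_offset(poff, (libfdt.NOTFOUND,))
        soff = fdt.first_subnode(offset, (libfdt.NOTFOUND,))
        while soff >= 0:
            node[fdt.get_name(soff)] = node_to_dict(fdt, soff)
            soff = fdt.next_subnode(soff, (libfdt.NOTFOUND,))
        return node

    with open('test.dtb', 'rb') as f:
        fdt = libfdt.Fdt(f.read())
    tree = node_to_dict(fdt, fdt.path_offset('/'))

That keeps 64-bit values, strings and blobs byte-exact, and the schema
side can apply its type information without a YAML parser in the
middle.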
[0] https://hitchdev.com/strictyaml/why/implicit-typing-removed/

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson