Re: [RFC PATCH v1 2/6] kernel-doc: replace kernel-doc perl parser with a pure python one (WIP)

Markus Heiser <markus.heiser@xxxxxxxxxxx> · Fri, 27 Jan 2017 10:46:05 +0100

Am 26.01.2017 um 20:26 schrieb Jani Nikula <jani.nikula@xxxxxxxxx>:

> On Thu, 26 Jan 2017, Jonathan Corbet <corbet@xxxxxxx> wrote:
>> Give me a new kerneldoc that passes those tests, and I'll happily
>> merge it.  (I have some sympathy with the idea that we should look
>> into other parsers, but I would not hold up a new kerneldoc that
>> passed those tests on this basis alone.)
> 
> I'll just note in passing that having another parser that actually works
> for our needs might be a pink unicorn pony. It might exist, it might
> not, and someone would have to put in the hours to try to find it, tame
> it, and bring it to the kernel. But it would be awesome to
> have. Switching to a homebrew Python parser first does not preclude a
> unicorn hunt later.

Here are my experience about parsing C code and kernel-doc comments.

The reg-expressions divide into two parts:

a.) those parsing "C sources", catching up function prototypes, structs etc. and

b.) those parsing "kernel-doc comments", catching up attribute descriptions,
   cross references etc.

When I developed the py-version in my POC I realized that the reg-expressions
parsing C sources (a.) aren't so bad. They have a long history and are well
tested against kernel' sources (As far as I remember, I added only one regexp
more to match function prototypes).

 This was the time where I looked at some other parsing tool and
 after a day I throw away the idea of using a external parser
 tool, first.

Most problems I have had, was parsing the kernel-doc markup itself. E.g. the
ambiguous attribute markup "* @foo: lorem" and its cross-ref "@foo". The latter
syntax is ambiguous, it fails mostly on new-lines and with strings like
"me@xxxxxxx".

When I looked at the whole sources, I also realized that we have two flavors of
kernel-doc markups.

b.1) Those from traditional DocBook where whitespaces aren't markups and

b.2) those which has been rewritten with reST markup in, where whitespaces are a
    part of the reST.

But this was only the half truth of b.2) : the 'new' markup did not only
consists of pure reST markup. For convince it is a mix of kernel-doc markup and
reST markup (e.g. remember the cross-ref mentioned above).

I suppose that we will never completely get rid off traditional (b.1), since
this means; changing the whole kernel source ;)

At that time I wanted to implement a parser which has the ability to handle both
flavors. A (undocumented) 'vintage' mode an the user-documented 'reST' mode.
But what is the criteria to switch from one mode to the other?  For this I made
a primitive assumption: every C source file which is used in a ".. kernel-doc::"
directive has to be marked up with the modern reST flavor.

ATM, the py-version of kernel-doc implements the same state machine as the perl
one and the modes are implemented in the same state machine (not perfect but it
worked for me first, suppose we can make it better).

I remember about a very early discussion we had about those modes and I know
that it doesn't find friends in the community (at that time).  May be today we
have more experience and new ideas.

  I really like to see (to work on) a parser with we can parse the
  whole kernel source and generate reST from.

What do you think, is it a bloody idea?

Thanks!

-- Markus --

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html