Re: [PATCH v2 09/39] scripts/kernel-doc.py: add a Python parser

Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> · Tue, 25 Feb 2025 08:38:14 +0100

Em Mon, 24 Feb 2025 16:38:58 -0700
Jonathan Corbet <corbet@xxxxxxx> escreveu:

> Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> writes:
> 
> > Maintaining kernel-doc has been a challenge, as there aren't many
> > perl developers among maintainers. Also, the logic there is too
> > complex. Having lots of global variables and using pure functions
> > doesn't help.
> >
> > Rewrite the script in Python, placing most global variables
> > inside classes. This should help maintaining the script in long
> > term.  
> 
> [...]
> 
> > diff --git a/scripts/kernel-doc.py b/scripts/kernel-doc.py
> > new file mode 100755
> > index 000000000000..5cf5ed63f215
> > --- /dev/null
> > +++ b/scripts/kernel-doc.py
> > @@ -0,0 +1,2757 @@
> > +#!/usr/bin/env python3
> > +# pylint: disable=R0902,R0903,R0904,R0911,R0912,R0913,R0914,R0915,R0917,R1702
> > +# pylint: disable=C0302,C0103,C0301
> > +# pylint: disable=C0116,C0115,W0511,W0613
> > +# Copyright(c) 2025: Mauro Carvalho Chehab <mchehab@xxxxxxxxxx>.
> > +# SPDX-License-Identifier: GPL-2.0  
> 
> The SPDX tag is supposed to be up top, right under the shebang

I'll move it. 

> 
> I also think you should give consideration to preserving the other
> copyright notices in the Perl version.  A language translation doesn't
> remove existing copyrights...who knows how much creativity went into
> some of those regexes?

Makes sense, but the copyrights at kernel-doc.pl:

	## Copyright (c) 1998 Michael Zucchi, All Rights Reserved        ##
	## Copyright (C) 2000, 1  Tim Waugh <twaugh@xxxxxxxxxx>          ##
	## Copyright (C) 2001  Simon Huggins                             ##
	## Copyright (C) 2005-2012  Randy Dunlap                         ##
	## Copyright (C) 2012  Dan Luedtke                               ##
	##                                                               ##
	## #define enhancements by Armin Kuster <akuster@xxxxxxxxxx>     ##
	## Copyright (c) 2000 MontaVista Software, Inc.                  ##
	#
	# Copyright (C) 2022 Tomasz Warniełło (POD)

Also doesn't preserve all copyrights from people that worked hard to
maintain it all over those years.

A quick check with git log shows 68 different authors touching it
(didn't check how trivial/hard were the changes):

$ git log --follow --pretty="%an" scripts/kernel-doc.pl|sort|uniq -c|sort -n
      1 Alexander A. Klimov
      1 Alexander Lobakin
      1 Anna-Maria Behnsen
      1 Bart Van Assche
      1 Chen-Yu Tsai
      1 Coco Li
      1 Dan Luedtke
      1 Donald Hunter
      1 Gabriel Krisman Bertazi
      1 Greg Kroah-Hartman
      1 Harvey Harrison
      1 Horia Geanta
      1 Jason Gunthorpe
      1 Jérémy Bobbio
      1 Johannes Weiner
      1 Jonathan Cameron
      1 Kamil Rytarowski
      1 Laurent Pinchart
      1 Levin, Alexander (Sasha Levin)
      1 Linus Torvalds
      1 Lucas De Marchi
      1 Mark Rutland
      1 Masahiro Yamada
      1 Michal Wajdeczko
      1 Niklas Söderlund
      1 Nishanth Menon
      1 Peter Maydell
      1 Pierre-Louis Bossart
      1 Randy.Dunlap
      1 Richard Kennedy
      1 Rich Walker
      1 Rolf Eike Beer
      1 Silvio Fricke
      1 Utkarsh Tripathi
      1 valdis.kletnieks@xxxxxx
      1 Will Deacon
      2 Ilya Dryomov
      2 Jakub Kicinski
      2 Jason Baron
      2 Jonathan Neuschäfer
      2 Markus Heiser
      2 Pavan Kumar Linga
      2 Pavel Pisa
      2 Sakari Ailus
      2 Yacine Belkadi
      2 Yujie Liu
      3 Akira Yokosawa
      3 André Almeida
      3 Andy Shevchenko
      3 Ben Hutchings
      3 Borislav Petkov
      3 Conchúr Navid
      3 Daniel Santos
      3 Danilo Cesar Lemes de Paula
      3 Mike Rapoport
      4 Daniel Vetter
      4 Matthew Wilcox
      5 Martin Waitz
      5 Paolo Bonzini
      6 Aditya Srivastava
      6 Vegard Nossum
      7 Kees Cook
     11 Johannes Berg
     11 Tomasz Warniełło
     20 Jonathan Corbet
     32 Jani Nikula
     57 Mauro Carvalho Chehab
     65 Randy Dunlap

So, picking the latest e-mails from the above authors, maybe we can add
some credit lines like these:

# Converted from the kernel-doc script originally written in Perl
# under GPLv2, copyrighted since 1998 by the following authors:
#
#    Aditya Srivastava <yashsri421@xxxxxxxxx>
#    Akira Yokosawa <akiyks@xxxxxxxxx>
#    Alexander A. Klimov <grandmaster@xxxxxxxxxxxx>
#    Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>
#    André Almeida <andrealmeid@xxxxxxxxxx>
#    Andy Shevchenko <andriy.shevchenko@xxxxxxxxxxxxxxx>
#    Anna-Maria Behnsen <anna-maria@xxxxxxxxxxxxx>
#    Armin Kuster <akuster@xxxxxxxxxx>
#    Bart Van Assche <bart.vanassche@xxxxxxxxxxx>
#    Ben Hutchings <ben@xxxxxxxxxxxxxxx>
#    Borislav Petkov <bbpetkov@xxxxxxxx>
#    Chen-Yu Tsai <wenst@xxxxxxxxxxxx>
#    Coco Li <lixiaoyan@xxxxxxxxxx>
#    Conchúr Navid <conchur@xxxxxx>
#    Daniel Santos <daniel.santos@xxxxxxxxx>
#    Danilo Cesar Lemes de Paula <danilo.cesar@xxxxxxxxxxxxxxx>
#    Dan Luedtke <mail@xxxxxxxx>
#    Donald Hunter <donald.hunter@xxxxxxxxx>
#    Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxxxx>
#    Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
#    Harvey Harrison <harvey.harrison@xxxxxxxxx>
#    Horia Geanta <horia.geanta@xxxxxxxxxxxxx>
#    Ilya Dryomov <idryomov@xxxxxxxxx>
#    Jakub Kicinski <kuba@xxxxxxxxxx>
#    Jani Nikula <jani.nikula@xxxxxxxxx>
#    Jason Baron <jbaron@xxxxxxxxxx>
#    Jason Gunthorpe <jgg@xxxxxxxxxx>
#    Jérémy Bobbio <lunar@xxxxxxxxxx>
#    Johannes Berg <johannes.berg@xxxxxxxxx>
#    Johannes Weiner <hannes@xxxxxxxxxxx>
#    Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
#    Jonathan Corbet <corbet@xxxxxxx>
#    Jonathan Neuschäfer <j.neuschaefer@xxxxxxx>
#    Kamil Rytarowski <n54@xxxxxxx>
#    Kees Cook <kees@xxxxxxxxxx>
#    Laurent Pinchart <laurent.pinchart@xxxxxxxxxxxxxxxx>
#    Levin, Alexander (Sasha Levin) <alexander.levin@xxxxxxxxxxx>
#    Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
#    Lucas De Marchi <lucas.demarchi@xxxxxxxxxxxxxx>
#    Mark Rutland <mark.rutland@xxxxxxx>
#    Markus Heiser <markus.heiser@xxxxxxxxxxx>
#    Martin Waitz <tali@xxxxxxxxxxxxxx>
#    Masahiro Yamada <masahiroy@xxxxxxxxxx>
#    Matthew Wilcox <willy@xxxxxxxxxxxxx>
#    Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
#    Michal Wajdeczko <michal.wajdeczko@xxxxxxxxx>
#    Michael Zucchi
#    Mike Rapoport <rppt@xxxxxxxxxxxxx>
#    Niklas Söderlund <niklas.soderlund@xxxxxxxxxxxx>
#    Nishanth Menon <nm@xxxxxx>
#    Paolo Bonzini <pbonzini@xxxxxxxxxx>
#    Pavan Kumar Linga <pavan.kumar.linga@xxxxxxxxx>
#    Pavel Pisa <pisa@xxxxxxxxxxxxxxxx>
#    Peter Maydell <peter.maydell@xxxxxxxxxx>
#    Pierre-Louis Bossart <pierre-louis.bossart@xxxxxxxxxxxxxxx>
#    Randy Dunlap <rdunlap@xxxxxxxxxxxxx>
#    Richard Kennedy <richard@xxxxxxxxxxxxxxx>
#    Rich Walker <rw@xxxxxxxxxxxxx>
#    Rolf Eike Beer <eike-kernel@xxxxxxxxx>
#    Sakari Ailus <sakari.ailus@xxxxxxxxxxxxxxx>
#    Silvio Fricke <silvio.fricke@xxxxxxxxx>
#    Simon Huggins
#    Tim Waugh <twaugh@xxxxxxxxxx> 
#    Tomasz Warniełło <tomasz.warniello@xxxxxxxxx>
#    Utkarsh Tripathi <utripathi2002@xxxxxxxxx>
#    valdis.kletnieks@xxxxxx <valdis.kletnieks@xxxxxx>
#    Vegard Nossum <vegard.nossum@xxxxxxxxxx>
#    Will Deacon <will.deacon@xxxxxxx>
#    Yacine Belkadi <yacine.belkadi.1@xxxxxxxxx>
#    Yujie Liu <yujie.liu@xxxxxxxxx>

Note: unfortunately, two of the original authors didn't send any patches
with their names since when we migrated to git. So, I'm placing them
without e-mails.

On a side note, I would keep the above credits line only at the main
kernel-doc.py, as it would be really hard to distribute it along
the several kernel-doc classes.

> 
> > +# TODO: implement warning filtering
> > +
> > +"""
> > +kernel_doc
> > +==========
> > +
> > +Print formatted kernel documentation to stdout
> > +
> > +Read C language source or header FILEs, extract embedded
> > +documentation comments, and print formatted documentation
> > +to standard output.
> > +
> > +The documentation comments are identified by the "/**"
> > +opening comment mark.
> > +
> > +See Documentation/doc-guide/kernel-doc.rst for the
> > +documentation comment syntax.
> > +"""
> > +
> > +import argparse
> > +import logging
> > +import os
> > +import re
> > +import sys
> > +
> > +from datetime import datetime
> > +from pprint import pformat
> > +
> > +from dateutil import tz
> > +
> > +# Local cache for regular expressions
> > +re_cache = {}
> > +
> > +
> > +class Re:  
> 
> So I have to say this bugs me a bit ... the class is fine, but the
> one-letter case-only difference from the standard "re" class is just
> going to make the code harder for others to approach.  "kern_re" or
> something like that?  Or even "kre" if you really want it to be as short
> as possible.

A short name close to "re" made easier to convert the script ;-)

I'll rename the class to KernRe (*)

(*) I opted to follow pylint convention for class names, which
    is using camel case. As we don't have classes in C, this
    doesn't strictly conflicts with our Kernel conventions ;-)

To make easier, I'll do such change as a separate patch at the end
of the series.

> 
> > +    """
> > +    Helper class to simplify regex declaration and usage,
> > +
> > +    It calls re.compile for a given pattern. It also allows adding
> > +    regular expressions and define sub at class init time.
> > +
> > +    Regular expressions can be cached via an argument, helping to speedup
> > +    searches.
> > +    """  
> 
> [...]
> 
> > +
> > +class KernelDoc:
> > +    # Parser states
> > +    STATE_NORMAL        = 0        # normal code
> > +    STATE_NAME          = 1        # looking for function name
> > +    STATE_BODY_MAYBE    = 2        # body - or maybe more description
> > +    STATE_BODY          = 3        # the body of the comment
> > +    STATE_BODY_WITH_BLANK_LINE = 4 # the body which has a blank line
> > +    STATE_PROTO         = 5        # scanning prototype
> > +    STATE_DOCBLOCK      = 6        # documentation block
> > +    STATE_INLINE        = 7        # gathering doc outside main block
> > +
> > +    st_name = [
> > +        "NORMAL",
> > +        "NAME",
> > +        "BODY_MAYBE",
> > +        "BODY",
> > +        "BODY_WITH_BLANK_LINE",
> > +        "PROTO",
> > +        "DOCBLOCK",
> > +        "INLINE",
> > +    ]  
> 
> So these ... kind of look like enums?

Yes. I considered using Python enum there:
	https://docs.python.org/3/library/enum.html

but it would simply be:

	from enum import Enum

	class state(enum):
	    NORMAL        = 0        # normal code
	    NAME          = 1        # looking for function name
	    BODY_MAYBE    = 2        # body - or maybe more description
	    BODY          = 3        # the body of the comment
	    BODY_WITH_BLANK_LINE = 4 # the body which has a blank line
	    PROTO         = 5        # scanning prototype
	    DOCBLOCK      = 6        # documentation block
	    INLINE        = 7        # gathering doc outside main block

and we would replace:

	self.STATE_NORMAL
	...

to:
	state.NORMAL

(and a similar change for inline states as well)

Except for an ancillary number-to-string conversion with this syntax:

	Color = Enum('Color', [('RED', 1), ('GREEN', 2), ('BLUE', 3)])

I didn't see much advantage of using it. See, we only use the string
on a single line, when --debug is used to show the per-line state machine.

Yet, perhaps we can split the states on a separate class anyway.

> That's kind of it for nits ... I do have one wish that will kind of hard
> to grant overall ... for the long-term maintenance of this code, it
> would be really nice if every non-trivial regex were described by a
> comment explaining what it is trying to do.  It's not reasonable to
> expect that as a condition for accepting this rewrite, but it sure would
> be a nice goal to be working toward.

Agreed. For the future, the best would be to request a description for
every new regex added.

For the future, my plan is to split the C source parser (e.g. the code that
calls output_declaration() and/or modifies self.entry at the KernelDoc
class inside kdoc_parser to a separate file and class. We can then work
to implement some regexes to a more lexical way. For instance, all those
regexes:

            (Re(r'\bstruct_group\s*\(([^,]*,)', re.S), r'STRUCT_GROUP('),
            (Re(r'\bstruct_group_attr\s*\(([^,]*,){2}', re.S), r'STRUCT_GROUP('),
            (Re(r'\bstruct_group_tagged\s*\(([^,]*),([^,]*),', re.S), r'struct \1 \2; STRUCT_GROUP('),
            (Re(r'\b__struct_group\s*\(([^,]*,){3}', re.S), r'STRUCT_GROUP('),

            (Re(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S), r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
            (Re(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^\)]+)\)', re.S), r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
            (Re(r'DECLARE_BITMAP\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'unsigned long \1[BITS_TO_LONGS(\2)]'),
            (Re(r'DECLARE_HASHTABLE\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'unsigned long \1[1 << ((\2) - 1)]'),
            (Re(r'DECLARE_KFIFO\s*\(' + args_pattern + r',\s*' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'\2 *\1'),
            (Re(r'DECLARE_KFIFO_PTR\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'\2 *\1'),
            (Re(r'(?:__)?DECLARE_FLEX_ARRAY\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'\1 \2[]'),
            (Re(r'DEFINE_DMA_UNMAP_ADDR\s*\(' + args_pattern + r'\)', re.S), r'dma_addr_t \1'),
            (Re(r'DEFINE_DMA_UNMAP_LEN\s*\(' + args_pattern + r'\)', re.S), r'__u32 \1'),

could be converted on a more lexical representation where the macro
name with their arguments could be represented on a clearer expression.

Something like:

	ReplaceMatch("DECLARE_HASHTABLE($1, $2)", "unsigned long $1[1 << (($2) - 1)]")

We have already the NestedMatch class that do something similar to
that(*) could be improved to do things like that - or we can come up with
something new.

(*) NestedMatch currently miss one feature: it doesn't split the contents
    inside parenthesis on multiple match groups. I guess it won't be hard to
    add such features to it.

Thanks,
Mauro