Em Mon, 24 Feb 2025 16:38:58 -0700 Jonathan Corbet <corbet@xxxxxxx> escreveu: > Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> writes: > > > Maintaining kernel-doc has been a challenge, as there aren't many > > perl developers among maintainers. Also, the logic there is too > > complex. Having lots of global variables and using pure functions > > doesn't help. > > > > Rewrite the script in Python, placing most global variables > > inside classes. This should help maintaining the script in long > > term. > > [...] > > > diff --git a/scripts/kernel-doc.py b/scripts/kernel-doc.py > > new file mode 100755 > > index 000000000000..5cf5ed63f215 > > --- /dev/null > > +++ b/scripts/kernel-doc.py > > @@ -0,0 +1,2757 @@ > > +#!/usr/bin/env python3 > > +# pylint: disable=R0902,R0903,R0904,R0911,R0912,R0913,R0914,R0915,R0917,R1702 > > +# pylint: disable=C0302,C0103,C0301 > > +# pylint: disable=C0116,C0115,W0511,W0613 > > +# Copyright(c) 2025: Mauro Carvalho Chehab <mchehab@xxxxxxxxxx>. > > +# SPDX-License-Identifier: GPL-2.0 > > The SPDX tag is supposed to be up top, right under the shebang I'll move it. > > I also think you should give consideration to preserving the other > copyright notices in the Perl version. A language translation doesn't > remove existing copyrights...who knows how much creativity went into > some of those regexes? Makes sense, but the copyrights at kernel-doc.pl: ## Copyright (c) 1998 Michael Zucchi, All Rights Reserved ## ## Copyright (C) 2000, 1 Tim Waugh <twaugh@xxxxxxxxxx> ## ## Copyright (C) 2001 Simon Huggins ## ## Copyright (C) 2005-2012 Randy Dunlap ## ## Copyright (C) 2012 Dan Luedtke ## ## ## ## #define enhancements by Armin Kuster <akuster@xxxxxxxxxx> ## ## Copyright (c) 2000 MontaVista Software, Inc. ## # # Copyright (C) 2022 Tomasz Warniełło (POD) Also doesn't preserve all copyrights from people that worked hard to maintain it all over those years. A quick check with git log shows 68 different authors touching it (didn't check how trivial/hard were the changes): $ git log --follow --pretty="%an" scripts/kernel-doc.pl|sort|uniq -c|sort -n 1 Alexander A. Klimov 1 Alexander Lobakin 1 Anna-Maria Behnsen 1 Bart Van Assche 1 Chen-Yu Tsai 1 Coco Li 1 Dan Luedtke 1 Donald Hunter 1 Gabriel Krisman Bertazi 1 Greg Kroah-Hartman 1 Harvey Harrison 1 Horia Geanta 1 Jason Gunthorpe 1 Jérémy Bobbio 1 Johannes Weiner 1 Jonathan Cameron 1 Kamil Rytarowski 1 Laurent Pinchart 1 Levin, Alexander (Sasha Levin) 1 Linus Torvalds 1 Lucas De Marchi 1 Mark Rutland 1 Masahiro Yamada 1 Michal Wajdeczko 1 Niklas Söderlund 1 Nishanth Menon 1 Peter Maydell 1 Pierre-Louis Bossart 1 Randy.Dunlap 1 Richard Kennedy 1 Rich Walker 1 Rolf Eike Beer 1 Silvio Fricke 1 Utkarsh Tripathi 1 valdis.kletnieks@xxxxxx 1 Will Deacon 2 Ilya Dryomov 2 Jakub Kicinski 2 Jason Baron 2 Jonathan Neuschäfer 2 Markus Heiser 2 Pavan Kumar Linga 2 Pavel Pisa 2 Sakari Ailus 2 Yacine Belkadi 2 Yujie Liu 3 Akira Yokosawa 3 André Almeida 3 Andy Shevchenko 3 Ben Hutchings 3 Borislav Petkov 3 Conchúr Navid 3 Daniel Santos 3 Danilo Cesar Lemes de Paula 3 Mike Rapoport 4 Daniel Vetter 4 Matthew Wilcox 5 Martin Waitz 5 Paolo Bonzini 6 Aditya Srivastava 6 Vegard Nossum 7 Kees Cook 11 Johannes Berg 11 Tomasz Warniełło 20 Jonathan Corbet 32 Jani Nikula 57 Mauro Carvalho Chehab 65 Randy Dunlap So, picking the latest e-mails from the above authors, maybe we can add some credit lines like these: # Converted from the kernel-doc script originally written in Perl # under GPLv2, copyrighted since 1998 by the following authors: # # Aditya Srivastava <yashsri421@xxxxxxxxx> # Akira Yokosawa <akiyks@xxxxxxxxx> # Alexander A. Klimov <grandmaster@xxxxxxxxxxxx> # Alexander Lobakin <aleksander.lobakin@xxxxxxxxx> # André Almeida <andrealmeid@xxxxxxxxxx> # Andy Shevchenko <andriy.shevchenko@xxxxxxxxxxxxxxx> # Anna-Maria Behnsen <anna-maria@xxxxxxxxxxxxx> # Armin Kuster <akuster@xxxxxxxxxx> # Bart Van Assche <bart.vanassche@xxxxxxxxxxx> # Ben Hutchings <ben@xxxxxxxxxxxxxxx> # Borislav Petkov <bbpetkov@xxxxxxxx> # Chen-Yu Tsai <wenst@xxxxxxxxxxxx> # Coco Li <lixiaoyan@xxxxxxxxxx> # Conchúr Navid <conchur@xxxxxx> # Daniel Santos <daniel.santos@xxxxxxxxx> # Danilo Cesar Lemes de Paula <danilo.cesar@xxxxxxxxxxxxxxx> # Dan Luedtke <mail@xxxxxxxx> # Donald Hunter <donald.hunter@xxxxxxxxx> # Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxxxx> # Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> # Harvey Harrison <harvey.harrison@xxxxxxxxx> # Horia Geanta <horia.geanta@xxxxxxxxxxxxx> # Ilya Dryomov <idryomov@xxxxxxxxx> # Jakub Kicinski <kuba@xxxxxxxxxx> # Jani Nikula <jani.nikula@xxxxxxxxx> # Jason Baron <jbaron@xxxxxxxxxx> # Jason Gunthorpe <jgg@xxxxxxxxxx> # Jérémy Bobbio <lunar@xxxxxxxxxx> # Johannes Berg <johannes.berg@xxxxxxxxx> # Johannes Weiner <hannes@xxxxxxxxxxx> # Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> # Jonathan Corbet <corbet@xxxxxxx> # Jonathan Neuschäfer <j.neuschaefer@xxxxxxx> # Kamil Rytarowski <n54@xxxxxxx> # Kees Cook <kees@xxxxxxxxxx> # Laurent Pinchart <laurent.pinchart@xxxxxxxxxxxxxxxx> # Levin, Alexander (Sasha Levin) <alexander.levin@xxxxxxxxxxx> # Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> # Lucas De Marchi <lucas.demarchi@xxxxxxxxxxxxxx> # Mark Rutland <mark.rutland@xxxxxxx> # Markus Heiser <markus.heiser@xxxxxxxxxxx> # Martin Waitz <tali@xxxxxxxxxxxxxx> # Masahiro Yamada <masahiroy@xxxxxxxxxx> # Matthew Wilcox <willy@xxxxxxxxxxxxx> # Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> # Michal Wajdeczko <michal.wajdeczko@xxxxxxxxx> # Michael Zucchi # Mike Rapoport <rppt@xxxxxxxxxxxxx> # Niklas Söderlund <niklas.soderlund@xxxxxxxxxxxx> # Nishanth Menon <nm@xxxxxx> # Paolo Bonzini <pbonzini@xxxxxxxxxx> # Pavan Kumar Linga <pavan.kumar.linga@xxxxxxxxx> # Pavel Pisa <pisa@xxxxxxxxxxxxxxxx> # Peter Maydell <peter.maydell@xxxxxxxxxx> # Pierre-Louis Bossart <pierre-louis.bossart@xxxxxxxxxxxxxxx> # Randy Dunlap <rdunlap@xxxxxxxxxxxxx> # Richard Kennedy <richard@xxxxxxxxxxxxxxx> # Rich Walker <rw@xxxxxxxxxxxxx> # Rolf Eike Beer <eike-kernel@xxxxxxxxx> # Sakari Ailus <sakari.ailus@xxxxxxxxxxxxxxx> # Silvio Fricke <silvio.fricke@xxxxxxxxx> # Simon Huggins # Tim Waugh <twaugh@xxxxxxxxxx> # Tomasz Warniełło <tomasz.warniello@xxxxxxxxx> # Utkarsh Tripathi <utripathi2002@xxxxxxxxx> # valdis.kletnieks@xxxxxx <valdis.kletnieks@xxxxxx> # Vegard Nossum <vegard.nossum@xxxxxxxxxx> # Will Deacon <will.deacon@xxxxxxx> # Yacine Belkadi <yacine.belkadi.1@xxxxxxxxx> # Yujie Liu <yujie.liu@xxxxxxxxx> Note: unfortunately, two of the original authors didn't send any patches with their names since when we migrated to git. So, I'm placing them without e-mails. On a side note, I would keep the above credits line only at the main kernel-doc.py, as it would be really hard to distribute it along the several kernel-doc classes. > > > +# TODO: implement warning filtering > > + > > +""" > > +kernel_doc > > +========== > > + > > +Print formatted kernel documentation to stdout > > + > > +Read C language source or header FILEs, extract embedded > > +documentation comments, and print formatted documentation > > +to standard output. > > + > > +The documentation comments are identified by the "/**" > > +opening comment mark. > > + > > +See Documentation/doc-guide/kernel-doc.rst for the > > +documentation comment syntax. > > +""" > > + > > +import argparse > > +import logging > > +import os > > +import re > > +import sys > > + > > +from datetime import datetime > > +from pprint import pformat > > + > > +from dateutil import tz > > + > > +# Local cache for regular expressions > > +re_cache = {} > > + > > + > > +class Re: > > So I have to say this bugs me a bit ... the class is fine, but the > one-letter case-only difference from the standard "re" class is just > going to make the code harder for others to approach. "kern_re" or > something like that? Or even "kre" if you really want it to be as short > as possible. A short name close to "re" made easier to convert the script ;-) I'll rename the class to KernRe (*) (*) I opted to follow pylint convention for class names, which is using camel case. As we don't have classes in C, this doesn't strictly conflicts with our Kernel conventions ;-) To make easier, I'll do such change as a separate patch at the end of the series. > > > + """ > > + Helper class to simplify regex declaration and usage, > > + > > + It calls re.compile for a given pattern. It also allows adding > > + regular expressions and define sub at class init time. > > + > > + Regular expressions can be cached via an argument, helping to speedup > > + searches. > > + """ > > [...] > > > + > > +class KernelDoc: > > + # Parser states > > + STATE_NORMAL = 0 # normal code > > + STATE_NAME = 1 # looking for function name > > + STATE_BODY_MAYBE = 2 # body - or maybe more description > > + STATE_BODY = 3 # the body of the comment > > + STATE_BODY_WITH_BLANK_LINE = 4 # the body which has a blank line > > + STATE_PROTO = 5 # scanning prototype > > + STATE_DOCBLOCK = 6 # documentation block > > + STATE_INLINE = 7 # gathering doc outside main block > > + > > + st_name = [ > > + "NORMAL", > > + "NAME", > > + "BODY_MAYBE", > > + "BODY", > > + "BODY_WITH_BLANK_LINE", > > + "PROTO", > > + "DOCBLOCK", > > + "INLINE", > > + ] > > So these ... kind of look like enums? Yes. I considered using Python enum there: https://docs.python.org/3/library/enum.html but it would simply be: from enum import Enum class state(enum): NORMAL = 0 # normal code NAME = 1 # looking for function name BODY_MAYBE = 2 # body - or maybe more description BODY = 3 # the body of the comment BODY_WITH_BLANK_LINE = 4 # the body which has a blank line PROTO = 5 # scanning prototype DOCBLOCK = 6 # documentation block INLINE = 7 # gathering doc outside main block and we would replace: self.STATE_NORMAL ... to: state.NORMAL (and a similar change for inline states as well) Except for an ancillary number-to-string conversion with this syntax: Color = Enum('Color', [('RED', 1), ('GREEN', 2), ('BLUE', 3)]) I didn't see much advantage of using it. See, we only use the string on a single line, when --debug is used to show the per-line state machine. Yet, perhaps we can split the states on a separate class anyway. > That's kind of it for nits ... I do have one wish that will kind of hard > to grant overall ... for the long-term maintenance of this code, it > would be really nice if every non-trivial regex were described by a > comment explaining what it is trying to do. It's not reasonable to > expect that as a condition for accepting this rewrite, but it sure would > be a nice goal to be working toward. Agreed. For the future, the best would be to request a description for every new regex added. For the future, my plan is to split the C source parser (e.g. the code that calls output_declaration() and/or modifies self.entry at the KernelDoc class inside kdoc_parser to a separate file and class. We can then work to implement some regexes to a more lexical way. For instance, all those regexes: (Re(r'\bstruct_group\s*\(([^,]*,)', re.S), r'STRUCT_GROUP('), (Re(r'\bstruct_group_attr\s*\(([^,]*,){2}', re.S), r'STRUCT_GROUP('), (Re(r'\bstruct_group_tagged\s*\(([^,]*),([^,]*),', re.S), r'struct \1 \2; STRUCT_GROUP('), (Re(r'\b__struct_group\s*\(([^,]*,){3}', re.S), r'STRUCT_GROUP('), (Re(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S), r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'), (Re(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^\)]+)\)', re.S), r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'), (Re(r'DECLARE_BITMAP\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'unsigned long \1[BITS_TO_LONGS(\2)]'), (Re(r'DECLARE_HASHTABLE\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'unsigned long \1[1 << ((\2) - 1)]'), (Re(r'DECLARE_KFIFO\s*\(' + args_pattern + r',\s*' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'\2 *\1'), (Re(r'DECLARE_KFIFO_PTR\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'\2 *\1'), (Re(r'(?:__)?DECLARE_FLEX_ARRAY\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'\1 \2[]'), (Re(r'DEFINE_DMA_UNMAP_ADDR\s*\(' + args_pattern + r'\)', re.S), r'dma_addr_t \1'), (Re(r'DEFINE_DMA_UNMAP_LEN\s*\(' + args_pattern + r'\)', re.S), r'__u32 \1'), could be converted on a more lexical representation where the macro name with their arguments could be represented on a clearer expression. Something like: ReplaceMatch("DECLARE_HASHTABLE($1, $2)", "unsigned long $1[1 << (($2) - 1)]") We have already the NestedMatch class that do something similar to that(*) could be improved to do things like that - or we can come up with something new. (*) NestedMatch currently miss one feature: it doesn't split the contents inside parenthesis on multiple match groups. I guess it won't be hard to add such features to it. Thanks, Mauro