On Mon, 05 Feb 2024, Jonathan Corbet <corbet@xxxxxxx> wrote:
> Sakari Ailus <sakari.ailus@xxxxxxxxxxxxxxx> writes:
>
>>> Sigh ... seeing more indecipherable regexes added to kernel-doc is like
>>> seeing another load of plastic bags dumped into the ocean... it doesn't
>>> change the basic situation, but it's still sad.
>>>
>>> Oh well, applied, thanks.
>>
>> Thanks. I have to say I feel the same...
>>
>> Regexes aren't great for parsing C, that's for sure. :-I But what are the
>> options? Write a proper parser for (a subset of) C?
>
> Every now and then I've pondered on this a bit. There are parsers out
> there, of course; we could consider using something like tree-sitter.
> There's just two little problems:
>
>  - That's a massive dependency to drag into the docs build that seems
>    unlikely to speed things up.
>
>  - kernel-doc is really two parsers - one for C code, one for the
>    comment syntax. Strangely, nobody has written a grammar for this
>    combination.
>
> A suitably motivated developer could probably create a C+kerneldoc
> grammar that would let us make a rock-solid, tree-sitter-based parser
> that would be mostly maintained by somebody else. But that doesn't get
> us around the "adding a big dependency" problem.

After we'd made kernel-doc the Perl script to produce rst, and
kernel-doc the Sphinx extension to consume it, I pondered the same
questions, and wondered what it should all look like if you could just
ignore all the kernel legacy.

I've told the story before, but what I ended up with was:

- Use Python bindings for libclang to parse the source code. Clang is
  obviously a big dependency, but nowadays more people have it already
  installed, and the Python part on top is negligible.

- Don't parse the contents of the comments, at all. Treat it as pure
  rst, and let Sphinx handle it.

That's pretty much how Hawkmoth [1] got started.
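
To illustrate the "treat it as pure rst" idea, here is a minimal sketch.
It is not Hawkmoth's actual code: the function name comment_to_rst() and
the sample comment are made up for illustration, and it assumes a
conventional /** ... */ block with " * " line prefixes. It only peels off
the comment markers and leaves the body untouched for Sphinx:

```python
import re


def comment_to_rst(comment: str) -> str:
    """Strip C comment markers from a /** ... */ block, keeping the body as-is."""
    body = comment.strip()
    # Drop the opening "/**" and the closing "*/".
    body = re.sub(r"^/\*\*", "", body)
    body = re.sub(r"\*/$", "", body)
    lines = []
    for line in body.splitlines():
        # Remove the leading " * " decoration (one space after the star at
        # most), so any further indentation in the rst body is preserved.
        lines.append(re.sub(r"^[ \t]*\* ?", "", line).rstrip())
    # Trim blank lines at both ends; everything else is left for Sphinx.
    while lines and not lines[0]:
        lines.pop(0)
    while lines and not lines[-1]:
        lines.pop()
    return "\n".join(lines)


# Hypothetical documentation comment, written directly in rst.
example = """/**
 * Frob the widget.
 *
 * :param w: The widget to frob.
 * :return: 0 on success.
 */"""

print(comment_to_rst(example))
```

Nothing in the sketch looks at what the comment says; matching the
comment to a declaration would be the job of the C parser, not of this
step.
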
I never even considered it for the kernel, because it would've been:

> <back to work now...>

Although Mesa now uses it to produce stuff like [2].

A suitably motivated developer could probably get it to work with the
kernel... Nowadays you could use Sphinx mechanisms to extend it to
convert kernel-doc style comments to rst. There are a number of issues
that might make it difficult, though:

- kernel-doc parses extra magic stuff like EXPORT_SYMBOL().

- all the special casing in kernel-doc dump_struct(), like

    $members =~ s/\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;/$2/gos;

- it's a compiler, so you'll need to pass suitable compiler options,
  which might be difficult with all the per-directory kbuild magic

- might end up being slow, because it's a compiler (although there's
  some caching to avoid parsing the same file multiple times like
  kernel-doc currently does)

Anyway, I think it would be important to separate the parsing of C and
parsing of comments. It's kind of in the same bag in kernel-doc. But if
you want to cross-check, say, the parameters/members against the
documentation, you'll need the C AST while parsing the comments. And the
preprocessor tricks employed in the kernel are probably going to be a
nightmare.

What I'm saying is, while Hawkmoth is perhaps not the right solution,
using any generic C parser will face some of the same issues regardless.

BR,
Jani.

[1] https://github.com/jnikula/hawkmoth/
[2] https://docs.mesa3d.org/isl/index.html

-- 
Jani Nikula, Intel