Em Mon, 10 Feb 2025 17:03:54 +0100 Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> escreveu: > Em Mon, 10 Feb 2025 07:40:02 -0700 > Jonathan Corbet <corbet@xxxxxxx> escreveu: > > > Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> writes: > > > > > I took a look on Markus work: it was licensed under GPL 3.0 and it was > > > written in 2016. There were several changes on kerneldoc since them, > > > including the addition of a regex that it is not compatible with > > > Python re[1]: > > > > > > $members =~ s/\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;/$2/gos; > > > > > > This one use: > > > > > > - recursive patterns: ?1 > > > - atomic grouping (?>...) > > > > > > Also, it is hard to map what he does with the existing script. I'm > > > opting to write a new script from scratch. > > > > That's fine, I just wanted to be sure you'd had a chance to look at > > it... > > > > > Another option would be to re-implement such regexes without using > > > such advanced patterns. > > > > Seems like a preferred option if that can be done. Banging one's head > > against all those regexes is often the hardest part of dealing with that > > script; anything that makes it simpler is welcome. > > Agreed. This one, in special, is very hard for me to understand, as I > never used recursive patterns or atomic grouping. The net result of > the struct_group*() handling is that it removes some parameters when > generating the function prototype. This is done using a complex logic > on two steps: > > # unwrap struct_group(): > # - first eat non-declaration parameters and rewrite for final match > # - then remove macro, outer parens, and trailing semicolon > $members =~ s/\bstruct_group\s*\(([^,]*,)/STRUCT_GROUP(/gos; > $members =~ s/\bstruct_group_attr\s*\(([^,]*,){2}/STRUCT_GROUP(/gos; > $members =~ s/\bstruct_group_tagged\s*\(([^,]*),([^,]*),/struct $1 $2; STRUCT_GROUP(/gos; > $members =~ s/\b__struct_group\s*\(([^,]*,){3}/STRUCT_GROUP(/gos; > $members =~ s/\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;/$2/gos; > > > The first step basically eliminates some members of the function. At the > places I checked, the second step was just removing parenthesis from the > macro (and the STRUCT_GROUP name). > > I suspect that the same result could be done with a much simpler expression > like: > > $members =~ s/\bSTRUCT_GROUP\((.*)\)[^;]*;/$2/gos; > > But maybe there are some corner cases that would cause such simpler > regex to fail. A simpler regex didn't work. Instead of using Python regex module and keep such complex regular expressions, I opted to implement it on a completely different way. See patch below. Basically, I implemented a new class (NestedMatch) just to handle patterns like: FOO(.*) converting them into: .* I expect that the new logic there would work properly for any number of [], {} and () paired delimiters. I ended writing such class from scratch, based on a suggestion from stackoverflow. Using such logic, replacing STRUCT_GROUP(.*) by .* is as simple as: sub_nested_prefixes = [ (re.compile(r'\bSTRUCT_GROUP'), r'\1'), ] nested = NestedMatch() for search, sub in sub_nested_prefixes: members = nested.sub(search, sub, members) This should probably help cleaning up other similar patterns. Let's see. Anyway, I expect that this would make the C parsing part easier to maintain. For the first version, I'll use the new way only when I notice discrepancies between Perl and Python versions. Thanks, Mauro --- [PATCH] scripts/kernel-doc.py: properly handle struct_group macros Handing nested parenthesis with regular expressions is not an easy task. It is even harder with Python's re module, as it has a limited subset of regular expressions, missing more advanced features. We might use instead Python regex module, but still the regular expressions are very hard to understand. So, instead, add a logic to properly match delimiters. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> diff --git a/scripts/kernel-doc.py b/scripts/kernel-doc.py index 9b10fd05b6af..df824692c4e7 100755 --- a/scripts/kernel-doc.py +++ b/scripts/kernel-doc.py @@ -92,6 +92,138 @@ class Re: def group(self, num): return self.last_match.group(num) +class NestedMatch: + """ + Finding nested delimiters is hard with regular expressions. It is + even harder on Python with its normal re module, as there are several + advanced regular expressions that are missing. + + This is the case of this pattern: + + '\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;' + + which is used to properly match open/close parenthesis of the + string search STRUCT_GROUP(), + + Add a class that counts pairs of delimiters, using it to match and + replace nested expressions. + + The original approach was suggested by: + https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex + + Although I re-implemented it to make it more generic and match 3 types + of delimiters. The logic checks if delimiters are paired. If not, it + will ignore the search string. + """ + + DELIMITER_PAIRS = { + '{': '}', + '(': ')', + '[': ']', + } + + RE_DELIM = re.compile(r'[\{\}\[\]\(\)]') + + def _search(self, regex, line): + """ + Finds paired blocks. + + If a given regex is not empty, the logic will first seek for its + position, and then handles the next delimiters afterwards + + The suggestion of using finditer to match pairs came from: + https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex + but I ended using a different implementation to align all three types + of delimiters and seek for an initial regular expression. + + The algorithm seeks for open/close paired delimiters and place them + into a stack, yielding a start/stop position of each match when the + stack is zeroed. + + The algorithm shoud work fine for properly paired lines, but will + silently ignore end delimiters that preceeds an start delimiter. + This should be OK for kernel-doc parser, as unaligned delimiters + would cause compilation errors. So, we don't need to rise exceptions + to cover such issues. + """ + + stack = [] + + for match_re in regex.finditer(line): + offset = match_re.end() + + for match in self.RE_DELIM.finditer(line[offset:]): + pos = match.start() + offset + + d = line[pos] + + if d in self.DELIMITER_PAIRS.keys(): + end = self.DELIMITER_PAIRS[d] + + stack.append(end) + continue + + # Does the end delimiter match what it is expected? + if stack and d == stack[-1]: + stack.pop() + + if not stack: + yield offset, pos + 1 + + def search(self, regex, line): + """ + This is similar to re.search: + + It matches a regex that it is followed by a delimiter, + returning occurrences only if all delimiters are paired. + """ + + for prev_pos, pos in self._search(regex, line): + + yield line[prev_pos:pos] + + def sub(self, regex, sub, line, count=0): + """ + This is similar to re.sub: + + It matches a regex that it is followed by a delimiter, + replacing occurrences only if all delimiters are paired. + + if r'\1' is used, it works just like re: it places there the + matched paired data with the delimiter stripped. + + If count is different than zero, it will replace at most count + items. + """ + out = "" + + cur_pos = 0 + n = 0 + + for start, end in self._search(regex, line): + out += line[cur_pos:start] + + # Value, ignoring start/end delimiters + value = line[start + 1:end - 1] + + # replaces \1 at the sub string, if \1 is used there + new_sub = sub + new_sub = new_sub.replace(r'\1', value) + + out += new_sub + + cur_pos = end + n += 1 + + if count and count >= n: + break + + # Append the remaining string + l = len(line) + out += line[cur_pos:l] + + return out + # # Regular expressions used to parse kernel-doc markups at KernelDoc class. # @@ -667,19 +799,40 @@ class KernelDoc: # Unwrap struct_group() based on this definition: # __struct_group(TAG, NAME, ATTRS, MEMBERS...) # which has variants like: struct_group(NAME, MEMBERS...) + # + # Parsing them are done on two steps: + # 1. drop arguments that aren't struct names; + # 2. remove STRUCT_GROUP() ancillary macro. + # + # NOTE: the original logic which replaces STRUCT_GROUP(.*) by .* + # is incompatible with Python re, as it uses: + # + # - a recursive pattern: (?1) + # - an atomic grouping: (?>...) + # + # This is the original expression: + # '\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;' + # + # I tried a simpler version: but it didn't work either: + # \bSTRUCT_GROUP\(([^\)]+)\)[^;]*; + # + # As it doesn't properly match the end parenthesis. + # + # So, a better solution was crafted: there's now a NestedMatch + # class that ensures that delimiters after a search are properly + # matched. So, the implementation to drop STRUCT_GROUP() will be + # handled in separate. (Re(r'\bstruct_group\s*\(([^,]*,)', re.S), r'STRUCT_GROUP('), (Re(r'\bstruct_group_attr\s*\(([^,]*,){2}', re.S), r'STRUCT_GROUP('), (Re(r'\bstruct_group_tagged\s*\(([^,]*),([^,]*),', re.S), r'struct \1 \2; STRUCT_GROUP('), (Re(r'\b__struct_group\s*\(([^,]*,){3}', re.S), r'STRUCT_GROUP('), - # This is incompatible with Python re, as it uses: - # recursive patterns ((?1)) and atomic grouping ((?>...)): - # '\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;' - # Let's see if this works instead: - (Re(r'\bSTRUCT_GROUP\(([^\)]+)\)[^;]*;', re.S), r'\1'), - # Replace macros + # + # TODO: it is better to also move those to the NestedMatch logic, + # to endure that parenthesis will be properly matched. + (Re(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S), r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'), (Re(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^\)]+)\)', re.S), r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'), (Re(r'DECLARE_BITMAP\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'unsigned long \1[BITS_TO_LONGS(\2)]'), @@ -691,9 +844,18 @@ class KernelDoc: (Re(r'DEFINE_DMA_UNMAP_LEN\s*\(' + args_pattern + r'\)', re.S), r'__u32 \1'), ] + sub_nested_prefixes = [ + (re.compile(r'\bSTRUCT_GROUP'), r'\1'), + ] + for search, sub in sub_prefixes: members = search.sub(sub, members) + nested = NestedMatch() + + for search, sub in sub_nested_prefixes: + members = nested.sub(search, sub, members) + # Keeps the original declaration as-is declaration = members