Hi Nícolas, Em Tue, 8 Jun 2021 22:43:06 -0300 Nícolas F. R. A. Prado <n@xxxxxxxxxxxxx> escreveu: > Hi, > > a while back Matthew suggested adding automatic markup of known constants, like > NULL and error codes, as literals [1]. This is a great idea since those literals > are very frequently used throughout the documentation and are fixed names, so we > can easily match them without false positives and make the source files less > cluttered. > > Patch 1 adds that feature to automarkup.py, while patch 2 removes the no longer > needed markup from the xarray.rst file which was given as an example. > > Some things I'd like to discuss: > > * As a first draft I added the constants I found on xarray.rst plus all error > codes from errno-base.h and errno.h. We don't need to start with everything, > but we can certainly do better than this. What are common constants that > should be added here? (Matthew mentioned ANSI C or POSIX, some reference to > the constants in those would be great, even more if they are easily parsable) > > * The Literals list added is already a bit big with just the error codes, so > perhaps we should move them to a separate plain text file that is read on > startup? If we're willing to keep a long list, they should be parsed from files, instead of directly included. For things like errno-base.h/errno.h, the better would be if automarkup would read them directly from the header file, as otherwise there will be a maintainance burden. Yet, a regex like "-E[A-Z\d]+\b" would likely get all cases, but this will produce false positives, like EXAMPLE, EXAMPLES and some other non-trivial exceptions. With regards to NULL, I would just use a trivial regex like: "\bNULL\b" (but see below my notes about using "\b"). > * There was also mention of automatically converting uppercase words to > literals. I'm not so sure about that one so I left it out for now. > > The example given was XA_MARK_0, which is a #define in xarray.h. The way > to go here is certainly to first use kernel-doc to get it included in the > documentation, and then cross-reference to it. FWICT at the moment kernel-doc > for defines should be done like for functions (but the parenthesis can be > omitted, although they show up in the output): > > /** > * XA_MARK_0 - Brief description. > * > * Description of the type. > */ > > At the current state, the cross-reference would then need to be done either > through plain :c:macro:`XA_MARK_0`, or, by relying on automarkup, typedef > XA_MARK_0 (which is not ideal since this isn't a typedef). Options for > improvements would be to add 'macro' as a possible prefix (so eg. macro > XA_MARK_0), or go all out and try to cross-reference on every uppercase word > like suggested. We should strive for what is the most natural to write, as > long as the regex isn't too broad. > > Since automarkup.py is opt-out rather than opt-in, I think we should be extra > careful not to make the regexes too broad, even if in this case it relies on a > C symbol being present. > > One other idea I had was that we could limit to all uppercase words as > long as it has an _ on it, since I don't expect we would get false positives > with that. The downside is that it might be very confusing to people why their > MACRONAME isn't being automatically cross-referenced to... What it can be done would be to first check/apply cross-references. Only if it fails, then replace everything in uppercase to literals. There is a drawback, tough: this would cause texts in uppercase. We used to have lots of them before ReST conversion. I won't doubt that some files would have kept a few of them. So, in this case, the regex would need to require at least one _, e. g. something like: \b[\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*\b Again, tests are needed in order to check if the regex won't get anything that shouldn't be caught. So, I would grep the source codes in order to check if the regexes won't bring false positives, e. g.: $ git grep -E "\b[\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*\b" Documentation/ Btw, the "\b" at the end won't actually work, due to things like: GLOBAL_GEN_STORAGE{0:3} GAUDI_ENGINE_ID_* If you take a look at the scripts/get_abi.pl that I wrote, I faced the same problem there with \b. So, I ended implementing my own set of \b: my $start = qr {(^|\s|\() }x; my $bondary = qr { ([,.:;\)\s]|\z) }x; Using the above replacements for \b, I would do something like this to double-check if the regex won't be producing false-positives: $ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$1 \e[0;31m$2\e[0;37m $3\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)/' $i; done Btw, on a quick look, this specific regex: (^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z) Seems to be producing just one false positive: a sequence of _, e. g.: \b\_+\b While we could make the regex more complex, I would just test if the second match is '^\_+$', skipping it if found. On other words, a python-equivalent to this: $ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -vE "^\_+$"; done|sort|uniq - That's said, it is worth to also mention the false negatives: $ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -vE "^\_+$"; done|sort|uniq >OK_list $ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -vE " "; done|sort|uniq >FULL_list $ grep -v -f OK_list FULL_list >excluded_list $ wc -l OK_list FULL_list excluded_list 7398 OK_list 14487 FULL_list 7520 excluded_list 29405 total There are a lot of constants that won't be parsed if we require at least one '_'. Looking at the excluded_list, indeed there are things there which should be excluded, like EXAMPLE, EXAMPLES, EXCEPTION, WITH, WITHIN, WITHOUT..., but also there are a several matches that are constants, like KERNELRELEASE, DISCONTIGMEM, SIOCGIFINDEX, SIGALRM, SETREGSET, CDROMREADCOOKED, CDROMREADRAW, ... I doubt that there would be an easy way to handle such cases, as a file with a ~7000 constants would be a maintenance nightmare. --- In summary, my suggestion is that we should stay out of having a big list of constants. So, I would start with: 1. a regex to get FOO_BAR cases, like: (^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z) with a test to exclude this pattern: ^\_+$ 2. a logic that will pick and use errno codes from the errno*.h files; 3. an specific handler handler for the NULL special case. This would get 7000+ different constants, which seems a very good start. If needed, later we could get a few other relevant constants by checking the most-used ones with something like: for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -v "_"; done|sort|uniq -c|sort -n in order to grab the most common ones. Regards, Mauro