Re: [PATCH v2 0/5] Fun with cpp word regex

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Sat, 09 Oct 2021 02:00:12 +0200

On Sat, Oct 09 2021, Johannes Sixt wrote:

> Am 08.10.21 um 22:07 schrieb Ævar Arnfjörð Bjarmason:

[re-arranged a bit]

> [...]As far as tokenization is concerned, C
> is a subset of C++. I don't think we need to separate the drivers.
>
>>  * I found myself back-porting some of your tests (manually mostly),
>>    maybe you disagree, but in cases like 123'123, <=> etc. I'd find it
>>    easier to follow if we first added the test data, and then the
>>    changed behavior.
>> 
>>    Because after all, we're going to change how we highlight existing
>>    data, so testing for that would be informative.
>
> Good point. I'll work a bit more on that.

Great!

>>  * I wonder if it isn't time to split up "cpp" into a "c" driver,
>>    e.g. git.git's .gitattributes has "cpp" for *.[ch] files, but as C++
>>    adds more syntax sugar.
>> 
>>    So e.g. if you use "<=>" after this series we'll tokenize it
>>    differently in *.c files, but it's a C++-only operator, on the other
>>    hand probably nobody cares that much...
>
> Yes, it is that: <=> won't appear in a correct C file (outside of
> comments), so no-one will care. [...]

..mmm..

>>  * This pre-dates your much improved tests, but these test files could
>>    really use some test comments, as in:
>> 
>>    /* Now that we're going to understand the "'" character somehow, will any of this change? */
>>    /* We haven't written code like this since the 1960's ... */
>>    /* Run & free */
>> 
>>    I.e. we don't just highlight code the compiler likes to eat, but also
>>    comments. So particularly for smaller tokens that also occur in
>>    natural language like "'" and "&" are we getting expected results?
>
> Comments are free text. Anything can happen. There is no such thing as
> "correct tokenization" in comments. Not interested.

Sure there is, just because the problem is fuzzy doesn't mean there
aren't more and less correct things to do.

But most importantly the output of "git diff" is made for human
consumption, people who use --word-diff are going to be looking at code
that contains comments, embedded natural language in C-strings etc.

I've got no reason to think that your changes here make it worse, but
just as a general matter we absolutely should consider that one of the
top priorities when it comes to these language drivers.

E.g. in some languages (like CJK) it's common for characters or words
not to have any unicode whitespace between them, so even a word-diff
mode for C can benefit from recognizing those character ranges and
splitting them appropriately, or at least trying.

So to take a comment of yours where you changed a comment at random:

    $ git log -U0 --oneline -1 --word-diff -p af920e36977 --word-diff-regex='[ a-zA-Z]*'
    [...]
    [- Note that this seemingly redundant second declaration is required-]{+ Note that this redundant forward declaration is required+}

Don't you think that would suck v.s. the now-behavior of:

    [...]
    Note that this[-seemingly-] redundant [-second-]{+forward+} declaration is required

The former is exactly the sort of thing you'd get in CJK languages with
a word-diff driver thought the line we should stop at was the same as a
comiler tokenizer.

Anyway, the cpp driver seems to do just fine on CJK.

I'm just saying that as a general thing it's definitely a priority for
these drivers to not only handle the narrow cases of tokens a compiler
would know about. Text that people commonly use should be presented in
some way that isn't line noise.

For an example of something we do a bit badly with the cpp driver is
parts of my 66f5f6dca95 (C style: use standard style for "TRANSLATORS"
comments, 2017-05-11).

I.e. there I was changing a comment format, and added a full stop to a
sentence, the word-diff is:

        /*
         {+*+} TRANSLATORS: here is a comment that explains the string to
         {+*+} be translated, that follows immediately after [-it-]{+it.+}
         */

Even though it has nothing to do with C syntax per-se that would be much more useful as:

        /*
         {+*+} TRANSLATORS: here is a comment that explains the string to
         {+*+} be translated, that follows immediately after it{+.+}
         */

I.e. treating a "." at the end of a word specially isn't C or C++
syntax, but it's absolutely input that the cpp driver *is* getting and
should be if possible be handling well.

I just did that by experimenting with
--word-diff-regex='([A-Za-z:]*|\*|\.)', that example is unchanged with
your series, but maybe low-hanging fruit....

Thanks for working on this, just an unsolicited braindump :)