Re: [PATCH 1/3] test-ctype: test isascii

René Scharfe <l.s.r@xxxxxx> · Mon, 13 Feb 2023 19:37:15 +0100

On 13.02.23 at 04:39, Junio C Hamano wrote:
> René Scharfe <l.s.r@xxxxxx> writes:
>
>> On 11.02.23 at 20:48, Junio C Hamano wrote:
>>> René Scharfe <l.s.r@xxxxxx> writes:
>>>
>>>> Test the character classifier added by c2e9364a06 (cleanup: add
>>>> isascii(), 2009-03-07).  It returns 1 for NUL as well, which requires
>>>> special treatment, as our string-based tester can't find it with
>>>> strcmp(3).  Allow NUL to be given as the first character in a class
>>>> specification string.  This has the downside of no longer supporting
>>>> the empty string, but that's OK since we are not interested in testing
>>>> character classes with no members.
>>>
>>> I wonder how effective a test we can have by checking a table we use
>>> in production (i.e. ctype.c::sane_ctype[]) against another table we
>>> use only for testing (i.e. string literals in test-ctype.c), but
>>> that is not something new in this series.
>>
>> What aspect is left uncovered?
>>
>> Or do you mean that the production table should be made trivially
>> readable to avoid having to test at all?
>
> Much closer to the latter but not quite.
>
> Both tables are not all that transparent, and it feels that the
> protection by the tests largely depends on the fact that it is less
> likely for us to make the same mistake in two "not so crystal clear"
> tables for the same byte.

The test strings for islower() and isupper() I wrote down from memory
long ago, I think.  They should be easily verifiable, like the new ones
for isxdigit().  The one for isascii() is a bit tiring, but verifying it
against the man page which says that the characters from value 0 to 0177
are included should be feasible.  The ones for iscntrl() and ispunct() I
got from their man pages.
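
To make the isascii() case concrete, here is a rough sketch of the range
check and of a string-based membership test that accepts NUL as the first
character of a class specification.  This is not the code from the patch;
sketch_isascii() and sketch_is_in() are hypothetical names, and the
strchr(3)-based lookup is an assumption about how such a tester could work.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the classifier: true for values 0 to 0177. */
static int sketch_isascii(int ch)
{
	return (ch & ~0x7f) == 0;
}

/*
 * Hypothetical membership test.  A leading NUL in class_spec means
 * "NUL is a member, the other members follow".  An empty string is
 * therefore not a valid specification here: it must never be passed.
 */
static int sketch_is_in(const char *class_spec, int ch)
{
	if (ch == 0)
		return class_spec[0] == '\0';
	if (class_spec[0] == '\0')
		class_spec++;	/* skip the NUL marker */
	return strchr(class_spec, ch) != NULL;
}
```

With this, "\0abc" describes the class {NUL, a, b, c}, while "abc"
describes {a, b, c} without NUL.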

But when it came to isprint() I got lazy and just copied from ctype.c --
you got me there.  A more intuitive representation could be:

   " " LOWER UPPER DIGIT PUNCT

In my experience having two copies already helps when modifying one of
them -- but at some point we had better check them against an external
source of truth.
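
One possible way to do both, sketched under assumptions: build the
isprint() string from named pieces and compare each piece against libc's
classifiers, which a C program sees in the "C" locale unless it calls
setlocale(3).  The SK_* macros and count_mismatches() are hypothetical
names, not anything in test-ctype.c, and the comparison is limited to
1..127 because NUL cannot be looked up with strchr(3).

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical building blocks; adjacent literals are concatenated. */
#define SK_DIGIT "0123456789"
#define SK_LOWER "abcdefghijklmnopqrstuvwxyz"
#define SK_UPPER "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
#define SK_PUNCT "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
#define SK_PRINT " " SK_LOWER SK_UPPER SK_DIGIT SK_PUNCT

/*
 * Count the characters in 1..127 for which the string-based class
 * and the libc classifier disagree.
 */
static int count_mismatches(const char *members, int (*libc_fn)(int))
{
	int ch, mismatches = 0;

	for (ch = 1; ch < 128; ch++) {
		int expected = libc_fn(ch) != 0;
		int actual = strchr(members, ch) != NULL;
		if (expected != actual)
			mismatches++;
	}
	return mismatches;
}
```

A call like count_mismatches(SK_PRINT, isprint) should then return 0 on an
ASCII-based system; a non-zero result points at a typo in one of the
strings.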

The ctype.c version needs to be fast, so we probably have to make some
concessions to readability.  I'd love to be proven wrong on that,
though.
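
For illustration only, here is the general shape of such a trade-off: a
256-entry flag table gives a lookup that is a single load and mask, and
filling it from readable strings at startup keeps the source legible at
the cost of an init call.  This is a sketch, not how ctype.c is written;
all tb_* and TB_* names are made up, and the set of space characters is an
assumption.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical class flags, one bit per class. */
enum { TB_SPACE = 1, TB_DIGIT = 2, TB_ALPHA = 4 };

static unsigned char tb_table[256];

/* Set flag for every character listed in members. */
static void tb_mark(const char *members, unsigned char flag)
{
	for (; *members; members++)
		tb_table[(unsigned char)*members] |= flag;
}

/* Fill the table from readable class strings; call once at startup. */
static void tb_init(void)
{
	memset(tb_table, 0, sizeof(tb_table));
	tb_mark(" \t\n\r", TB_SPACE);
	tb_mark("0123456789", TB_DIGIT);
	tb_mark("abcdefghijklmnopqrstuvwxyz", TB_ALPHA);
	tb_mark("ABCDEFGHIJKLMNOPQRSTUVWXYZ", TB_ALPHA);
}

/* The fast path: one table load and one mask. */
static int tb_test(int ch, unsigned char mask)
{
	return (tb_table[(unsigned char)ch] & mask) != 0;
}
```

Whether the startup cost and the loss of a constant table are acceptable
is a separate question.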

>> ... but parsing commit messages and blob
>> payloads should perhaps better be done with locale-aware versions
>> with multi-byte character support.
>
> Yes, that does make sense but it is orthogonal to what sane_ctype
> wants to address, I would think.

Currently we can only use one or the other variant because our sane
versions use the same names as the locale-aware ones.  Full overlap
instead of orthogonality.  I don't know if there is a practical impact
besides not recognizing function lines that start with umlauts etc.
for diff hunk headers, though.
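
To show what having both variants side by side could look like, here is a
minimal sketch with distinct names.  ascii_isalpha() and locale_isalpha()
are hypothetical; nothing like them exists in the tree, and the result of
the libc call for bytes above 127 depends on the locale the program has
selected.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical ASCII-only classifier: locale-independent. */
static int ascii_isalpha(int ch)
{
	return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
}

/*
 * Hypothetical wrapper around libc's isalpha().  The cast matters:
 * passing a negative char value to isalpha() is undefined behavior.
 */
static int locale_isalpha(char ch)
{
	return isalpha((unsigned char)ch) != 0;
}
```

With a single-byte locale such as ISO-8859-1 selected via setlocale(3),
locale_isalpha() may accept the byte 0xE4 (a-umlaut) while ascii_isalpha()
never does; multi-byte encodings would need wide-character functions
instead.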

René