Re: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library

Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxxxx> · Tue, 23 Jan 2018 01:33:06 -0200

"Weber, Olaf (HPC Data Management & Storage)" <olaf.weber@xxxxxxx>
writes:

>> One question that I have, on the other hand: Take the version you
>> shared, I want to avoid the -EINVAL for the case when strings s1
>> and s2 should match as equal, but strlen(s1) < strlen (s2).  In this
>> case:
>> 
>> strncmp (s1, s2, strlen (s2)) => Returns 0.  Matches Ok
>> strncmp (s1, s2, strlen (s1)) => Returns -EINVAL
>> 
>> I know -EINVAL doesn't mean they don't match, but this case seems too
>> error prone.
>
> If I understand your question correctly, the case of interest is
>
> 	strncmp(s1, s2, len), where len <= strlen(s1) and len <= strlen(s2)
>
> As far as I can tell the code I sketched above handles that case in the
> way you expect/want, when taking the complications introduced by
> Unicode into account. Using utf8ncursor() ensures we do get an -EINVAL
> if, and only if, we read beyond the end (len) of the source string as part of
> the normalization process. But if we are at an acceptable boundary in the
> source string when we see the end of the string, utf8byte() returns 0,
> indicating a normal/non-error end of the scan.

Hey Olaf,

Sorry for the delay.

It is not quite that scenario.  The version that requires only 1 length
fails when utf8ncursor receives a len that is smaller than one of the
strings, which is a common case when something decomposes to a larger
string:

Take this case, for instance:

s1 = {0xc2, 0xbc, 0x00},   /* 'VULGAR FRACTION ONE QUARTER' decomposes to */
s2 = {0x31, 0xe2, 0x81, 0x84, 0x34, 0x00},  /* 'NUMBER 1' + 'FRACTION SLASH' + 'NUMBER 4' */

If we do strncmp(s1, s2, strlen(s2)), it works fine.  But if we use
strlen(s1) on the third parameter, it fails.  As far as I understand,
the issue happens because utf8lookup will read to the maximum of len
characters, aborting the lookup in the middle of a sequence. Since we
don't hit a leaf for that code-point, it assumes an invalid sequence and
utf8byte aborts.

The easiest way to solve it is by receiving the two lens in strncmp.

> I think it may be worth to write some tests to (hopefully) confirm that
> the code really does what I intended it to do. The most likely case to
> fail would be where you hit the len-imposed end after a codepoint
> with CCC != 0.

The only test I had for this scenario happened to have strlen(s1) ==
strlen(s2).  I added the following, which I think catches this scenario:

/* 'LATIN SMALL LETTER A WITH DIAERESIS' + 'COMBINING OGONEK'  decomposes to */
/* 'LETTER A' + 'COMBINING  OGONEK' + 'COMBINING DIAERESIS' */
  s1 = {0xc3, 0xa4, 0xCC, 0xA8, 0x00},
  s2 = {0x61, 0xCC, 0xA8, 0xcc, 0x88, 0x00},

I tested your version, and it works correctly for this scenario too, as
long as we set the len parameter to use the largest string, s2, instead
of s1.

>> I suppose we just could:
>> 
>>  (1) let the caller deal with it, which is error prone.  Or,
>
> The caller does have to do something when it gets -EINVAL. You have to
> define the desired semantics of that case.
>
> In the original XFS-filesystem code my choice was to treat invalid UTF-8
> sequences as binary blobs for the sake of comparisons.
>
>>  (2) Require two lens on strncmp, one for each string, Or,
>
> As a general rule this is certainly correct: each string has its
> own associated maximum length within which it should have
> been null-terminated.  So whether you need one len per
> string depends on the sources of the strings. In the original
> XFS-based code there are some tricks related to this.

I've applied this solution and it is solving every test case correctly,
including those I mentioned above.  Since it looks like the best
approach, I applied the other things you commented, and modified the
comparison functions code to receive 2 lens.  I should submit a v2
shortly, once I'm done with dealing with some changes to the fs part.

Thanks!

-- 
Gabriel Krisman Bertazi