"Weber, Olaf (HPC Data Management & Storage)" <olaf.weber@xxxxxxx> writes: >> One question that I have, on the other hand: Take the version you >> shared, I want to avoid the -EINVAL for the case when strings s1 >> and s2 should match as equal, but strlen(s1) < strlen (s2). In this >> case: >> >> strncmp (s1, s2, strlen (s2)) => Returns 0. Matches Ok >> strncmp (s1, s2, strlen (s1)) => Returns -EINVAL >> >> I know -EINVAL doesn't mean they don't match, but this case seems too >> error prone. > > If I understand your question correctly, the case of interest is > > strncmp(s1, s2, len), where len <= strlen(s1) and len <= strlen(s2) > > As far as I can tell the code I sketched above handles that case in the > way you expect/want, when taking the complications introduced by > Unicode into account. Using utf8ncursor() ensures we do get an -EINVAL > if, and only if, we read beyond the end (len) of the source string as part of > the normalization process. But if we are at an acceptable boundary in the > source string when we see the end of the string, utf8byte() returns 0, > indicating a normal/non-error end of the scan. Hey Olaf, Sorry for the delay. It is not quite that scenario. The version that requires only 1 length fails when utf8ncursor receives a len that is smaller than one of the strings, which is a common case when something decomposes to a larger string: Take this case, for instance: s1 = {0xc2, 0xbc, 0x00}, /* 'VULGAR FRACTION ONE QUARTER' decomposes to */ s2 = {0x31, 0xe2, 0x81, 0x84, 0x34, 0x00}, /* 'NUMBER 1' + 'FRACTION SLASH' + 'NUMBER 4' */ If we do strncmp(s1, s2, strlen(s2)), it works fine. But if we use strlen(s1) on the third parameter, it fails. As far as I understand, the issue happens because utf8lookup will read to the maximum of len characters, aborting the lookup in the middle of a sequence. Since we don't hit a leaf for that code-point, it assumes an invalid sequence and utf8byte aborts. The easiest way to solve it is by receiving the two lens in strncmp. > I think it may be worth to write some tests to (hopefully) confirm that > the code really does what I intended it to do. The most likely case to > fail would be where you hit the len-imposed end after a codepoint > with CCC != 0. The only test I had for this scenario happened to have strlen(s1) == strlen(s2). I added the following, which I think catches this scenario: /* 'LATIN SMALL LETTER A WITH DIAERESIS' + 'COMBINING OGONEK' decomposes to */ /* 'LETTER A' + 'COMBINING OGONEK' + 'COMBINING DIAERESIS' */ s1 = {0xc3, 0xa4, 0xCC, 0xA8, 0x00}, s2 = {0x61, 0xCC, 0xA8, 0xcc, 0x88, 0x00}, I tested your version, and it works correctly for this scenario too, as long as we set the len parameter to use the largest string, s2, instead of s1. >> I suppose we just could: >> >> (1) let the caller deal with it, which is error prone. Or, > > The caller does have to do something when it gets -EINVAL. You have to > define the desired semantics of that case. > > In the original XFS-filesystem code my choice was to treat invalid UTF-8 > sequences as binary blobs for the sake of comparisons. > >> (2) Require two lens on strncmp, one for each string, Or, > > As a general rule this is certainly correct: each string has its > own associated maximum length within which it should have > been null-terminated. So whether you need one len per > string depends on the sources of the strings. In the original > XFS-based code there are some tricks related to this. I've applied this solution and it is solving every test case correctly, including those I mentioned above. Since it looks like the best approach, I applied the other things you commented, and modified the comparison functions code to receive 2 lens. I should submit a v2 shortly, once I'm done with dealing with some changes to the fs part. Thanks! -- Gabriel Krisman Bertazi