Re: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library

Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxxxx> · Tue, 16 Jan 2018 14:50:33 -0200

"Weber, Olaf (HPC Data Management & Storage)" <olaf.weber@xxxxxxx>
writes:

> But this is not the only problem. The 'len' limit applies to the input strings. So you need to tell
> the utf8byte() routine that it applies. In other words, use utf8ncursor() which takes an additional
> length parameter to set up the cursors.
>
> With this change, utf8byte() will return 0 when it hits the end of the input string due to seeing a
> null byte or having consumed all characters, provided that it is not in the middle of a utf8 sequence
> or a an incomplete sequence of Unicode characters.
>
> Finally, note that utf8byte() returns an int, not a char. It does this for the same reasons getc() does.
>
> So utf8_strncmp() becomes something like the following. I'm using EINVAL instead of EIO, and note
> that -EINVAL does not imply that str1 and str2 are not equal when compared as a sequence of bytes.
>
> static int utf8_strncmp(const struct charset *charset,
> 			const char *str1,
> 			const char *str2,
> 			int len)
> {
> 	const struct utf8data *data = utf8nfkdi(charset->version);
> 	struct utf8cursor cur1;
> 	struct utf8cursor cur2;
> 	int c1;
> 	int c2;
>
> 	if (utf8ncursor(&cur1, data, str1, len) < 0)
> 		return -EINVAL;
> 	if (utf8ncursor(&cur2, data, str2, len) < 0)
> 		return -EINVAL;
>
> 	do {
> 		c1 = utf8byte(&cur1);
> 		c2 = utf8byte(&cur2);
>
> 		if (c1 < 0 || c2 < 0)
> 			return -EINVAL;
> 		if (c1 != c2)
> 			return 1;
> 	} while (c1);
>
> 	return 0;
> }

Hi Olaf,

Thanks for your review and deep explanations.

I get your point and I've added a test case to trigger it in the
test_ucd module.

One question that I have, on the other hand: Take the version you
shared, I want to avoid the -EINVAL for the case when strings s1
and s2 should match as equal, but strlen(s1) < strlen (s2).  In this
case:

strncmp (s1, s2, strlen (s2)) => Returns 0.  Matches Ok
strncmp (s1, s2, strlen (s1)) => Returns -EINVAL

I know -EINVAL doesn't mean they don't match, but this case seems too
error prone.

I suppose we just could:

 (1) let the caller deal with it, which is error prone.  Or,

 (2) Require two lens on strncmp, one for each string, Or,

 (3) use utf8cursor for the second string, which plays bad with non-null
 terminated strings, which is important for filesystems.

Do you see an alternative? I'm pending towards option 2.  Are you ok
with that?

-- 
Gabriel Krisman Bertazi