Re: [tip:x86/urgent] lib/ucs2_string: Correct ucs2 -> utf8 conversion

"H. Peter Anvin" <hpa@xxxxxxxxx> · Wed, 17 Feb 2016 09:15:48 -0800

On February 16, 2016 7:48:56 AM PST, tip-bot for Jason Andryuk <tipbot@xxxxxxxxx> wrote:
>Commit-ID:  a68075908a37850918ad96b056acc9ac4ce1bd90
>Gitweb:     http://git.kernel.org/tip/a68075908a37850918ad96b056acc9ac4ce1bd90
>Author:     Jason Andryuk <jandryuk@xxxxxxxxx>
>AuthorDate: Fri, 12 Feb 2016 23:13:33 +0000
>Committer:  Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>
>CommitDate: Tue, 16 Feb 2016 12:49:05 +0000
>
>lib/ucs2_string: Correct ucs2 -> utf8 conversion
>
>The comparisons should be >= since 0x800 and 0x80 require an additional
>bit to store.
>
>For the 3 byte case, the existing shift would drop off 2 more bits than
>intended.
>
>For the 2 byte case, there should be 5 bits in byte 1, and 6 bits in
>byte 2.
>
>Signed-off-by: Jason Andryuk <jandryuk@xxxxxxxxx>
>Reviewed-by: Laszlo Ersek <lersek@xxxxxxxxxx>
>Cc: Peter Jones <pjones@xxxxxxxxxx>
>Cc: Matthew Garrett <mjg59@xxxxxxxxxx>
>Cc: "Lee, Chun-Yi" <jlee@xxxxxxxx>
>Signed-off-by: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>
>---
> lib/ucs2_string.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
>diff --git a/lib/ucs2_string.c b/lib/ucs2_string.c
>index 17dd74e..f0b323a 100644
>--- a/lib/ucs2_string.c
>+++ b/lib/ucs2_string.c
>@@ -59,9 +59,9 @@ ucs2_utf8size(const ucs2_char_t *src)
> 	for (i = 0; i < ucs2_strlen(src); i++) {
> 		u16 c = src[i];
> 
>-		if (c > 0x800)
>+		if (c >= 0x800)
> 			j += 3;
>-		else if (c > 0x80)
>+		else if (c >= 0x80)
> 			j += 2;
> 		else
> 			j += 1;
>@@ -88,19 +88,19 @@ ucs2_as_utf8(u8 *dest, const ucs2_char_t *src, unsigned long maxlength)
> 	for (i = 0; maxlength && i < limit; i++) {
> 		u16 c = src[i];
> 
>-		if (c > 0x800) {
>+		if (c >= 0x800) {
> 			if (maxlength < 3)
> 				break;
> 			maxlength -= 3;
> 			dest[j++] = 0xe0 | (c & 0xf000) >> 12;
>-			dest[j++] = 0x80 | (c & 0x0fc0) >> 8;
>+			dest[j++] = 0x80 | (c & 0x0fc0) >> 6;
> 			dest[j++] = 0x80 | (c & 0x003f);
>-		} else if (c > 0x80) {
>+		} else if (c >= 0x80) {
> 			if (maxlength < 2)
> 				break;
> 			maxlength -= 2;
>-			dest[j++] = 0xc0 | (c & 0xfe0) >> 5;
>-			dest[j++] = 0x80 | (c & 0x01f);
>+			dest[j++] = 0xc0 | (c & 0x7c0) >> 6;
>+			dest[j++] = 0x80 | (c & 0x03f);
> 		} else {
> 			maxlength -= 1;
> 			dest[j++] = c & 0x7f;

I also believe there is no such thing as a "ucs2 string".  This code will produce invalid utf8 if utf16 surrogates are present; this is how the abortion called cesu8 ended up happening.
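
To make the surrogate point concrete, here is a minimal userspace sketch of a converter that treats its input as UTF-16 rather than "UCS-2". It is not the kernel's code: the function name utf16_to_utf8(), its signature, and the choice to reject unpaired surrogates with (size_t)-1 are all assumptions made for illustration. It uses the same corrected boundaries (0x80, 0x800) and shifts as the patch above, and adds the 4-byte case. Encoding each surrogate half separately as a 3-byte sequence, which is what the patched code does, turns U+1F600 (D83D DE00) into ED A0 BD ED B8 80, i.e. CESU-8; combining the pair first gives the valid UTF-8 sequence F0 9F 98 80.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of lib/ucs2_string.c.
 * Converts srclen UTF-16 code units to UTF-8, combining each surrogate
 * pair into a single 4-byte sequence. Returns the number of bytes
 * written, or (size_t)-1 on an unpaired surrogate or if dest is too small.
 */
size_t utf16_to_utf8(uint8_t *dest, size_t destlen,
		     const uint16_t *src, size_t srclen)
{
	size_t i, j = 0;

	for (i = 0; i < srclen; i++) {
		uint32_t c = src[i];

		if (c >= 0xd800 && c <= 0xdbff) {
			/* High surrogate: a low surrogate must follow. */
			if (i + 1 >= srclen ||
			    src[i + 1] < 0xdc00 || src[i + 1] > 0xdfff)
				return (size_t)-1;
			c = 0x10000 + ((c - 0xd800) << 10) +
			    (src[i + 1] - 0xdc00);
			i++;
		} else if (c >= 0xdc00 && c <= 0xdfff) {
			/* Low surrogate with no preceding high surrogate. */
			return (size_t)-1;
		}

		if (c < 0x80) {
			if (destlen - j < 1)
				return (size_t)-1;
			dest[j++] = (uint8_t)c;
		} else if (c < 0x800) {
			/* 5 bits in byte 1, 6 bits in byte 2. */
			if (destlen - j < 2)
				return (size_t)-1;
			dest[j++] = (uint8_t)(0xc0 | (c >> 6));
			dest[j++] = (uint8_t)(0x80 | (c & 0x3f));
		} else if (c < 0x10000) {
			/* 4 + 6 + 6 bits. */
			if (destlen - j < 3)
				return (size_t)-1;
			dest[j++] = (uint8_t)(0xe0 | (c >> 12));
			dest[j++] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
			dest[j++] = (uint8_t)(0x80 | (c & 0x3f));
		} else {
			/* 3 + 6 + 6 + 6 bits, reachable only via a pair. */
			if (destlen - j < 4)
				return (size_t)-1;
			dest[j++] = (uint8_t)(0xf0 | (c >> 18));
			dest[j++] = (uint8_t)(0x80 | ((c >> 12) & 0x3f));
			dest[j++] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
			dest[j++] = (uint8_t)(0x80 | (c & 0x3f));
		}
	}

	return j;
}
```

Whether an unpaired surrogate should be rejected, replaced with U+FFFD, or passed through is a policy decision; rejecting is simply the easiest behaviour to show in a sketch.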