Re: [tip:x86/urgent] lib/ucs2_string: Correct ucs2 -> utf8 conversion

Laszlo Ersek <lersek@xxxxxxxxxx> · Wed, 17 Feb 2016 19:04:51 +0100

On 02/17/16 18:15, H. Peter Anvin wrote:
> On February 16, 2016 7:48:56 AM PST, tip-bot for Jason Andryuk <tipbot@xxxxxxxxx> wrote:
>> Commit-ID:  a68075908a37850918ad96b056acc9ac4ce1bd90
>> Gitweb:     http://git.kernel.org/tip/a68075908a37850918ad96b056acc9ac4ce1bd90
>> Author:     Jason Andryuk <jandryuk@xxxxxxxxx>
>> AuthorDate: Fri, 12 Feb 2016 23:13:33 +0000
>> Committer:  Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>
>> CommitDate: Tue, 16 Feb 2016 12:49:05 +0000
>>
>> lib/ucs2_string: Correct ucs2 -> utf8 conversion
>>
>> The comparisons should be >= since 0x800 and 0x80 require an additional
>> bit to store.
>>
>> For the 3 byte case, the existing shift would drop off 2 more bits than
>> intended.
>>
>> For the 2 byte case, there should be 5 bits in byte 1, and 6 bits in
>> byte 2.
>>
>> Signed-off-by: Jason Andryuk <jandryuk@xxxxxxxxx>
>> Reviewed-by: Laszlo Ersek <lersek@xxxxxxxxxx>
>> Cc: Peter Jones <pjones@xxxxxxxxxx>
>> Cc: Matthew Garrett <mjg59@xxxxxxxxxx>
>> Cc: "Lee, Chun-Yi" <jlee@xxxxxxxx>
>> Signed-off-by: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>
>> ---
>> lib/ucs2_string.c | 14 +++++++-------
>> 1 file changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/lib/ucs2_string.c b/lib/ucs2_string.c
>> index 17dd74e..f0b323a 100644
>> --- a/lib/ucs2_string.c
>> +++ b/lib/ucs2_string.c
>> @@ -59,9 +59,9 @@ ucs2_utf8size(const ucs2_char_t *src)
>> 	for (i = 0; i < ucs2_strlen(src); i++) {
>> 		u16 c = src[i];
>>
>> -		if (c > 0x800)
>> +		if (c >= 0x800)
>> 			j += 3;
>> -		else if (c > 0x80)
>> +		else if (c >= 0x80)
>> 			j += 2;
>> 		else
>> 			j += 1;
>> @@ -88,19 +88,19 @@ ucs2_as_utf8(u8 *dest, const ucs2_char_t *src, unsigned long maxlength)
>> 	for (i = 0; maxlength && i < limit; i++) {
>> 		u16 c = src[i];
>>
>> -		if (c > 0x800) {
>> +		if (c >= 0x800) {
>> 			if (maxlength < 3)
>> 				break;
>> 			maxlength -= 3;
>> 			dest[j++] = 0xe0 | (c & 0xf000) >> 12;
>> -			dest[j++] = 0x80 | (c & 0x0fc0) >> 8;
>> +			dest[j++] = 0x80 | (c & 0x0fc0) >> 6;
>> 			dest[j++] = 0x80 | (c & 0x003f);
>> -		} else if (c > 0x80) {
>> +		} else if (c >= 0x80) {
>> 			if (maxlength < 2)
>> 				break;
>> 			maxlength -= 2;
>> -			dest[j++] = 0xc0 | (c & 0xfe0) >> 5;
>> -			dest[j++] = 0x80 | (c & 0x01f);
>> +			dest[j++] = 0xc0 | (c & 0x7c0) >> 6;
>> +			dest[j++] = 0x80 | (c & 0x03f);
>> 		} else {
>> 			maxlength -= 1;
>> 			dest[j++] = c & 0x7f;
> 
> I also believe there is no such thing as a "ucs2 string".  This code will produce invalid utf8 if utf16 surrogates are present; this is how the abortion called cesu8 ended up happening.

I raised the same concern; please see the sub-thread at:

http://thread.gmane.org/gmane.linux.kernel.efi/7366/focus=7493

If I understand correctly, the decision was that the caller would be
responsible for not passing in surrogates.
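
For illustration only, here is a minimal userspace sketch of what such a
caller-side check could look like. The helper names (ucs2_is_surrogate,
ucs2_has_no_surrogates, encode_one) are hypothetical and do not exist in
lib/ucs2_string.c; encode_one() merely mirrors the per-character logic
from the patch quoted above, using stdint types in place of the kernel's
u8/u16.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if c is a UTF-16 surrogate code unit
 * (0xD800..0xDFFF), which has no valid UTF-8 encoding on its own. */
static bool ucs2_is_surrogate(uint16_t c)
{
	return c >= 0xd800 && c <= 0xdfff;
}

/* Hypothetical caller-side check: true if none of the first len code
 * units of src is a surrogate, i.e. the string is safe to convert. */
static bool ucs2_has_no_surrogates(const uint16_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (ucs2_is_surrogate(src[i]))
			return false;
	return true;
}

/* Userspace copy of the per-character branches in the patched
 * ucs2_as_utf8(). Writes 1 to 3 bytes to dest and returns the count.
 * Assumption: dest has room for at least 3 bytes. */
static size_t encode_one(uint8_t *dest, uint16_t c)
{
	size_t j = 0;

	assert(dest != NULL);

	if (c >= 0x800) {
		dest[j++] = 0xe0 | (c & 0xf000) >> 12;
		dest[j++] = 0x80 | (c & 0x0fc0) >> 6;
		dest[j++] = 0x80 | (c & 0x003f);
	} else if (c >= 0x80) {
		dest[j++] = 0xc0 | (c & 0x7c0) >> 6;
		dest[j++] = 0x80 | (c & 0x03f);
	} else {
		dest[j++] = c & 0x7f;
	}
	return j;
}
```

With the patched logic, 0x80 encodes as c2 80 and 0x800 as e0 a0 80,
which are the correct boundary encodings. A lone high surrogate such as
0xd83d still comes out as ed a0 bd, which is not well-formed UTF-8, so a
caller would have to run something like ucs2_has_no_surrogates() first.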

Thanks
Laszlo
