Re: [PATCH 1/5] Add ucs2 -> utf8 helper functions

Laszlo Ersek <lersek@xxxxxxxxxx> · Fri, 12 Feb 2016 14:22:29 +0100

(I imported these messages from the gmane archive, after reading about
this work on LWN. Sorry if I'm not looking at the latest patches.)

On 02/04/16 16:34, Peter Jones wrote:
> This adds ucs2_utf8size(), which tells us how big our ucs2 string is in
> bytes, and ucs2_as_utf8, which translates from ucs2 to utf8..
> 
> Signed-off-by: Peter Jones <pjones-H+wXaHxf7aLQT0dZR+AlfA@xxxxxxxxxxxxxxxx>
> Tested-by: Lee, Chun-Yi <jlee-IBi9RG/b67k@xxxxxxxxxxxxxxxx>
> Acked-by: Matthew Garrett <mjg59-JW9irJGTvgXQT0dZR+AlfA@xxxxxxxxxxxxxxxx>
> ---
>  include/linux/ucs2_string.h |  4 +++
>  lib/ucs2_string.c           | 62 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 66 insertions(+)
> 
> diff --git a/include/linux/ucs2_string.h b/include/linux/ucs2_string.h
> index cbb20af..bb679b4 100644
> --- a/include/linux/ucs2_string.h
> +++ b/include/linux/ucs2_string.h
> @@ -11,4 +11,8 @@ unsigned long ucs2_strlen(const ucs2_char_t *s);
>  unsigned long ucs2_strsize(const ucs2_char_t *data, unsigned long maxlength);
>  int ucs2_strncmp(const ucs2_char_t *a, const ucs2_char_t *b, size_t len);
>  
> +unsigned long ucs2_utf8size(const ucs2_char_t *src);
> +unsigned long ucs2_as_utf8(u8 *dest, const ucs2_char_t *src,
> +			   unsigned long maxlength);
> +
>  #endif /* _LINUX_UCS2_STRING_H_ */
> diff --git a/lib/ucs2_string.c b/lib/ucs2_string.c
> index 6f500ef..17dd74e 100644
> --- a/lib/ucs2_string.c
> +++ b/lib/ucs2_string.c
> @@ -49,3 +49,65 @@ ucs2_strncmp(const ucs2_char_t *a, const ucs2_char_t *b, size_t len)
>          }
>  }
>  EXPORT_SYMBOL(ucs2_strncmp);
> +
> +unsigned long
> +ucs2_utf8size(const ucs2_char_t *src)
> +{
> +	unsigned long i;
> +	unsigned long j = 0;
> +
> +	for (i = 0; i < ucs2_strlen(src); i++) {
> +		u16 c = src[i];
> +
> +		if (c > 0x800)
> +			j += 3;
> +		else if (c > 0x80)
> +			j += 2;
> +		else
> +			j += 1;
> +	}
> +
> +	return j;
> +}
> +EXPORT_SYMBOL(ucs2_utf8size);
> +
> +/*
> + * copy at most maxlength bytes of whole utf8 characters to dest from the
> + * ucs2 string src.
> + *
> + * The return value is the number of characters copied, not including the
> + * final NUL character.
> + */
> +unsigned long
> +ucs2_as_utf8(u8 *dest, const ucs2_char_t *src, unsigned long maxlength)
> +{
> +	unsigned int i;
> +	unsigned long j = 0;
> +	unsigned long limit = ucs2_strnlen(src, maxlength);
> +
> +	for (i = 0; maxlength && i < limit; i++) {
> +		u16 c = src[i];
> +
> +		if (c > 0x800) {
> +			if (maxlength < 3)
> +				break;
> +			maxlength -= 3;
> +			dest[j++] = 0xe0 | (c & 0xf000) >> 12;
> +			dest[j++] = 0x80 | (c & 0x0fc0) >> 8;
> +			dest[j++] = 0x80 | (c & 0x003f);
> +		} else if (c > 0x80) {
> +			if (maxlength < 2)
> +				break;
> +			maxlength -= 2;
> +			dest[j++] = 0xc0 | (c & 0xfe0) >> 5;
> +			dest[j++] = 0x80 | (c & 0x01f);
> +		} else {
> +			maxlength -= 1;
> +			dest[j++] = c & 0x7f;
> +		}
> +	}
> +	if (maxlength)
> +		dest[j] = '\0';
> +	return j;
> +}
> +EXPORT_SYMBOL(ucs2_as_utf8);
> 

Since this code is being added to a generic library, I have two comments
/ questions:

(1) shouldn't we handle the endianness of ucs2_char_t explicitly?

https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes
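
To illustrate what "explicitly" could mean here, below is a minimal userspace sketch that assembles a 16-bit code unit from raw bytes with a stated byte order, rather than relying on the host's. The helper names ucs2le_load() and ucs2be_load() are hypothetical and not part of the patch; in-kernel code would more likely use le16_to_cpu() on an __le16 value.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, not part of the patch: read one 16-bit code unit
 * from two raw bytes with an explicitly stated byte order, so the result
 * does not depend on the endianness of the CPU running the code.
 */
static uint16_t ucs2le_load(const uint8_t *p)
{
	/* little-endian: least significant byte comes first */
	return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

static uint16_t ucs2be_load(const uint8_t *p)
{
	/* big-endian: most significant byte comes first */
	return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}
```

With the byte sequence AC 20, the little-endian reader yields 0x20AC (the euro sign) on any host, while the big-endian reader yields 0xAC20.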

(2) Code *units* that are the high or low halves of surrogate pairs
(0xD800 to 0xDFFF) are not treated specially. That is, a pair is not
first decoded to the Unicode code *point* it represents (U+10000 to
U+10FFFF) and then encoded to UTF-8. Instead, the code above transcodes
each surrogate individually to UTF-8, and the result looks invalid.

https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF

https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points

Has this been considered?
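
For comparison, here is a rough userspace sketch of the "decode the pair, then encode the code point" approach described in (2). It is only an illustration under stated assumptions: the function name utf16_to_utf8_sketch() is hypothetical, the input is assumed to be NUL-terminated and already in host byte order, and unpaired surrogates are simply rejected. Other policies (for example substituting U+FFFD) are equally possible.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch, not part of the patch. Transcodes a NUL-terminated
 * UTF-16 string (host byte order assumed) to NUL-terminated UTF-8.
 * Returns the number of bytes written, not counting the NUL, or
 * (size_t)-1 on an unpaired surrogate or if dest is too small.
 */
static size_t utf16_to_utf8_sketch(uint8_t *dest, size_t destsize,
				   const uint16_t *src)
{
	size_t i = 0, j = 0;

	if (destsize == 0)
		return (size_t)-1;

	while (src[i]) {
		uint32_t cp = src[i++];
		size_t need;

		if (cp >= 0xD800 && cp <= 0xDBFF) {
			/* high surrogate: must be followed by a low one */
			uint16_t lo = src[i];

			if (lo < 0xDC00 || lo > 0xDFFF)
				return (size_t)-1;
			cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
			i++;
		} else if (cp >= 0xDC00 && cp <= 0xDFFF) {
			/* low surrogate without a preceding high one */
			return (size_t)-1;
		}

		need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
		if (destsize - j <= need)	/* keep room for the NUL */
			return (size_t)-1;

		if (need == 1) {
			dest[j++] = (uint8_t)cp;
		} else if (need == 2) {
			dest[j++] = (uint8_t)(0xC0 | (cp >> 6));
			dest[j++] = (uint8_t)(0x80 | (cp & 0x3F));
		} else if (need == 3) {
			dest[j++] = (uint8_t)(0xE0 | (cp >> 12));
			dest[j++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
			dest[j++] = (uint8_t)(0x80 | (cp & 0x3F));
		} else {
			dest[j++] = (uint8_t)(0xF0 | (cp >> 18));
			dest[j++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
			dest[j++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
			dest[j++] = (uint8_t)(0x80 | (cp & 0x3F));
		}
	}
	dest[j] = '\0';
	return j;
}
```

As a worked case, the pair 0xD83D 0xDE00 decodes to U+1F600 and comes out as the four bytes F0 9F 98 80, whereas transcoding each half on its own would emit two three-byte sequences.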

Admittedly, neither of the two questions above seems relevant for UEFI
specifically: first, UEFI is little-endian only; second, UEFI does not
support surrogate characters -- I found a hint about this in the HII
chapter of the spec, "31.2.6.2.2 Surrogate Area".

But since this code is being added to a generic library, the UEFI
assumptions may not hold for other (future) callers.

Or does "ucs2" -- as opposed to "utf16" -- imply that the caller is
responsible for not passing in code units from the surrogate area? If
so, I think this should be spelled out in a comment as well, and maybe
even WARN'd about.
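
A check along those lines could look roughly like the sketch below. The helper names ucs2_is_surrogate() and ucs2_has_surrogate() are hypothetical and not part of the patch; this is plain userspace C, and an in-kernel version would presumably wrap the check in something like WARN_ON_ONCE() at the call site.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helpers, not part of the patch.
 * A code unit is in the surrogate area (0xD800 to 0xDFFF) exactly when
 * its top five bits are 11011.
 */
static bool ucs2_is_surrogate(uint16_t c)
{
	return (c & 0xF800) == 0xD800;
}

/* Returns true if the NUL-terminated string contains any surrogate. */
static bool ucs2_has_surrogate(const uint16_t *s)
{
	for (; *s; s++)
		if (ucs2_is_surrogate(*s))
			return true;
	return false;
}
```

A caller that documents "no surrogates" as a precondition could run this once before transcoding and complain loudly if it ever returns true.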

Just my two cents; I'm by no means a Unicode expert.

Thanks
Laszlo