(I imported these messages from the gmane archive, after reading about
this work on LWN. Sorry if I'm not looking at the latest patches.)

On 02/04/16 16:34, Peter Jones wrote:
> This adds ucs2_utf8size(), which tells us how big our ucs2 string is in
> bytes, and ucs2_as_utf8, which translates from ucs2 to utf8..
>
> Signed-off-by: Peter Jones <pjones-H+wXaHxf7aLQT0dZR+AlfA@xxxxxxxxxxxxxxxx>
> Tested-by: Lee, Chun-Yi <jlee-IBi9RG/b67k@xxxxxxxxxxxxxxxx>
> Acked-by: Matthew Garrett <mjg59-JW9irJGTvgXQT0dZR+AlfA@xxxxxxxxxxxxxxxx>
> ---
>  include/linux/ucs2_string.h |  4 +++
>  lib/ucs2_string.c           | 62 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 66 insertions(+)
>
> diff --git a/include/linux/ucs2_string.h b/include/linux/ucs2_string.h
> index cbb20af..bb679b4 100644
> --- a/include/linux/ucs2_string.h
> +++ b/include/linux/ucs2_string.h
> @@ -11,4 +11,8 @@ unsigned long ucs2_strlen(const ucs2_char_t *s);
>  unsigned long ucs2_strsize(const ucs2_char_t *data, unsigned long maxlength);
>  int ucs2_strncmp(const ucs2_char_t *a, const ucs2_char_t *b, size_t len);
>
> +unsigned long ucs2_utf8size(const ucs2_char_t *src);
> +unsigned long ucs2_as_utf8(u8 *dest, const ucs2_char_t *src,
> +			   unsigned long maxlength);
> +
>  #endif /* _LINUX_UCS2_STRING_H_ */
> diff --git a/lib/ucs2_string.c b/lib/ucs2_string.c
> index 6f500ef..17dd74e 100644
> --- a/lib/ucs2_string.c
> +++ b/lib/ucs2_string.c
> @@ -49,3 +49,65 @@ ucs2_strncmp(const ucs2_char_t *a, const ucs2_char_t *b, size_t len)
>  	}
>  }
>  EXPORT_SYMBOL(ucs2_strncmp);
> +
> +unsigned long
> +ucs2_utf8size(const ucs2_char_t *src)
> +{
> +	unsigned long i;
> +	unsigned long j = 0;
> +
> +	for (i = 0; i < ucs2_strlen(src); i++) {
> +		u16 c = src[i];
> +
> +		if (c > 0x800)
> +			j += 3;
> +		else if (c > 0x80)
> +			j += 2;
> +		else
> +			j += 1;
> +	}
> +
> +	return j;
> +}
> +EXPORT_SYMBOL(ucs2_utf8size);
> +
> +/*
> + * copy at most maxlength bytes of whole utf8 characters to dest from the
> + * ucs2 string src.
> + *
> + * The return value is the number of characters copied, not including the
> + * final NUL character.
> + */
> +unsigned long
> +ucs2_as_utf8(u8 *dest, const ucs2_char_t *src, unsigned long maxlength)
> +{
> +	unsigned int i;
> +	unsigned long j = 0;
> +	unsigned long limit = ucs2_strnlen(src, maxlength);
> +
> +	for (i = 0; maxlength && i < limit; i++) {
> +		u16 c = src[i];
> +
> +		if (c > 0x800) {
> +			if (maxlength < 3)
> +				break;
> +			maxlength -= 3;
> +			dest[j++] = 0xe0 | (c & 0xf000) >> 12;
> +			dest[j++] = 0x80 | (c & 0x0fc0) >> 8;
> +			dest[j++] = 0x80 | (c & 0x003f);
> +		} else if (c > 0x80) {
> +			if (maxlength < 2)
> +				break;
> +			maxlength -= 2;
> +			dest[j++] = 0xc0 | (c & 0xfe0) >> 5;
> +			dest[j++] = 0x80 | (c & 0x01f);
> +		} else {
> +			maxlength -= 1;
> +			dest[j++] = c & 0x7f;
> +		}
> +	}
> +	if (maxlength)
> +		dest[j] = '\0';
> +	return j;
> +}
> +EXPORT_SYMBOL(ucs2_as_utf8);
>

Since this code is being added to a generic library, I have two comments
/ questions:

(1) shouldn't we handle the endianness of ucs2_char_t explicitly?

    https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes

(2) Code *units* that stand for low or high halves of surrogate pairs
    (0xD800 to 0xDFFF) are not treated specially; meaning the unicode
    code *point* they represent (from U+10000 to U+10FFFF) is not
    decoded, and then separately encoded to UTF-8. Instead, the above
    will transcode the surrogates individually to UTF-8, which looks
    invalid.

    https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF
    https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points

    Has this been considered?

Again, the above two questions don't seem relevant for UEFI
specifically: first, UEFI is little-endian only; second, UEFI does not
support surrogate characters -- I found a hint about this in the HII
chapter of the spec, "31.2.6.2.2 Surrogate Area". But since this code is
being added to a generic library, the UEFI assumptions may not hold for
other (future) callers.
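To make points (1) and (2) a bit more concrete, here is a rough userspace
sketch of what an explicitly little-endian, surrogate-aware conversion
could look like. It is only an illustration under my own assumptions, not
a proposal for the patch: the names read_le16() and utf16le_as_utf8() are
made up, the input is taken as a NUL-terminated byte buffer rather than
ucs2_char_t, and an unpaired surrogate is reported with -1 (the kernel
code might prefer a WARN or a replacement character instead).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read one 16-bit code unit stored little-endian. */
static uint16_t read_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Hypothetical sketch: transcode a NUL-terminated UTF-16LE buffer to UTF-8.
 * At most maxlength bytes of whole characters are written to dest; a NUL is
 * appended only if there is room left. Returns the number of bytes written
 * (not counting the NUL), or -1 on an unpaired surrogate.
 */
long utf16le_as_utf8(uint8_t *dest, const uint8_t *src, size_t maxlength)
{
	size_t j = 0;

	for (;;) {
		uint32_t cp = read_le16(src);
		size_t need;

		src += 2;
		if (cp == 0)
			break;

		if (cp >= 0xD800 && cp <= 0xDBFF) {
			/* High surrogate: a low surrogate must follow. */
			uint16_t lo = read_le16(src);

			if (lo < 0xDC00 || lo > 0xDFFF)
				return -1;
			src += 2;
			cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
		} else if (cp >= 0xDC00 && cp <= 0xDFFF) {
			return -1;	/* stray low surrogate */
		}

		need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
		if (maxlength - j < need)
			break;	/* never emit a partial character */

		if (need == 1) {
			dest[j++] = (uint8_t)cp;
		} else if (need == 2) {
			dest[j++] = (uint8_t)(0xC0 | (cp >> 6));
			dest[j++] = (uint8_t)(0x80 | (cp & 0x3F));
		} else if (need == 3) {
			dest[j++] = (uint8_t)(0xE0 | (cp >> 12));
			dest[j++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
			dest[j++] = (uint8_t)(0x80 | (cp & 0x3F));
		} else {
			dest[j++] = (uint8_t)(0xF0 | (cp >> 18));
			dest[j++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
			dest[j++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
			dest[j++] = (uint8_t)(0x80 | (cp & 0x3F));
		}
	}

	if (j < maxlength)
		dest[j] = '\0';
	return (long)j;
}
```

With this, the surrogate pair 0xD83D 0xDE00 is decoded to the single code
point U+1F600 and comes out as the four bytes F0 9F 98 80, instead of two
separate three-byte sequences.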
Or does "ucs2" -- as opposed to "utf16" -- imply that the caller is
responsible for not passing in code units from the surrogate area? If
so, I think this should be spelled out in a comment as well, and maybe
even WARN'd about.

Just my two cents; I'm by no means a unicode expert.

Thanks
Laszlo