On 02/24/16 00:26, H. Peter Anvin wrote:
> On 02/17/2016 10:04 AM, Laszlo Ersek wrote:
>>>
>>> I also believe there is no such thing as a "ucs2 string". This code
>>> will procedure invalid utf8 if utf16 surrogates are present; this is
>>> how the abortion called cesu8 ended up happening.
>>
>> I raised the same concern; please see the sub-thread at:
>>
>> http://thread.gmane.org/gmane.linux.kernel.efi/7366/focus=7493
>>
>> If I understand correctly, the decision was that the caller would be
>> responsible for not passing in surrogates.
>>
>
> The "caller" here is the UEFI variables storage. What do you do if a
> variable contains them? Refuse to represent them? Not a real option.

First, let me repeat that it's not me who needs convincing; I think I
raised this issue first.

Second, Peter's and Matt Fleming's argument was that this service is
meant to be used for UEFI purposes only, at the moment; if another
caller comes along, they will have to do the necessary modifications.

So the question becomes if UEFI variable *names* (not contents) can
contain surrogates.

The variable services in the UEFI spec (v2.6) universally use the type
(CHAR16*) for variable names (see "7.2 Variable Services"). In "2.3.1
Data Types", CHAR16 is defined as:

    2-byte Character. Unless otherwise specified all characters and
    strings are stored in the UCS-2 encoding format as defined by
    Unicode 2.1 and ISO/IEC 10646 standards.

I downloaded "ISO/IEC 10646:2014" from
<http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html>
(the zip file is only 130MB...)

From the document "4th-10646-00-Main.pdf":

    4 Terms and definitions

    4.31 high-surrogate code point
    code point in the range D800 to DBFF reserved for the use of UTF-16

    4.32 high-surrogate code unit
    16-bit code unit in the range D800 to DBFF used in UTF-16 as the
    leading code unit of a surrogate pair

    4.39 low-surrogate code point
    code point in the range DC00 to DFFF reserved for the use of UTF-16

    4.40 low-surrogate code unit
    16-bit code unit in the range DC00 to DFFF used in UTF-16 as the
    trailing code unit of a surrogate pair

    4.58 UCS scalar value
    any UCS code point except high-surrogate and low-surrogate code
    points

Then, from section 9.3 ("UTF-16"):

    9.3 UTF-16

    NOTE – Former editions of this International Standard included
    references to a two-octet BMP form called UCS-2 which would be a
    subset of the UTF-16 encoding form restricted to the BMP UCS scalar
    values. The UCS-2 form is deprecated.

Therefore, the name of this interface (ucs2_as_utf8) correctly reflects
its functionality -- it is not supposed to convert from generic UTF-16,
only from the UCS-2 encoding form. That is a subset of UTF-16. UCS-2 is
restricted to UCS scalar values, which means it doesn't have surrogates.

The question is then whether this function is usable to convert UEFI
variable names. According to the UEFI standard, it is. (And at the
moment, the function is not supposed to be used by callers other than
UEFI code, according to what Matt Fleming said.)

Now, if a variable name *happens* to contain a high or low surrogate
code unit -- violating the UEFI standard! --, then the conversion will
output a UTF-8 sequence directly encoding the corresponding surrogate
code point. Even for such (invalid) inputs, the function doesn't decay
to undefined behavior; it is safe, and the output uniquely maps to the
input.

Let's consider the use of the converted output for comparison purposes.

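To make the conversion and the comparison use case concrete, here is a
minimal sketch in plain C. It is not the kernel's ucs2_as_utf8()
implementation; the helper names (ucs2_unit_to_utf8, ucs2_str_to_utf8,
ucs2_name_equals) and the 64-byte scratch buffer are assumptions made
up for illustration. It treats each 16-bit code unit as a code point
and encodes it directly, so a surrogate code unit simply becomes a
three-byte sequence.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not the kernel's ucs2_as_utf8()): encode one
 * 16-bit code unit as UTF-8, treating it as a code point. Surrogate
 * code units (0xD800..0xDFFF) get no special handling and come out as
 * a three-byte sequence. Returns the number of bytes written (1..3);
 * dst must have room for 3 bytes.
 */
static size_t ucs2_unit_to_utf8(uint16_t c, unsigned char *dst)
{
    if (c < 0x80) {
        dst[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        dst[0] = (unsigned char)(0xC0 | (c >> 6));
        dst[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    dst[0] = (unsigned char)(0xE0 | (c >> 12));
    dst[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    dst[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}

/*
 * Convert a NUL-terminated array of 16-bit code units. Writes at most
 * dst_len bytes including the terminating NUL. Returns the number of
 * bytes written excluding the NUL, or (size_t)-1 if dst is too small.
 */
static size_t ucs2_str_to_utf8(const uint16_t *src, unsigned char *dst,
                               size_t dst_len)
{
    size_t out = 0;

    for (; *src != 0; src++) {
        unsigned char tmp[3];
        size_t n = ucs2_unit_to_utf8(*src, tmp);

        if (out + n + 1 > dst_len)
            return (size_t)-1;
        memcpy(dst + out, tmp, n);
        out += n;
    }
    if (out + 1 > dst_len)
        return (size_t)-1;
    dst[out] = '\0';
    return out;
}

/*
 * Compare a variable name against an expected UTF-8 constant. The
 * 64-byte scratch buffer is an arbitrary choice for this sketch.
 * Returns 1 on match, 0 otherwise (including names that do not fit).
 */
static int ucs2_name_equals(const uint16_t *name, const char *expected)
{
    unsigned char buf[64];
    size_t n = ucs2_str_to_utf8(name, buf, sizeof(buf));

    if (n == (size_t)-1)
        return 0;
    return n == strlen(expected) && memcmp(buf, expected, n) == 0;
}
```

With these helpers, a name such as { 'B', 'o', 'o', 't', 0 } compares
equal to "Boot", and a spec-breaking name such as { 'X', 0xD800, 0 }
compares equal to the byte string "X\xED\xA0\x80" and to nothing else.
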
No *standard* variable name (encoded as UCS-2) will contain surrogate
code units (that would be in breach of the UEFI spec), so there is no
need to represent them as UTF-8 arrays (constants) in the kernel code,
as long as you would like to compare variable names against
standardized variable names.

Nevertheless, if you would like to represent variable names whose UCS-2
encoding contains surrogate code units (breaking the UEFI spec), that is
still possible: store the unique UTF-8 sequence that ucs2_as_utf8()
transforms the (spec-breaking) variable name into.

Laszlo