Re: [tip:x86/urgent] lib/ucs2_string: Correct ucs2 -> utf8 conversion

On 02/24/16 00:26, H. Peter Anvin wrote:
> On 02/17/2016 10:04 AM, Laszlo Ersek wrote:
>>>
>>> I also believe there is no such thing as a "ucs2 string".  This code will produce invalid utf8 if utf16 surrogates are present; this is how the abortion called cesu8 ended up happening.
>>
>> I raised the same concern; please see the sub-thread at:
>>
>> http://thread.gmane.org/gmane.linux.kernel.efi/7366/focus=7493
>>
>> If I understand correctly, the decision was that the caller would be
>> responsible for not passing in surrogates.
>>
> 
> The "caller" here is the UEFI variables storage.  What do you do if a
> variable contains them?  Refuse to represent them?  Not a real option.

First, let me repeat that it's not me who needs convincing; I think I
raised this issue first.

Second, Peter's and Matt Fleming's argument was that this service is
meant to be used for UEFI purposes only, at the moment; if another
caller comes along, they will have to do the necessary modifications.

So the question becomes whether UEFI variable *names* (not contents) can
contain surrogates.

The variable services in the UEFI spec (v2.6) universally use the type
(CHAR16*) for variable names (see "7.2 Variable Services"). In "2.3.1
Data Types", CHAR16 is defined as:

    2-byte Character. Unless otherwise specified all characters and
    strings are stored in the UCS-2 encoding format as defined by
    Unicode 2.1 and ISO/IEC 10646 standards.

I downloaded "ISO/IEC 10646:2014" from
<http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html>
(the zip file is only 130MB...) From the document "4th-10646-00-Main.pdf":

  4 Terms and definitions

  4.31 high-surrogate code point

    code point in the range D800 to DBFF reserved for the use of UTF-16

  4.32 high-surrogate code unit

    16-bit code unit in the range D800 to DBFF used in UTF-16 as the
    leading code unit of a surrogate pair

  4.39 low-surrogate code point

    code point in the range DC00 to DFFF reserved for the use of UTF-16

  4.40 low-surrogate code unit

    16-bit code unit in the range DC00 to DFFF used in UTF-16 as the
    trailing code unit of a surrogate pair

  4.58 UCS scalar value

    any UCS code point except high-surrogate and low-surrogate code
    points

Then, from section 9.3 ("UTF-16"):

  9.3 UTF-16

    NOTE – Former editions of this International Standard included
    references to a two-octet BMP form called UCS-2 which would be a
    subset of the UTF-16 encoding form restricted to the BMP UCS scalar
    values. The UCS-2 form is deprecated.

Therefore, the name of this interface (ucs2_as_utf8) correctly reflects
its functionality -- it is not supposed to convert from generic UTF-16,
only from the UCS-2 encoding form. That is a subset of UTF-16. UCS-2 is
restricted to UCS scalar values, which means it doesn't have surrogates.
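
To make the restriction concrete, here is a minimal userspace sketch of
a check for "strict" UCS-2, using the surrogate ranges quoted above
(D800 to DFFF). The helper names is_surrogate() and is_strict_ucs2()
are hypothetical; they are not part of lib/ucs2_string.c, and nothing
in this thread says the kernel performs such a check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for any high or low surrogate code unit. */
static bool is_surrogate(uint16_t c)
{
	return c >= 0xD800 && c <= 0xDFFF;
}

/*
 * Hypothetical helper: true if the NUL-terminated 16-bit string
 * contains no surrogate code units, i.e. only BMP UCS scalar values.
 */
static bool is_strict_ucs2(const uint16_t *s)
{
	assert(s != NULL);

	for (; *s != 0; s++) {
		if (is_surrogate(*s))
			return false;
	}
	return true;
}
```

A name for which is_strict_ucs2() returns false is the spec-violating
case discussed further down.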

The question is then whether this function is usable to convert UEFI
variable names. According to the UEFI standard, it is. (And at the
moment, the function is not supposed to be used by callers other than
UEFI code, according to what Matt Fleming said.)

Now, if a variable name *happens* to contain a high or low surrogate
code unit (violating the UEFI standard!), then the conversion will
output a UTF-8 sequence directly encoding the corresponding surrogate
code point. Even for such (invalid) inputs, the function doesn't decay
to undefined behavior; it is safe, and the output uniquely maps to the
input.
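
The behavior described above can be sketched as follows. This is an
illustration under stated assumptions, not the code from
lib/ucs2_string.c: the function name unit_to_utf8() is hypothetical,
and it simply applies the generic 1-, 2- and 3-byte UTF-8 patterns to a
single 16-bit code unit without treating surrogates specially.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one 16-bit code unit as 1 to 3 bytes
 * and return the number of bytes written. Surrogate code units get
 * no special treatment; they fall into the 3-byte branch like any
 * other value >= 0x800.
 */
static size_t unit_to_utf8(uint16_t c, unsigned char out[3])
{
	assert(out != NULL);

	if (c < 0x80) {
		out[0] = (unsigned char)c;
		return 1;
	}
	if (c < 0x800) {
		out[0] = (unsigned char)(0xC0 | (c >> 6));
		out[1] = (unsigned char)(0x80 | (c & 0x3F));
		return 2;
	}
	out[0] = (unsigned char)(0xE0 | (c >> 12));
	out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
	out[2] = (unsigned char)(0x80 | (c & 0x3F));
	return 3;
}
```

With this sketch, the high surrogate D800 comes out as the bytes
ED A0 80. Each 16-bit input maps to a distinct byte sequence, which is
the "uniquely maps to the input" property, even though a strict UTF-8
decoder would reject that particular sequence.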

Let's consider the use of the converted output for comparison purposes.
No *standard* variable name (encoded as UCS-2) will contain surrogate
code units (that would be in breach of the UEFI spec), so there is no
need to represent them as UTF-8 arrays (constants) in the kernel code,
as long as you would like to compare variable names against standardized
variable names.

Nevertheless, if you would like to represent variable names whose UCS-2
encoding contains surrogate code units (breaking the UEFI spec), that is
still possible: store the unique UTF-8 sequence that ucs2_as_utf8()
transforms the (spec-breaking) variable name into.
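
As a rough illustration of such a comparison, the sketch below matches
a 16-bit variable name against a UTF-8 constant by encoding one code
unit at a time. Everything here is an assumption made for the example:
encode_unit() and name_matches() are hypothetical userspace helpers,
not kernel interfaces, and the names used in the usage note are only
sample inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: generic 1-, 2- or 3-byte encoding of one unit. */
static size_t encode_unit(uint16_t c, unsigned char out[3])
{
	if (c < 0x80) {
		out[0] = (unsigned char)c;
		return 1;
	}
	if (c < 0x800) {
		out[0] = (unsigned char)(0xC0 | (c >> 6));
		out[1] = (unsigned char)(0x80 | (c & 0x3F));
		return 2;
	}
	out[0] = (unsigned char)(0xE0 | (c >> 12));
	out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
	out[2] = (unsigned char)(0x80 | (c & 0x3F));
	return 3;
}

/*
 * Hypothetical helper: return 1 if the NUL-terminated 16-bit name
 * converts to exactly the NUL-terminated byte string in utf8.
 */
static int name_matches(const uint16_t *name, const char *utf8)
{
	const unsigned char *p = (const unsigned char *)utf8;

	assert(name != NULL && utf8 != NULL);

	for (; *name != 0; name++) {
		unsigned char b[3];
		size_t n = encode_unit(*name, b);
		size_t i;

		/*
		 * No encoded byte is ever 0, so a premature NUL in utf8
		 * fails the comparison before we read past it.
		 */
		for (i = 0; i < n; i++) {
			if (p[i] != b[i])
				return 0;
		}
		p += n;
	}
	return *p == 0;
}
```

Under these assumptions, a name made of the units 'X', D800 would match
the constant "X\xED\xA0\x80", while an ordinary ASCII name such as
"Boot0001" would match its plain C string.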

Laszlo