libblkid: udf: Incorrect implementation of Unicode strings

Hi!

Since the beginning, libblkid's UDF code has handled 16-bit OSTA
compressed unicode as UTF-16BE and 8-bit OSTA compressed unicode as
UTF-8.

The UDF 2.01 specification says:
====
For a CompressionID of 8 or 16, the value of the CompressionID shall 
specify the number of BitsPerCharacter for the d-characters defined in 
the CharacterBitStream field. Each sequence of CompressionID bits in the 
CharacterBitStream field shall represent an OSTA Compressed Unicode d-
character. The bits of the character being encoded shall be added to the 
CharacterBitStream from most- to least-significant-bit. The bits shall 
be added to the CharacterBitStream starting from the most significant 
bit of the current byte being encoded into. The value of the OSTA 
Compressed Unicode d-character interpreted as a Uint16 defines the value 
of the corresponding d-character in the Unicode 2.0 standard.
====

So an 8-bit OSTA compressed unicode buffer contains a sequence of
Unicode code points, one per 8 bits. That is effectively equivalent to
the Latin1 (ISO-8859-1) encoding.

And 16-bit OSTA compressed unicode means a sequence of Unicode code
points, one per 16 bits, in big endian. That is probably only UCS-2 and
not full UTF-16.

So the problem is with 8-bit OSTA compressed unicode when it contains
bytes which are not UTF-8 invariants (ASCII), as those are decoded
differently under Latin1 and UTF-8.

This means the libblkid UDF implementation of reading Unicode strings is
wrong, and it affects all read operations (label, UUID, ...).
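
To make the rule from the spec concrete, here is a minimal sketch in
Python of how I read it. The function name is made up for illustration,
and I assume the buffer was already trimmed to the compression ID byte
plus the character bytes (no padding, no trailing length byte). It is
not the libblkid code.

```python
def decode_osta_compressed_unicode(buf: bytes) -> str:
    """Hypothetical helper: decode CompressionID byte + CharacterBitStream."""
    if not buf:
        return ""
    comp_id, stream = buf[0], buf[1:]
    if comp_id == 8:
        # One Unicode code point per byte (U+0000..U+00FF), i.e. Latin1.
        return "".join(chr(b) for b in stream)
    if comp_id == 16:
        if len(stream) % 2:
            raise ValueError("16-bit stream has odd length")
        # One code point per 16 bits, most significant byte first.
        return "".join(
            chr((stream[i] << 8) | stream[i + 1])
            for i in range(0, len(stream), 2)
        )
    raise ValueError(f"unsupported CompressionID {comp_id}")
```

With this reading, the bytes 0x08 0xC3 0xBF give two characters (U+00C3,
U+00BF), not one.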

To verify the problem I prepared a small UDF image (attached) whose
logical volume identifier (known as the label) is:
0x08 0xC3 0xBF 0x00 ... 0x03

According to the spec it should be decoded as the string "Ã¿"
(LATIN CAPITAL LETTER A WITH TILDE, INVERTED QUESTION MARK).

But blkid shows me "ÿ" (LATIN SMALL LETTER Y WITH DIAERESIS).

I checked the grub2 and Windows implementations and they show "Ã¿".
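
The difference can be reproduced without any UDF tooling. This small
Python sketch only decodes the two character bytes of the label
(0xC3 0xBF, after the 0x08 compression ID) in both ways; the variable
names are mine.

```python
# The two character bytes that follow the 0x08 compression ID.
label_bytes = bytes([0xC3, 0xBF])

# Per the spec: one code point per byte, which is what Latin1 does.
as_latin1 = label_bytes.decode("latin-1")

# What blkid effectively does for 8-bit strings: treat the bytes as UTF-8.
as_utf8 = label_bytes.decode("utf-8")

print([f"U+{ord(c):04X}" for c in as_latin1])  # ['U+00C3', 'U+00BF']
print([f"U+{ord(c):04X}" for c in as_utf8])    # ['U+00FF']
```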

So... what to do with the blkid implementation? Fixing it would break
all existing labels and UUIDs on Linux. Not fixing it would mean having
different labels than other systems which implement it properly.

The problem surfaced when I sent a patch implementing the same UUID
algorithm in grub2. (The patch has not been merged yet.)

Note that Linux's mkudffs from udftools generates the label correctly,
so it is also incompatible with the blkid implementation. But because I
tested only ASCII characters and Unicode characters above U+FF, I did
not detect this problem... (ASCII is the same in UTF-8 and Latin1, and
characters above U+FF can be encoded only as UTF-16 or UCS-2.)
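
As a rough check of which test characters would have exposed this, the
following Python sketch compares the Latin1 and UTF-8 byte sequences
for every code point that fits in 8 bits. Only U+0080..U+00FF differ,
so a test label needs at least one character from that range.

```python
# Code points below U+0100 whose Latin1 and UTF-8 encodings differ.
differing = [
    cp for cp in range(0x100)
    if chr(cp).encode("latin-1") != chr(cp).encode("utf-8")
]

# ASCII is identical in both encodings; everything from U+0080 up differs.
print(f"U+{differing[0]:04X}..U+{differing[-1]:04X}, {len(differing)} code points")
```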

-- 
Pali Rohár
pali.rohar@xxxxxxxxx

Attachment: udf.img.xz
Description: application/xz

Attachment: signature.asc
Description: This is a digitally signed message part.

