Re: libblkid: udf: Incorrect implementation of Unicode strings

On Monday 15 May 2017 12:09:40 Karel Zak wrote:
> On Fri, May 12, 2017 at 04:38:59PM +0200, Pali Rohár wrote:
> > Hi!
> > 
> > Since the beginning, libblkid's udf code has handled 16bit OSTA
> > compressed unicode as UTF-16BE and 8bit OSTA compressed unicode as
> > UTF-8.
> > 
> > The UDF 2.01 specification says:
> > ====
> > For a CompressionID of 8 or 16, the value of the CompressionID
> > shall specify the number of BitsPerCharacter for the d-characters
> > defined in the CharacterBitStream field. Each sequence of
> > CompressionID bits in the CharacterBitStream field shall represent
> > an OSTA Compressed Unicode d-character. The bits of the character
> > being encoded shall be added to the CharacterBitStream from most-
> > to least-significant-bit. The bits shall be added to the
> > CharacterBitStream starting from the most significant bit of the
> > current byte being encoded into. The value of the OSTA Compressed
> > Unicode d-character interpreted as a Uint16 defines the value of
> > the corresponding d-character in the Unicode 2.0 standard.
> > ====
> > 
> > So an 8bit OSTA compressed unicode buffer contains a sequence of
> > Unicode codepoints, one per 8 bits, which is effectively
> > equivalent to the Latin1 (ISO-8859-1) encoding.
> > 
> > And 16bit OSTA compressed unicode means a sequence of Unicode
> > codepoints, one per 16 bits, in big endian, which is probably only
> > UCS-2 and not full UTF-16.
> > 
> > So the problem is with 8bit OSTA compressed unicode that contains
> > bytes which are not UTF-8 invariants (ASCII), as those are decoded
> > differently under Latin1 and UTF-8.
> > 
> > This means the libblkid udf implementation of reading Unicode
> > strings is wrong, which affects all read operations (Label, UUID,
> > ...).
> > 
> > To verify this problem I prepared a small udf image (attached)
> > which has this logical volume identifier (known as the label):
> > 0x08 0xC3 0xBF 0x00 ... 0x03
> > 
> > According to the spec it should be decoded as the string "Ã¿"
> > (LATIN CAPITAL LETTER A WITH TILDE, INVERTED QUESTION MARK).
> > 
> > But blkid shows me "ÿ" (LATIN SMALL LETTER Y WITH DIAERESIS).
> > 
> > I checked the grub2 and Windows implementations and they show "Ã¿".
> > 
> > So... what to do with the blkid implementation? Fixing it would
> > break all existing labels and uuids on Linux. Not fixing it would
> > mean having labels that differ from other systems which implement
> > it properly.
> 
> The issue has never been reported, so I guess the number of the
> affected LABELs is pretty small :-)
> 
> From my point of view it would be better to follow the standard, fix
> the issue and be compatible with other utils and systems. It
> would be nice to fix it now for v2.30 where we already have changes
> in udf/iso stuff. Please, send the patch :-)

The fix for all UDF strings except the UUID is in this pull request:
https://github.com/karelzak/util-linux/pull/438

I hope it is correct now. A UDF image with "Ã¿" is added to the tests.
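
For illustration, the decoding described in the spec excerpt can be
sketched in C roughly as below. This is only a minimal sketch, not the
code from the pull request: the function name osta_to_utf8 and its
signature are made up for this example, the input is assumed to be the
CompressionID byte followed by the character bytes (with the trailing
dstring length byte already stripped), and 16bit values are treated as
plain UCS-2 with no surrogate handling.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the libblkid function): convert an OSTA
 * compressed Unicode buffer to a NUL-terminated UTF-8 string.
 * in[0] is the CompressionID (8 or 16); the rest are code points,
 * one per byte (8) or one per big-endian 16-bit unit (16).
 * Returns the number of UTF-8 bytes written (without the NUL),
 * or -1 on an unknown CompressionID or insufficient output space.
 */
int osta_to_utf8(const uint8_t *in, size_t in_len,
                 char *out, size_t out_len)
{
    size_t i, o = 0, step;

    if (in_len < 1 || out_len < 1)
        return -1;

    if (in[0] == 8)
        step = 1;
    else if (in[0] == 16)
        step = 2;
    else
        return -1;

    for (i = 1; i + step <= in_len; i += step) {
        /* Each unit is a Unicode code point, not a UTF-8 byte. */
        uint16_t c = (step == 1)
            ? in[i]
            : (uint16_t)((in[i] << 8) | in[i + 1]);

        if (c < 0x80) {
            if (o + 1 >= out_len)
                return -1;
            out[o++] = (char)c;
        } else if (c < 0x800) {
            if (o + 2 >= out_len)
                return -1;
            out[o++] = (char)(0xC0 | (c >> 6));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else {
            /* Assumption: UCS-2 only, surrogates are not combined. */
            if (o + 3 >= out_len)
                return -1;
            out[o++] = (char)(0xE0 | (c >> 12));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }

    out[o] = '\0';
    return (int)o;
}
```

With the label bytes 0x08 0xC3 0xBF from the test image, this should
produce the UTF-8 bytes C3 83 C2 BF, i.e. "Ã¿", instead of passing
C3 BF through unchanged as "ÿ".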

-- 
Pali Rohár
pali.rohar@xxxxxxxxx
