On Monday 15 May 2017 12:09:40 Karel Zak wrote:
> On Fri, May 12, 2017 at 04:38:59PM +0200, Pali Rohár wrote:
> > Hi!
> >
> > Since beginning libblkid's udf code handles 16bit OSTA compressed
> > unicode as UTF-16BE and 8bit OSTA compressed unicode as UTF-8.
> >
> > In UDF 2.01 specification is written:
> > ====
> > For a CompressionID of 8 or 16, the value of the CompressionID
> > shall specify the number of BitsPerCharacter for the d-characters
> > defined in the CharacterBitStream field. Each sequence of
> > CompressionID bits in the CharacterBitStream field shall represent
> > an OSTA Compressed Unicode d-character. The bits of the character
> > being encoded shall be added to the CharacterBitStream from most-
> > to least-significant-bit. The bits shall be added to the
> > CharacterBitStream starting from the most significant bit of the
> > current byte being encoded into. The value of the OSTA Compressed
> > Unicode d-character interpreted as a Uint16 defines the value of
> > the corresponding d-character in the Unicode 2.0 standard.
> > ====
> >
> > So it means that 8bit OSTA compressed unicode buffer contains
> > sequence of Unicode codepoints, one per 8 bits. What effectively
> > means equivalence with Latin1 (ISO-8859-1) encoding.
> >
> > And 16bit OSTA compressed unicode means sequence of Unicode
> > codepoints, one per 16 bits in big endian. What is probably only
> > UCS-2 and not full UTF-16.
> >
> > So problem is with 8bit OSTA compressed unicode if contains bytes
> > which are not UTF-8 invariants (ASCII). As those are decoded
> > differently with Latin1 and UTF-8.
> >
> > Which means libblkid udf implementation of reading Unicode strings
> > is wrong and affects all read operations (Label, UUID, ...).
> >
> > To verify this problem I prepared small udf image (attached) which
> > has logical volume identifier (known as label): 0x08 0xC3 0xBF
> > 0x00 ... 0x03
> >
> > According to spec it should be decoded as string "Ã¿"
> > (LATIN CAPITAL LETTER A WITH TILDE, INVERTED QUESTION MARK).
> >
> > But blkid show me "ÿ" (LATIN SMALL LETTER Y WITH DIAERESIS).
> >
> > I checked grub2 and Windows implementations and they show "Ã¿".
> >
> > So... what to do with blkid implementation? Fixing it would mean to
> > break all existing labels and uuids on Linux. Not fixing it would
> > mean to have different labels across different systems which
> > implements it properly.
>
> The issue has never been reported, so I guess the number of the
> affected LABELs is pretty small :-)
>
> From my point of view it would be better to follow the standard, fix
> the issue and be compatible with the another utils and systems. It
> would be nice to fix it now for v2.30 where we already have changes
> in udf/iso stuff. Please, send the patch :-)

Fix for all UDF strings except UUID is in this pull request:
https://github.com/karelzak/util-linux/pull/438

I hope it is correct now. UDF image with "ÿ" is added to tests.

-- 
Pali Rohár
pali.rohar@xxxxxxxxx
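
As a rough illustration of the decoding rule quoted from UDF 2.01 above, the difference between the two interpretations can be sketched as follows. This is only a sketch and not the code from the pull request: the helper name `decode_osta_cs0` is hypothetical, the input is assumed to be the CompressionID byte followed by the character bytes only (padding and the trailing length byte already stripped), and 16-bit data is treated as plain UCS-2, as the message suggests.

```python
def decode_osta_cs0(buf: bytes) -> str:
    """Decode an OSTA compressed unicode buffer (hypothetical helper).

    Assumes buf[0] is the CompressionID and the rest is the
    CharacterBitStream with padding and length byte already removed.
    """
    if not buf:
        return ""
    comp_id, data = buf[0], buf[1:]
    if comp_id == 8:
        # One Unicode code point per byte, i.e. the same mapping
        # as ISO-8859-1 (Latin1), not UTF-8.
        return data.decode("latin-1")
    if comp_id == 16:
        # One code point per big-endian 16-bit unit (treated as UCS-2).
        # Assumption: a trailing odd byte is ignored.
        usable = len(data) - (len(data) % 2)
        return "".join(
            chr(int.from_bytes(data[i:i + 2], "big"))
            for i in range(0, usable, 2)
        )
    raise ValueError(f"unsupported CompressionID: {comp_id}")


label = bytes([0x08, 0xC3, 0xBF])   # bytes from the test image's label
per_spec = decode_osta_cs0(label)   # U+00C3 U+00BF
as_utf8 = label[1:].decode("utf-8") # U+00FF, the UTF-8 misreading
```

With the label bytes from the message, the per-spec result is the two-character string U+00C3 U+00BF, while reading the same bytes as UTF-8 gives the single character U+00FF.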