On Fri, May 12, 2017 at 04:38:59PM +0200, Pali Rohár wrote:
> Hi!
>
> Since beginning libblkid's udf code handles 16bit OSTA compressed
> unicode as UTF-16BE and 8bit OSTA compressed unicode as UTF-8.
>
> In UDF 2.01 specification is written:
> ====
> For a CompressionID of 8 or 16, the value of the CompressionID shall
> specify the number of BitsPerCharacter for the d-characters defined in
> the CharacterBitStream field. Each sequence of CompressionID bits in the
> CharacterBitStream field shall represent an OSTA Compressed Unicode
> d-character. The bits of the character being encoded shall be added to
> the CharacterBitStream from most- to least-significant-bit. The bits shall
> be added to the CharacterBitStream starting from the most significant
> bit of the current byte being encoded into. The value of the OSTA
> Compressed Unicode d-character interpreted as a Uint16 defines the value
> of the corresponding d-character in the Unicode 2.0 standard.
> ====
>
> So it means that 8bit OSTA compressed unicode buffer contains sequence
> of Unicode codepoints, one per 8 bits. What effectively means
> equivalence with Latin1 (ISO-8859-1) encoding.
>
> And 16bit OSTA compressed unicode means sequence of Unicode codepoints,
> one per 16 bits in big endian. What is probably only UCS-2 and not full
> UTF-16.
>
> So problem is with 8bit OSTA compressed unicode if contains bytes which
> are not UTF-8 invariants (ASCII). As those are decoded differently
> with Latin1 and UTF-8.
>
> Which means libblkid udf implementation of reading Unicode strings is
> wrong and affects all read operations (Label, UUID, ...).
>
> To verify this problem I prepared small udf image (attached) which has
> logical volume identifier (known as label): 0x08 0xC3 0xBF 0x00 ... 0x03
>
> According to spec it should be decoded as string "Ã¿"
> (LATIN CAPITAL LETTER A WITH TILDE, INVERTED QUESTION MARK).
>
> But blkid show me "ÿ" (LATIN SMALL LETTER Y WITH DIAERESIS).
>
> I checked grub2 and Windows implementations and they show "Ã¿".
>
> So... what to do with blkid implementation? Fixing it would mean to
> break all existing labels and uuids on Linux. Not fixing it would mean
> to have different labels across different systems which implements it
> properly.

The issue has never been reported, so I guess the number of affected
LABELs is pretty small :-)

From my point of view it would be better to follow the standard, fix
the issue and be compatible with the other utils and systems.

It would be nice to fix it now for v2.30, where we already have changes
in the udf/iso stuff. Please send the patch :-)

    Karel

-- 
 Karel Zak  <kzak@xxxxxxxxxx>
 http://karelzak.blogspot.com
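
For illustration, the spec-conforming decoding described in the quoted
report could look roughly like the C sketch below. It is not the libblkid
patch: the names osta_cs0_to_utf8() and put_utf8() are hypothetical, and
the sketch assumes the caller has already cut the buffer to the used
length (the final byte of the dstring, 0x03 in the test image). It also
assumes 16bit data is plain UCS-2, as suggested above, and simply rejects
surrogate values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write one BMP code point as UTF-8; returns bytes written, 0 if no room. */
static size_t put_utf8(uint32_t cp, unsigned char *out, size_t avail)
{
	if (cp < 0x80) {
		if (avail < 1)
			return 0;
		out[0] = (unsigned char) cp;
		return 1;
	}
	if (cp < 0x800) {
		if (avail < 2)
			return 0;
		out[0] = (unsigned char) (0xC0 | (cp >> 6));
		out[1] = (unsigned char) (0x80 | (cp & 0x3F));
		return 2;
	}
	if (avail < 3)
		return 0;
	out[0] = (unsigned char) (0xE0 | (cp >> 12));
	out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
	out[2] = (unsigned char) (0x80 | (cp & 0x3F));
	return 3;
}

/*
 * Hypothetical helper (not libblkid API): decode OSTA compressed unicode
 * to a NUL-terminated UTF-8 string.
 *
 * in[0] is the CompressionID (8 or 16), in[1..len-1] the character data.
 * Returns the number of UTF-8 bytes written (without NUL), or -1 on error.
 */
int osta_cs0_to_utf8(const unsigned char *in, size_t len,
		     char *out, size_t outsz)
{
	size_t i, o = 0, step;

	assert(in != NULL && out != NULL);

	if (len < 1 || outsz < 1)
		return -1;

	if (in[0] == 8)
		step = 1;	/* one code point per byte (Latin1 range) */
	else if (in[0] == 16)
		step = 2;	/* one code point per 16 bits, big endian */
	else
		return -1;	/* other CompressionIDs not handled here */

	for (i = 1; i + step <= len; i += step) {
		uint32_t cp = (step == 1) ? in[i]
			: ((uint32_t) in[i] << 8) | in[i + 1];
		size_t n;

		/* Assumption: UCS-2 only, so surrogates are an error. */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return -1;

		/* Keep one byte of the output buffer for the NUL. */
		n = put_utf8(cp, (unsigned char *) out + o, outsz - 1 - o);
		if (n == 0)
			return -1;
		o += n;
	}
	out[o] = '\0';
	return (int) o;
}
```

With the label from the test image (0x08 0xC3 0xBF), this yields the UTF-8
bytes C3 83 C2 BF, i.e. "Ã¿", whereas passing the two data bytes through
unchanged as UTF-8 gives "ÿ".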