Re: libblkid: udf: Incorrect implementation of Unicode strings

Karel Zak <kzak@xxxxxxxxxx> · Wed, 17 May 2017 09:13:33 +0200

On Tue, May 16, 2017 at 04:02:45PM +0200, Pali Rohár wrote:
> > > > If yes... then we can keep it unchanged and generate the UUID in the
> > > > same way as now (hexadecimal digits). The "OSTA Unicode fix" may be
> > > > used for LABEL= (etc.) only. I guess nothing forces us to generate
> > > > UUIDs from the decoded VolSetId.
> > > > 
> > > > Anyway, UUID has to be printable.
> > > 
> > > Let's first define the allowed characters in a UUID and then decide
> > > what we do with UDF's UUID.
> > > 
> > > Does printable mean only printable ASCII? Or also printable Unicode?
> > > Or only alphanumeric?
> > 
> > I'd like to be very conservative and avoid anything other than ASCII.
> > It's an identifier that should be usable everywhere.
> > 
> > udev uses the UUID for paths and symlinks; "bad chars" are escaped,
> > which is very user-unfriendly. We should also be friendly to non-UTF
> > users, terminals, etc.
> > 
> > IMHO the best solution would be to use lowercase hex digits, as for
> > other filesystems (and the ideal would be to follow the UUID notation
> > for formatting, e.g. "c5490147-2a6c-4c8a-aa1b-33492034f927" ;-).
> 
> We have only 16 Unicode characters (and the first 8 are hex digits), so
> the 128-bit UUID notation above is not possible.
> 
> Currently VolSetID is parsed as bytes instead of (Unicode) characters.
> We can parse it correctly, read the first 16 characters, convert them to
> UTF-8, and then use those UTF-8 bytes as input for generating the UUID.
> This step has the advantage that it deals with Unicode (and does not
> depend on the internal representation of the VolSetID string stored in
> UDF) and also that it produces normalized bytes which can be used later...
> 
> You want to have only lowercase hex digits in the UUID. I understand the
> reason; it makes sense. But how do we generate a UUID from a (potentially
> arbitrary) UTF-8 sequence of 16 Unicode characters, given that UTF-8 is
> a variable-length encoding?
> 
> Currently the UUID generator splits those 16 chars/bytes into a first and
> a second half, because according to the UDF standard the first half should
> contain only hex digits (and in most cases it really does!). A half which
> is not alphanumeric is encoded via %02x per byte, and the final string is
> truncated to 16 bytes (to have a fixed length).
> 
> What we can do is take the UTF-8 sequence (instead of the raw UDF bytes)
> and encode non-hex-digit bytes (instead of non-alnum ones) via %02x, and
> again truncate to 16 hex digits.

This is what I expected... don't think about them as characters,
but as random bytes that we print as %02x. The result will
always be the same for the same UDF header, right?
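
As a rough illustration of the %02x idea, here is a minimal C sketch. It
assumes the VolSetID has already been decoded to UTF-8 bytes. The names
sketch_udf_uuid() and UUID_LEN are hypothetical; this is not the libblkid
code. Lowercasing literal hex digits is also an assumption, made so that
the output contains only lowercase hex digits.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define UUID_LEN 16	/* fixed output length discussed in the thread */

/*
 * Hypothetical helper (not part of libblkid): copy hex-digit bytes as
 * lowercase, print every other byte as %02x, and stop after UUID_LEN
 * output characters. The output is shorter if the input runs out first.
 */
static void sketch_udf_uuid(const unsigned char *utf8, size_t len,
			    char out[UUID_LEN + 1])
{
	size_t i, n = 0;

	for (i = 0; i < len && n < UUID_LEN; i++) {
		if (isxdigit(utf8[i])) {
			out[n++] = (char) tolower(utf8[i]);
		} else {
			char hex[3];

			snprintf(hex, sizeof(hex), "%02x", utf8[i]);
			out[n++] = hex[0];
			if (n < UUID_LEN)	/* truncation may split a byte */
				out[n++] = hex[1];
		}
	}
	out[n] = '\0';
}
```

The function only looks at bytes, never at characters, so the same header
bytes always give the same string. Note that truncation can cut a %02x
pair in half, and that multi-byte UTF-8 characters consume the 16 output
digits faster than ASCII ones.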

Another option would be to use a hash sum to standardize an arbitrary
number of bytes (for example, we use MD5 to generate the UUID in
libblkid/src/superblocks/hfs.c). In this case we could also use some
other bytes from the header, for example volume_descriptor.tag. The
disadvantage is the dependence on checksum code, and so poor portability
to other projects (grub, etc.).
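
To show the shape of the hash approach without pulling in MD5 code, here
is a small sketch that uses 64-bit FNV-1a as a stand-in. FNV-1a is a
deliberate substitution to keep the example short; it is not what hfs.c
uses. The function name sketch_hash_uuid() is hypothetical.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: reduce an arbitrary number of header bytes to
 * exactly 16 lowercase hex digits. Uses 64-bit FNV-1a as a stand-in
 * for a real checksum such as MD5.
 */
static void sketch_hash_uuid(const unsigned char *buf, size_t len,
			     char out[17])
{
	uint64_t h = UINT64_C(0xcbf29ce484222325);	/* FNV-1a offset basis */
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= buf[i];
		h *= UINT64_C(0x100000001b3);		/* FNV-1a 64-bit prime */
	}
	snprintf(out, 17, "%016" PRIx64, h);
}
```

Whatever hash is chosen, other projects would have to implement the same
one to produce the same UUID, which is the portability cost mentioned
above.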

    Karel

-- 
 Karel Zak  <kzak@xxxxxxxxxx>
 http://karelzak.blogspot.com