Re: Eliminating UDF iocharset!=utf8 code (Re: [PATCH 6/8] Support non-BMP characters in UDF)

Vladimir 'φ-coder/phcoder' Serbinenko <phcoder@xxxxxxxxx> · Thu, 17 May 2012 17:30:32 +0200

On 17.05.2012 16:40, Jan Kara wrote:

> On Thu 17-05-12 02:48:49, Vladimir 'φ-coder/phcoder' Serbinenko wrote:
>>
>>> I've noticed another duplication in the UDF code: there
>>> is NLS support and separate UTF-8 support. UTF-8 is support by 2 ways
>>> actually: with -o utf8 and -o iocharset=utf8 which imply different
>>> codepaths. Specific UTF-8 support is probably slightly faster by
>>> avoiding calls and basically doing everything with shifts (or can be
>>> made so with a small patch). Should I perhaps kill one of them? Is
>>> iocharset!=utf8 still of any importance? I haven't seen it in ages.
>>> Perhaps we could keep just the performant UTF-8 support and map
>>> iocharset=utf8 to it and drop iocharset!=utf8? iocharset!=utf8 probably
>>> has no users anyway so keeping it we're likely to keep bugs and code
>>> duplication with no benefit.
>>>
>>
>> Linux seems to support UTF-8-only pretty strongly: http://yarchive.net/comp/linux/utf8.html
>> (message from Sun, 15 Feb 2004 02:42:45 GMT).
>> And I completely agree.
>> If it's ok to kill iocharset!=utf8 I'll propose a series of 3 patches (killing iocharset!=utf8,
>> extending utf16toutf8/utf8toutf16 for unaligned input, changing UDF code to use common functions)
>   Well, yes, utf8 is currently the only sane setting but that doesn't mean
> someone isn't using (e.g. iso8859-2) for strange reasons...

What would be the correct behaviour if we encounter the characters which
can't be represented in the given charset? Currently the code replaces
them with question marks but since this doesn't complete round trip
successfully someone attempting to open or stat the file by name won't
be able to. So these files become pretty much "ghosts" that you see but
can't do anything with them. Hiding them altogether would lead to
situations when the disk appears empty but df shows that it's 100% full.
While encodings like iso-8859-1 are relatively straightforward, some
other (East Asian) encodings may produce '/' as part of another
character and so confuse the kernel. Such encodings are also stateful
and I'm pretty sure that current code bugs on them.
I don't know if these quirks can be used to make a program load a file
it wasn't intended to and whether it's of any security concern.
I'm aware of bash security problems with such characters when part of
Chinese character is interpreted as backtick.
I don't think that these problems can create a security hole on kernel
side, they can be used to confuse userspace but I doubt it's anything
exploitable but it's something I'd be doubtful about.

> We should
> regress in user visible functionality only for really good reasons and here
> I don't see a strong reason. So I'd like to keep current iocharset mount
> option and make utf8 option equivalent to iocharset=utf8. Since I don't
> think the speed benefit of dedicated CS0<->UTF8 functions is really that
> big and UDF isn't exactly a filesystem where it would matter anyway, I'd
> just remove those dedicated functions and use the generic ones instead.

Ok, I'll prepare a patch.
-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko

Attachment:
signature.asc

Description: OpenPGP digital signature