Re: [PATCH] hfsplus: fixes worst-case unicode to char conversion of file names

Hin-Tak Leung <htl10@xxxxxxxxxxxxxxxxxxxxx> · Sat, 5 Apr 2014 23:09:56 +0100 (BST)

Hi Anton,

------------------------------
On Sat, Apr 5, 2014 9:37 PM BST Anton Altaparmakov wrote:

>Hi Hin-Tak,
>
>On 4 Apr 2014, at 23:11, Hin-Tak Leung <htl10@xxxxxxxxxxxxxxxxxxxxx> wrote:
>> On Fri, Apr 4, 2014 10:24 PM BST Anton Altaparmakov wrote:
>> 
>> On 4 Apr 2014, at 20:46, Hin-Tak Leung <hintak.leung@xxxxxxxxx> wrote:
>>> From: Hin-Tak Leung <htl10@xxxxxxxxxxxxxxxxxxxxx>
>>> 
>>> The HFS Plus Volume Format specification (TN1150) states that
>>> file names are stored internally as a maximum of 255 unicode
>>> characters, as defined by The Unicode Standard, Version 2.0
>>> [Unicode, Inc. ISBN 0-201-48345-9]. File names are converted by
>>> the NLS system on Linux before presented to the user.
>>> 
>>> Though it is rare, the worst-case is 255 CJK characters converting
>>> to UTF-8 with 1 unicode character to 3 bytes. Surrogate pairs are
>>> no worse. The receiver buffer needs to be 255 x 3 bytes,
>>> not 255 bytes as the code has always been.
>> 
>> You are correct that that buffer is too small.  However:
>> 
>> 1) The correct size for the buffer is NLS_MAX_CHARSET_SIZE * HFSPLUS_MAX_STRLEN + 1, rather than one computed from the magic constant "3". That constant is actually not big enough if the string stores UTF-16 rather than UCS-2 Unicode. I have observed this on NTFS written to by Asian versions of Windows, and I see no reason why it could not happen on OS X too, especially on an HFS+ volume that has been written to by a Windows HFS+ driver, even if the native OS X driver would not normally do it (I admit I have not looked at it).  That reliable source of information Wikipedia suggests Mac OS X also uses UTF-16 as of OS X 10.3, at least in userspace, so chances are it either also uses it in the kernel or, if not yet, might well do so in future:
>> 
>>     http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
>> 
>> 2) You are now allocating a huge buffer on the stack.  This is not a good thing to do in the kernel (think 4k stack kernel config - that single variable is consuming about a quarter of available stack).  You need to allocate the buffer dynamically.  As you only need to do the allocation on entry to hfsplus_readdir() and deallocate it on exit it is not a problem as it could be if you had to allocate/free for every filename.
>> 
>> 
>> Hi Anton,
>> 
>> Thanks for the comments.
>
>You are welcome.
>
>> NLS_MAX_CHARSET_SIZE is 6 in include/linux/nls.h, but I think it is too generous in this case. It is correct that a unicode character needs at worst 6 bytes to encode, but those in the upper range of that, when encoded in UTF-16, would require a surrogate pair - i.e. it goes from *two* UTF-16 units to 6 bytes. So that's still x3, not x6. Also Unicode 2.0 covers only the first supplementary plane, and only requires up to 4 bytes. So that's what the "Surrogate pairs are no worse." part of my message was about. Please correct me if this reasoning is wrong.
>
>Yes, I think as things stand you are correct.  However, using NLS_MAX_CHARSET_SIZE has the advantage of being future proof - any changes to character handling in the kernel and/or to Unicode standards will automatically be fixed in HFS+ when using this constant.  As this is a transient buffer only allocated for the duration of the system call it really does not matter whether it is 766 or 1531 bytes.  In either case kmalloc() will allocate it from the respective slab (assuming slab allocator in use), and as both are smaller than the smallest PAGE_SIZE on any architecture there is no increase in memory pressure or anything else from allocating the bigger buffer.  So whilst you are technically correct, if I were writing the code I would definitely use NLS_MAX_CHARSET_SIZE.
>
>If you decide to definitely use "3" I suggest you at least make a #define with some sensible name of your choice and also add a comment to say that it may need increasing in future if NLS handling in the kernel and/or the Unicode standard changes...
>

Thanks for the comments. If I had known of the constant, I'd have used it :-). Actually I think the comment around NLS_MAX_CHARSET_SIZE is, strictly speaking, wrong/misleading. HFS+ internally is definitely UTF-16 BE, so the question is just what the worst case is when converting UTF-16 BE to any encoding. The worst case of 6 bytes with UTF-8 is only achievable through surrogate pairs (i.e. two 16-bit units), but I think UTF-8 may not be the worst. GB18030 covers the whole of Unicode, but at different code points, which means it may be possible that some code points in the basic plane of Unicode map to a higher plane in GB18030. If that's the case, 4 bytes are needed, rather than 3.

I think I'll use the constant, but put a comment in that it is somewhat wasteful, as it is almost certain that at most half or 2/3 of it is needed. I'll
also switch to dynamic allocation and, since this is simple enough, will possibly just add the related change in the extended attribute code which
Vyacheslav points out. I will then prepare v2 of the patch.

Hin-Tak
P.S. until a few months ago, I had an @*.cam e-mail address also, though not on hermes.
