Hi Hin-Tak,

On Apr 5, 2014, at 2:11 AM, Hin-Tak Leung wrote:

> ------------------------------
> On Fri, Apr 4, 2014 10:24 PM BST Anton Altaparmakov wrote:
>
>> Hi,
>>
>> On 4 Apr 2014, at 20:46, Hin-Tak Leung <hintak.leung@xxxxxxxxx> wrote:
>>> From: Hin-Tak Leung <htl10@xxxxxxxxxxxxxxxxxxxxx>
>>>
>>> The HFS Plus Volume Format specification (TN1150) states that
>>> file names are stored internally as a maximum of 255 Unicode
>>> characters, as defined by The Unicode Standard, Version 2.0
>>> [Unicode, Inc. ISBN 0-201-48345-9]. File names are converted by
>>> the NLS system on Linux before being presented to the user.
>>>
>>> Though it is rare, the worst case is 255 CJK characters converting
>>> to UTF-8, with one Unicode character expanding to 3 bytes.
>>> Surrogate pairs are no worse. The receiving buffer needs to be
>>> 255 x 3 bytes, not the 255 bytes the code has always used.
>>
>> You are correct that that buffer is too small. However:
>>
>> 1) The correct size for the buffer is NLS_MAX_CHARSET_SIZE *
>> HFSPLUS_MAX_STRLEN + 1, not a magic constant "3". Three bytes per
>> character is actually not big enough if the string stores UTF-16
>> rather than UCS-2 Unicode. I have observed this happen on NTFS
>> volumes written to by Asian versions of Windows, and I see no reason
>> why it could not happen on OS X, too, especially on an HFS+ volume
>> that has been written to by a Windows HFS+ driver, even if the native
>> OS X driver would not normally do it (I admit I have not looked at
>> it). That reliable source of information, Wikipedia, suggests Mac
>> OS X has also used UTF-16 since OS X 10.3, at least in userspace, so
>> chances are it either also uses it in the kernel or, if not yet, it
>> might well do so in future:
>>
>> http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
>>
>> 2) You are now allocating a huge buffer on the stack. This is not a
>> good thing to do in the kernel (think of a 4k-stack kernel config;
>> that single variable consumes about a quarter of the available
>> stack). You need to allocate the buffer dynamically.
>> As you only need to do the allocation on entry to hfsplus_readdir()
>> and deallocate it on exit, it is not the problem it would be if you
>> had to allocate/free for every filename.
>>
>
> Hi Anton,
>
> Thanks for the comments. NLS_MAX_CHARSET_SIZE is 6 in
> include/linux/nls.h, but I think it is too generous in this case. It
> is correct that a Unicode character needs at worst 6 bytes to encode,
> but the characters in the upper range, when encoded in UTF-16,
> require a surrogate pair; i.e. they go from *two* UTF-16 units to
> 6 bytes. So that is still x3, not x6. Also, Unicode 2.0 covers only
> the first supplementary plane, and only requires up to 4 bytes. That
> is what the "Surrogate pairs are no worse." part of my message was
> about. Please correct me if this reasoning is wrong.
>
> I'll switch to dynamic allocation and prepare a revised patch, after
> further discussion on the x3 vs x6 issue.
>

Thank you for the patch. I have some additional remarks.

(1) First, I think it makes sense to describe the reproduction path
and the resulting crash (if that is possible, of course). Such a
description can be useful for distinguishing users' issues.

(2) I agree that the magic "3" is not a good way either. :) A named
constant is much better.

(3) The buffer will now be big, so allocating it on the stack is a
dangerous way; I agree with that. It needs to use dynamic allocation
(kmalloc/kfree) or a look-aside cache
(kmem_cache_alloc/kmem_cache_free).

(4) Unfortunately, we have the same issue for the extended attributes
case (HFSPLUS_ATTR_MAX_STRLEN). It needs to be fixed too. Could you
fix it?

Thanks,
Vyacheslav Dubeyko.