------------------------------ On Fri, Apr 4, 2014 10:24 PM BST Anton Altaparmakov wrote: >Hi, > >On 4 Apr 2014, at 20:46, Hin-Tak Leung <hintak.leung@xxxxxxxxx> wrote: >> From: Hin-Tak Leung <htl10@xxxxxxxxxxxxxxxxxxxxx> >> >> The HFS Plus Volume Format specification (TN1150) states that >> file names are stored internally as a maximum of 255 unicode >> characters, as defined by The Unicode Standard, Version 2.0 >> [Unicode, Inc. ISBN 0-201-48345-9]. File names are converted by >> the NLS system on Linux before presented to the user. >> >> Though it is rare, the worst-case is 255 CJK characters converting >> to UTF-8 with 1 unicode character to 3 bytes. Surrogate pairs are >> no worse. The receiver buffer needs to be 255 x 3 bytes, >> not 255 bytes as the code has always been. > >You are correct that that buffer is too small. However: > >1) The correct size for the buffer is NLS_MAX_CHARSET_SIZE * HFSPLUS_MAX_STRLEN + 1 and not using a magic constant "3" (which is actually not big enough in case the string is storing UTF-16 rather than UCS-2 Unicode which I have observed happen on NTFS written to by asian versions of Windows but I see no reason why it could not happen on OS X, too, especially on a HFS+ volume that has been written to by a Windows HFS+ driver - even if native OS X driver would not normally do it - I have not looked at it I admit). That reliable source of information Wikipedia suggests Mac OS X also uses UTF-16 as of OS X 10.3 at least in userspace so chances are it either also uses it in the kernel or if not yet it might well do in future: > > http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings > >2) You are now allocating a huge buffer on the stack. This is not a good thing to do in the kernel (think 4k stack kernel config - that single variable is consuming about a quarter of available stack). You need to allocate the buffer dynamically. As you only need to do the allocation on entry to hfsplus_readdir() and deallocate it on exit it is not a problem as it could be if you had to allocate/free for every filename. > Hi Anton, Thanks for the comments. NLS_MAX_CHARSET_SIZE is 6 include/linux/nls.h but I think it is too generous in this case. It is correct that a unicode character needs at worst 6 bytes to code, but those in the upper range of that when encoded in UTF-16 would require a surrogate pair - i.e. it goes from *two* UTF-16 units to 6 bytes. So that's still x3, not x6. Also Unicode 2.0 covers only the first supplementary plane, and only requires up to 4 bytes. So that's what my "Surrogate pairs are no worse." part of the message was about. Please correct me if this reasoning is wrong. I'll switch to dynamic allocation and prepare a revised patch, after further discussion on the x3 vs x6 issue. Hin-Tak -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html