Hi Hin-Tak,

On 4 Apr 2014, at 23:11, Hin-Tak Leung <htl10@xxxxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Apr 4, 2014 10:24 PM BST Anton Altaparmakov wrote:
>>
>> On 4 Apr 2014, at 20:46, Hin-Tak Leung <hintak.leung@xxxxxxxxx> wrote:
>>> From: Hin-Tak Leung <htl10@xxxxxxxxxxxxxxxxxxxxx>
>>>
>>> The HFS Plus Volume Format specification (TN1150) states that
>>> file names are stored internally as a maximum of 255 unicode
>>> characters, as defined by The Unicode Standard, Version 2.0
>>> [Unicode, Inc. ISBN 0-201-48345-9]. File names are converted by
>>> the NLS system on Linux before presented to the user.
>>>
>>> Though it is rare, the worst-case is 255 CJK characters converting
>>> to UTF-8 with 1 unicode character to 3 bytes. Surrogate pairs are
>>> no worse. The receiver buffer needs to be 255 x 3 bytes,
>>> not 255 bytes as the code has always been.
>>
>> You are correct that that buffer is too small. However:
>>
>> 1) The correct size for the buffer is NLS_MAX_CHARSET_SIZE * HFSPLUS_MAX_STRLEN + 1 and not using a magic constant "3" (which is actually not big enough in case the string is storing UTF-16 rather than UCS-2 Unicode which I have observed happen on NTFS written to by Asian versions of Windows but I see no reason why it could not happen on OS X, too, especially on an HFS+ volume that has been written to by a Windows HFS+ driver - even if native OS X driver would not normally do it - I have not looked at it I admit). That reliable source of information Wikipedia suggests Mac OS X also uses UTF-16 as of OS X 10.3 at least in userspace so chances are it either also uses it in the kernel or if not yet it might well do in future:
>>
>> http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
>>
>> 2) You are now allocating a huge buffer on the stack. This is not a good thing to do in the kernel (think 4k stack kernel config - that single variable is consuming about a quarter of available stack). You need to allocate the buffer dynamically. As you only need to do the allocation on entry to hfsplus_readdir() and deallocate it on exit it is not a problem as it could be if you had to allocate/free for every filename.
>>
>
> Hi Anton,
>
> Thanks for the comments.

You are welcome.

> NLS_MAX_CHARSET_SIZE is 6 in include/linux/nls.h but I think it is too generous in this case. It is correct that a unicode character needs at worst 6 bytes to code, but those in the upper range of that when encoded in UTF-16 would require a surrogate pair - i.e. it goes from *two* UTF-16 units to 6 bytes. So that's still x3, not x6. Also Unicode 2.0 covers only the first supplementary plane, and only requires up to 4 bytes. So that's what my "Surrogate pairs are no worse." part of the message was about. Please correct me if this reasoning is wrong.

Yes, I think as things stand you are correct. However, using NLS_MAX_CHARSET_SIZE has the advantage of being future proof: any changes to character handling in the kernel and/or to the Unicode standard will automatically be picked up by HFS+ when using this constant. As this is a transient buffer only allocated for the duration of the system call, it really does not matter whether it is 766 (3 * 255 + 1) or 1531 (6 * 255 + 1) bytes. In either case kmalloc() will allocate it from the respective slab (assuming the slab allocator is in use), and as both sizes are smaller than the smallest PAGE_SIZE on any architecture, allocating the bigger buffer causes no increase in memory pressure or anything else. So whilst you are technically correct, if I were writing the code I would definitely use NLS_MAX_CHARSET_SIZE.

If you decide to definitely use "3", I suggest you at least make a #define with some sensible name of your choice and also add a comment to say that it may need increasing in future if NLS handling in the kernel and/or the Unicode standard changes...

Best regards,

	Anton

> I'll switch to dynamic allocation and prepare a revised patch, after further discussion on the x3 vs x6 issue.
>
> Hin-Tak

-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge
J.J. Thomson Avenue, Cambridge, CB3 0RB, UK
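
For anyone following the x3 vs x6 question, here is a minimal userspace sketch of the arithmetic. It is not the hfsplus code. The two constants are redefined locally with the values quoted in this thread. The names HFSPLUS_UTF8_BYTES_PER_UNIT, NAME_BUF_TIGHT, NAME_BUF_SAFE and all three functions are hypothetical and exist only for illustration. The counter assumes a lenient encoder that emits 3 bytes for an unpaired surrogate; whether the kernel NLS code behaves that way is not something this sketch claims.

```c
#include <stddef.h>
#include <stdint.h>

/* Values as quoted in the thread, redefined here for userspace use. */
#define HFSPLUS_MAX_STRLEN   255  /* max UTF-16 code units per HFS+ name */
#define NLS_MAX_CHARSET_SIZE 6    /* from include/linux/nls.h, per the thread */

/*
 * Hypothetical name for the "x3" bound: worst-case UTF-8 bytes per
 * UTF-16 code unit. May need increasing if NLS handling in the kernel
 * and/or the Unicode standard changes.
 */
#define HFSPLUS_UTF8_BYTES_PER_UNIT 3

/* Hypothetical buffer sizes, each including the trailing NUL. */
#define NAME_BUF_TIGHT (HFSPLUS_UTF8_BYTES_PER_UNIT * HFSPLUS_MAX_STRLEN + 1)
#define NAME_BUF_SAFE  (NLS_MAX_CHARSET_SIZE * HFSPLUS_MAX_STRLEN + 1)

/* Count the UTF-8 bytes needed to encode n UTF-16 code units. */
size_t utf8_len_of_utf16(const uint16_t *s, size_t n)
{
    size_t bytes = 0;
    size_t i = 0;

    while (i < n) {
        uint16_t u = s[i];

        if (u < 0x80) {
            bytes += 1;               /* ASCII */
            i += 1;
        } else if (u < 0x800) {
            bytes += 2;               /* two-byte range */
            i += 1;
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;               /* valid surrogate pair: 2 units -> 4 bytes */
            i += 2;
        } else {
            bytes += 3;               /* rest of the BMP, or an unpaired
                                         surrogate (assumed lenient encoder) */
            i += 1;
        }
    }
    return bytes;
}

/* Worst case from the patch description: 255 CJK characters. */
size_t worst_case_cjk(void)
{
    uint16_t name[HFSPLUS_MAX_STRLEN];
    size_t i;

    for (i = 0; i < HFSPLUS_MAX_STRLEN; i++)
        name[i] = 0x4E2D;             /* a CJK ideograph, 3 bytes in UTF-8 */
    return utf8_len_of_utf16(name, HFSPLUS_MAX_STRLEN);
}

/* 127 surrogate pairs plus one CJK character filling the 255th unit. */
size_t worst_case_pairs(void)
{
    uint16_t name[HFSPLUS_MAX_STRLEN];
    size_t i;

    for (i = 0; i + 1 < HFSPLUS_MAX_STRLEN; i += 2) {
        name[i] = 0xD83D;             /* high surrogate */
        name[i + 1] = 0xDE00;         /* low surrogate */
    }
    name[HFSPLUS_MAX_STRLEN - 1] = 0x4E2D;
    return utf8_len_of_utf16(name, HFSPLUS_MAX_STRLEN);
}
```

Under these assumptions the all-CJK name needs 765 bytes and the surrogate-heavy name needs fewer, so both fit in NAME_BUF_TIGHT (766). NAME_BUF_SAFE (1531) is the future-proof size and is still well under a page.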