Re: [PATCH] hfsplus: fix decomposition of Hangul characters

Ernesto A. Fernández <ernesto.mnd.fernandez@xxxxxxxxx> · Tue, 28 Nov 2017 12:02:35 -0300

On Mon, Nov 27, 2017 at 02:40:53PM -0800, Viacheslav Dubeyko wrote:
> On Mon, 2017-11-27 at 16:36 -0300, Ernesto A. Fernández wrote:
> > It would be good to have some feedback on the use of code under the
> > Unicode License, I don't know if that can be a problem.
> > 
> > To Ting-Chang Hou: This patch is slightly different from the one you
> > tested with your mac. You are still tagged as Tested-by, but you may
> > want to check again that I have not broken anything.
> > 
> > Thanks,
> > Ernest
> > 
> > -- >8 --
> > Subject: [PATCH] hfsplus: fix decomposition of Hangul characters
> > 
> > Files created under macOS become essentially unusable under linux if
> > their names contain any Hangul, and vice versa.
> > 
> > This happens because the normalization of Hangul characters is a special
> > case: it can be done algorithmically, without need of a table. The
> > hfsplus module deals with Hangul correctly when composing, but
> > completely forgets about it when performing a decomposition.
> > 
> > Solve this by using the Hangul decomposition function provided in the
> > Unicode Standard. It's under the Unicode License, compatible with the
> > GPL.
> > 
> 
> I believe it makes sense to share more details about algorithm details.
> I think that the Unicode standard contains some explanation how to
> decompose correctly. It will be great to have some brief citation in the
> patch description.

The algorithm is very simple, the best way to understand it is just
looking at the code. I don't know the first thing about Korean writing, so
I don't think I should attempt to explain why the decomposition is done
this way. If somebody else is interested in the details, they can follow
the citation in the header comment of the decompose_hangul function.

> > This patch will cause trouble for Hangul filenames already created by
> > the module in the past. This shouldn't be a special concern because the
> > main purpose of the module was always sharing with macOS. If a user
> > actually needs to access such a file the nodecompose mount option should
> > be enough.
> > 
> > Reported-by: Ting-Chang Hou <tchou@xxxxxxxxxxxx>
> > Tested-by: Ting-Chang Hou <tchou@xxxxxxxxxxxx>
> > Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@xxxxxxxxx>
> > ---
> >  fs/hfsplus/unicode.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 56 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/hfsplus/unicode.c b/fs/hfsplus/unicode.c
> > index dfa90c2..7ceae91 100644
> > --- a/fs/hfsplus/unicode.c
> > +++ b/fs/hfsplus/unicode.c
> > @@ -272,8 +272,8 @@ static inline int asc2unichar(struct super_block *sb, const char *astr, int len,
> >  	return size;
> >  }
> >  
> > -/* Decomposes a single unicode character. */
> > -static inline u16 *decompose_unichar(wchar_t uc, int *size)
> > +/* Decomposes a non-Hangul unicode character. */
> > +static u16 *decompose_nonhangul(wchar_t uc, int *size)
> >  {
> >  	int off;
> >  
> > @@ -296,6 +296,51 @@ static inline u16 *decompose_unichar(wchar_t uc, int *size)
> >  	return hfsplus_decompose_table + (off / 4);
> >  }
> >  
> > +/*
> > + * Decomposes a Hangul unicode character.
> > + *
> > + * This function was adapted from sample code in section 3.12 of the
> > + * Unicode Standard, version 10.0.
> > + *
> > + * Copyright (C) 1991-2017 Unicode, Inc.  All rights reserved.  Distributed
> > + * under the Terms of Use in http://www.unicode.org/copyright.html.
> > + */
> > +static int decompose_hangul(wchar_t uc, u16 *result)
> > +{
> > +	int index;
> > +	int l, v, t;
> > +
> > +	index = uc - Hangul_SBase;
> > +	if (index < 0 || index >= Hangul_SCount)
> > +		return 0;
> > +
> > +	l = Hangul_LBase + index / Hangul_NCount;
> > +	v = Hangul_VBase + (index % Hangul_NCount) / Hangul_TCount;
> > +	t = Hangul_TBase + index % Hangul_TCount;
> > +
> > +	result[0] = l;
> > +	result[1] = v;
> > +	if (t != Hangul_TBase) {
> > +		result[2] = t;
> > +		return 3;
> > +	}
> > +	return 2;
> > +}
> > +
> > +/* Decomposes a single unicode character. */
> > +static u16 *decompose_unichar(wchar_t uc, int *size, u16 *hangul_buffer)
> > +{
> > +	u16 *result;
> > +
> > +	/* Hangul is handled separately */
> > +	result = hangul_buffer;
> > +	*size = decompose_hangul(uc, result);
> 
> Is it possible to distinguish the different cases beforehand and to
> select one proper function for execution without the execution both ones
> in the worst case?

The decompose_hangul and decompose_nonhangul functions are called only once,
so I would expect the compiler to inline them both. The result should be
similar to your suggestion.

> > +	if (*size == 0)
> > +		/* not Hangul */
> > +		result = decompose_nonhangul(uc, size);
> 
> I think it makes sense to use the brackets here. If I remember
> correctly, this is the code style's requirement.

I couldn't find such a requirement, but I did find some other places within
the hfsplus code where the comments are done my way. For example, in the
hfsplus_cat_write_inode() function of the inode.c file.

> Thanks,
> Vyacheslav Dubeyko.

Thanks,
Ernest

> > +	return result;
> > +}
> > +
> >  int hfsplus_asc2uni(struct super_block *sb,
> >  		    struct hfsplus_unistr *ustr, int max_unistr_len,
> >  		    const char *astr, int len)
> > @@ -303,13 +348,14 @@ int hfsplus_asc2uni(struct super_block *sb,
> >  	int size, dsize, decompose;
> >  	u16 *dstr, outlen = 0;
> >  	wchar_t c;
> > +	u16 dhangul[3];
> >  
> >  	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
> >  	while (outlen < max_unistr_len && len > 0) {
> >  		size = asc2unichar(sb, astr, len, &c);
> >  
> >  		if (decompose)
> > -			dstr = decompose_unichar(c, &dsize);
> > +			dstr = decompose_unichar(c, &dsize, &dhangul[0]);
> >  		else
> >  			dstr = NULL;
> >  		if (dstr) {
> > @@ -344,6 +390,7 @@ int hfsplus_hash_dentry(const struct dentry *dentry, struct qstr *str)
> >  	unsigned long hash;
> >  	wchar_t c;
> >  	u16 c2;
> > +	u16 dhangul[3];
> >  
> >  	casefold = test_bit(HFSPLUS_SB_CASEFOLD, &HFSPLUS_SB(sb)->flags);
> >  	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
> > @@ -357,7 +404,7 @@ int hfsplus_hash_dentry(const struct dentry *dentry, struct qstr *str)
> >  		len -= size;
> >  
> >  		if (decompose)
> > -			dstr = decompose_unichar(c, &dsize);
> > +			dstr = decompose_unichar(c, &dsize, &dhangul[0]);
> >  		else
> >  			dstr = NULL;
> >  		if (dstr) {
> > @@ -396,6 +443,7 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
> >  	const char *astr1, *astr2;
> >  	u16 c1, c2;
> >  	wchar_t c;
> > +	u16 dhangul_1[3], dhangul_2[3];
> >  
> >  	casefold = test_bit(HFSPLUS_SB_CASEFOLD, &HFSPLUS_SB(sb)->flags);
> >  	decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags);
> > @@ -413,7 +461,8 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
> >  			len1 -= size;
> >  
> >  			if (decompose)
> > -				dstr1 = decompose_unichar(c, &dsize1);
> > +				dstr1 = decompose_unichar(c, &dsize1,
> > +							  &dhangul_1[0]);
> >  			if (!decompose || !dstr1) {
> >  				c1 = c;
> >  				dstr1 = &c1;
> > @@ -427,7 +476,8 @@ int hfsplus_compare_dentry(const struct dentry *dentry,
> >  			len2 -= size;
> >  
> >  			if (decompose)
> > -				dstr2 = decompose_unichar(c, &dsize2);
> > +				dstr2 = decompose_unichar(c, &dsize2,
> > +							  &dhangul_2[0]);
> >  			if (!decompose || !dstr2) {
> >  				c2 = c;
> >  				dstr2 = &c2;
> 
>