Re: [PATCH] hfsplus: fix decomposition of Hangul characters

Viacheslav Dubeyko <slava@xxxxxxxxxxx> · Tue, 28 Nov 2017 08:30:54 -0800

On Tue, 2017-11-28 at 12:02 -0300, Ernesto A. Fernández wrote:
> On Mon, Nov 27, 2017 at 02:40:53PM -0800, Viacheslav Dubeyko wrote:
> > 
> > On Mon, 2017-11-27 at 16:36 -0300, Ernesto A. Fernández wrote:
> > > 
> > > It would be good to have some feedback on the use of code under
> > > the
> > > Unicode License, I don't know if that can be a problem.
> > > 
> > > To Ting-Chang Hou: This patch is slightly different from the one
> > > you
> > > tested with your mac. You are still tagged as Tested-by, but you
> > > may
> > > want to check again that I have not broken anything.
> > > 
> > > Thanks,
> > > Ernest
> > > 
> > > -- >8 --
> > > Subject: [PATCH] hfsplus: fix decomposition of Hangul characters
> > > 
> > > Files created under macOS become essentially unusable under linux
> > > if
> > > their names contain any Hangul, and vice versa.
> > > 
> > > This happens because the normalization of Hangul characters is a
> > > special
> > > case: it can be done algorithmically, without need of a table.
> > > The
> > > hfsplus module deals with Hangul correctly when composing, but
> > > completely forgets about it when performing a decomposition.
> > > 
> > > Solve this by using the Hangul decomposition function provided in
> > > the
> > > Unicode Standard. It's under the Unicode License, compatible with
> > > the
> > > GPL.
> > > 
> > I believe it makes sense to share more details about algorithm
> > details.
> > I think that the Unicode standard contains some explanation how to
> > decompose correctly. It will be great to have some brief citation
> > in the
> > patch description.
> The algorithm is very simple, the best way to understand it is just
> looking at the code. I don't know the first thing about Korean
> writing, so
> I don't think I should attempt to explain why the decomposition is
> done
> this way. If somebody else is interested in the details, they can
> follow
> the citation in the header comment of the decompose_hangul function.
> 

You should share the link where it can be found. Finally, the comment
needs in more detailed explanation.

> > 
> > > 
> > > This patch will cause trouble for Hangul filenames already
> > > created by
> > > the module in the past. This shouldn't be a special concern
> > > because the
> > > main purpose of the module was always sharing with macOS. If a
> > > user
> > > actually needs to access such a file the nodecompose mount option
> > > should
> > > be enough.
> > > 
> > > Reported-by: Ting-Chang Hou <tchou@xxxxxxxxxxxx>
> > > Tested-by: Ting-Chang Hou <tchou@xxxxxxxxxxxx>
> > > Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.
> > > com>
> > > ---
> > >  fs/hfsplus/unicode.c | 62
> > > +++++++++++++++++++++++++++++++++++++++++++++++-----
> > >  1 file changed, 56 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/fs/hfsplus/unicode.c b/fs/hfsplus/unicode.c
> > > index dfa90c2..7ceae91 100644
> > > --- a/fs/hfsplus/unicode.c
> > > +++ b/fs/hfsplus/unicode.c
> > > @@ -272,8 +272,8 @@ static inline int asc2unichar(struct
> > > super_block *sb, const char *astr, int len,
> > >  	return size;
> > >  }
> > >  
> > > -/* Decomposes a single unicode character. */
> > > -static inline u16 *decompose_unichar(wchar_t uc, int *size)
> > > +/* Decomposes a non-Hangul unicode character. */
> > > +static u16 *decompose_nonhangul(wchar_t uc, int *size)
> > >  {
> > >  	int off;
> > >  
> > > @@ -296,6 +296,51 @@ static inline u16 *decompose_unichar(wchar_t
> > > uc, int *size)
> > >  	return hfsplus_decompose_table + (off / 4);
> > >  }
> > >  
> > > +/*
> > > + * Decomposes a Hangul unicode character.
> > > + *
> > > + * This function was adapted from sample code in section 3.12 of
> > > the
> > > + * Unicode Standard, version 10.0.
> > > + *
> > > + * Copyright (C) 1991-2017 Unicode, Inc.  All rights
> > > reserved.  Distributed
> > > + * under the Terms of Use in http://www.unicode.org/copyright.ht
> > > ml.
> > > + */
> > > +static int decompose_hangul(wchar_t uc, u16 *result)
> > > +{
> > > +	int index;
> > > +	int l, v, t;
> > > +
> > > +	index = uc - Hangul_SBase;
> > > +	if (index < 0 || index >= Hangul_SCount)
> > > +		return 0;
> > > +
> > > +	l = Hangul_LBase + index / Hangul_NCount;
> > > +	v = Hangul_VBase + (index % Hangul_NCount) /
> > > Hangul_TCount;
> > > +	t = Hangul_TBase + index % Hangul_TCount;
> > > +
> > > +	result[0] = l;
> > > +	result[1] = v;
> > > +	if (t != Hangul_TBase) {
> > > +		result[2] = t;
> > > +		return 3;
> > > +	}
> > > +	return 2;
> > > +}
> > > +
> > > +/* Decomposes a single unicode character. */
> > > +static u16 *decompose_unichar(wchar_t uc, int *size, u16
> > > *hangul_buffer)
> > > +{
> > > +	u16 *result;
> > > +
> > > +	/* Hangul is handled separately */
> > > +	result = hangul_buffer;
> > > +	*size = decompose_hangul(uc, result);
> > Is it possible to distinguish the different cases beforehand and to
> > select one proper function for execution without the execution both
> > ones
> > in the worst case?
> The decompose_hangul and decompose_nonhangul functions are called
> only once,
> so I would expect the compiler to inline them both. The result should
> be
> similar to your suggestion.
> 

It will be much better to have:

If (hangful) {
    <decompose_hangul>
} else {
    <decompose_nonhangul>
}

Currently, the code looks weird. Especially, decompose_hangul() is able
to distinguish two different cases. So, it's easy to extract the check.

> > 
> > > 
> > > +	if (*size == 0)
> > > +		/* not Hangul */
> > > +		result = decompose_nonhangul(uc, size);
> > I think it makes sense to use the brackets here. If I remember
> > correctly, this is the code style's requirement.
> I couldn't find such a requirement, but I did find some other places
> within
> the hfsplus code where the comments are done my way. For example, in
> the
> hfsplus_cat_write_inode() function of the inode.c file.
> 

The hfsplus code contains many awful places. It doesn't mean that it
needs to follow the bad style examples. If you have only one string:

if (condition)
    <do_something>

then it's OK not to use the brackets. Otherwise, you should include the
branch into the brackets.

Thanks,
Vyacheslav Dubeyko.