Re: [PATCH 4/5] xdiff: introduce XDF_IGNORE_CASE

Jakub Narebski <jnareb@xxxxxxxxx> · Wed, 22 Feb 2012 10:07:56 -0800 (PST)

Junio C Hamano <gitster@xxxxxxxxx> writes:

> Teach the hash function and per-line comparison logic to compare lines
> while ignoring the differences in case.  It is not an ignore-whitespace
> option but still needs to trigger the inexact match logic, and that is
> why the previous step introduced XDF_INEXACT_MATCH mask.

NB: how does this compare with ignoring case in filesystem paths?

> Assign the 7th bit for this option, and move the bits to select diff
> algorithms out of the way in order to leave room for a few bits to add
> more variants of ignore-whitespace, such as --ignore-tab-expansion, if
> somebody else is inclined to do so later.

Or implement a proper Unicode sorting / collation algorithm, with its
different comparison levels

  (UTS #10, 4.3 "Form a sort key for each string"):

     Level 1: alphabetic ordering
     Level 2: diacritic ordering
     Level 3: case ordering
     Level 4: tie-breaking (e.g. in the case when variable is 'shifted')

> We would still need to teach the front-end to flip this bit, for this
> change to be any useful.
> 
> Signed-off-by: Junio C Hamano <gitster@xxxxxxxxx>
> ---

> +static inline int match_a_byte(char ch1, char ch2, long flags)
> +{
> +	if (ch1 == ch2)
> +		return 1;
> +	if (!(flags & XDF_IGNORE_CASE) || ((ch1 | ch2) & 0x80))
> +		return 0;
> +	if (isupper(ch1))
> +		ch1 = tolower(ch1);
> +	if (isupper(ch2))
> +		ch2 = tolower(ch2);
> +	return (ch1 == ch2);
> +}

<del>
Wouldn't a better solution be a collate algorithm rather than changing
a sorting function?  Or is it a performance hack on typical body of
text under version control (mainly lowercase)?
</del>

"(libc.info)Collation Functions" says:

     The functions `strcoll' and `wcscoll' perform this translation
  implicitly, in order to do one comparison.  By contrast, `strxfrm' and
  `wcsxfrm' perform the mapping explicitly.  If you are making multiple
  comparisons using the same string or set of strings, it is likely to be
  more efficient to use `strxfrm' or `wcsxfrm' to transform all the
  strings just once, and subsequently compare the transformed strings
  with `strcmp' or `wcscmp'.

The function match_a_byte (memcoll?) defined here plays a role similar
to strcoll; do we compare a single line with more than one other line?
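
For reference, here is a minimal sketch of the "transform once, compare
many times" pattern the manual describes.  The names demo_make_sort_key
and demo_sign are made up for this illustration and are not part of
xdiff or of the patch; only strxfrm, strcoll and strcmp are real libc
calls.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): build a sort key for `s`
 * in the current LC_COLLATE locale.  Returns a malloc'ed buffer that
 * the caller must free, or NULL if allocation fails.
 */
static char *demo_make_sort_key(const char *s)
{
	/* With a zero size strxfrm only reports the length it needs. */
	size_t need = strxfrm(NULL, s, 0) + 1;
	char *key = malloc(need);

	if (key)
		strxfrm(key, s, need);
	return key;
}

/* Reduce a comparison result to -1, 0 or 1 so two results can be compared. */
static int demo_sign(int x)
{
	return (x > 0) - (x < 0);
}
```

Comparing two such keys with strcmp should order the strings the same
way strcoll orders the originals, which is why the key is worth
computing once when a string takes part in many comparisons.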

> +static inline unsigned long hash_a_byte(const char ch_, long flags)
> +{
> +	unsigned long ch = ch_ & 0xFF;
> +	if ((flags & XDF_IGNORE_CASE) && !(ch & 0x80) && isupper(ch))
> +		ch = tolower(ch);
> +	return ch;
> +}
> +

Hmmm... hash_a_byte (memxfrm?) is similar to strxfrm, so you do use
one or the other...
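
To spell out the analogy: whatever folding the comparison applies, the
hash has to apply too, otherwise two lines that compare equal could
land in different hash buckets.  Below is a rough standalone sketch of
that invariant.  DEMO_IGNORE_CASE and the demo_* functions are
stand-ins invented for this mail (the flag value is arbitrary, and the
hash is a plain multiply-by-33 loop, not the one xdiff uses).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in flag; NOT the real XDF_IGNORE_CASE value. */
#define DEMO_IGNORE_CASE (1L << 0)

/* Fold one byte: ASCII upper case becomes lower case, the rest is kept. */
static unsigned long demo_fold_byte(char ch_, long flags)
{
	unsigned long ch = (unsigned char)ch_;

	if ((flags & DEMO_IGNORE_CASE) && ch < 0x80 && isupper((int)ch))
		ch = (unsigned long)tolower((int)ch);
	return ch;
}

/* Two lines match when they have equal length and equal folded bytes. */
static int demo_lines_match(const char *a, size_t alen,
			    const char *b, size_t blen, long flags)
{
	size_t i;

	if (alen != blen)
		return 0;
	for (i = 0; i < alen; i++)
		if (demo_fold_byte(a[i], flags) != demo_fold_byte(b[i], flags))
			return 0;
	return 1;
}

/* Hash the folded bytes, so matching lines always hash identically. */
static unsigned long demo_hash_line(const char *s, size_t len, long flags)
{
	unsigned long h = 5381;
	size_t i;

	for (i = 0; i < len; i++)
		h = h * 33 + demo_fold_byte(s[i], flags);
	return h;
}
```

Because both functions go through the same demo_fold_byte, "Hello" and
"hELLO" match and hash alike when the flag is set, and stop matching
when it is clear.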

-- 
Jakub Narebski
