Re: [PATCH] log: re-encode commit messages before grepping

Jeff King <peff@xxxxxxxx> writes:

>   1. I suppose we could also use $LANG or one of the $LC_* variables to
>      guess at the encoding of the user's pattern. But I think using the
>      output encoding makes the most sense, since then the pattern you
>      searched for will actually be in the output.

I agree.  In addition, if we were to do anything with LANG/LC_CTYPE,
it should be done at the layer that implements log-output-encoding
(e.g. lack of configured encoding with nonstandard LANG/LC_CTYPE
would use the locale, or something), I think.
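
As a rough sketch of that fallback idea (this is not what git does today;
the function name guess_output_encoding is made up for illustration, and
the final UTF-8 default is an assumption), the lookup order could be:
configured i18n.logOutputEncoding first, then the locale's charmap.

```shell
# Hypothetical helper: pick an output encoding, preferring git's
# configuration and falling back to the locale's character map.
guess_output_encoding () {
	# An unset key (or a missing git) leaves enc empty.
	enc=$(git config --get i18n.logOutputEncoding 2>/dev/null) || enc=
	if test -z "$enc"
	then
		# "locale charmap" reports the codeset implied by LANG/LC_CTYPE.
		enc=$(locale charmap 2>/dev/null) || enc=
	fi
	# Assumed last resort when neither source yields a value.
	printf '%s\n' "${enc:-UTF-8}"
}

guess_output_encoding
```

The point of keeping this in one place is that --grep would then simply
inherit whatever the output layer decided, with no separate guess of its
own.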

>   2. There are still problems with utf8 normalization. E.g., my tests
>      represent utf-8 é with \xc3\xa9 (the code point for that glyph),
>      but it could also be represented by \x65\xcc\x81 (e + combining
>      acute). But that is not a new problem; it is an inherent issue with
>      grepping utf8. We might in the future want to offer an option to
>      normalize utf8 (or possibly the regex library can be taught to
>      handle this).

True; in either case, this caller (or any other caller) should not
need to care.  Only grep_buffer() (actually, grep_source_1()) needs
to be taught about it.
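
To illustrate the normalization issue quoted above, here is a minimal
sketch (variable names are only for illustration) showing that the
precomposed and decomposed forms of é are different byte sequences, so a
byte-wise grep for one does not find the other:

```shell
# Precomposed é (U+00E9) versus "e" followed by a combining acute
# accent (U+0065 U+0301), both encoded as UTF-8.
composed=$(printf '\303\251')
decomposed=$(printf 'e\314\201')

# The two render the same but share no common bytes beyond ASCII, so
# searching the decomposed text for the composed pattern fails.
if printf 't%sst\n' "$decomposed" | LC_ALL=C grep -q "$composed"
then
	normalization_result=match
else
	normalization_result=no-match
fi
echo "$normalization_result"
```

Re-encoding the commit message does not change this; only normalizing
both the pattern and the haystack to the same form would.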

>   4. I'm still not clear on why "--graph --no-walk" wants to look at
>      commit_match after we have already cleared the commit buffer. I
>      agree it's nonsensical, but I wonder if it might be a symptom of an
>      underlying bug or inefficiency.

Yeah, that is something we may want to check, I agree.

The added test is also nice.  Thanks.

> diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
> new file mode 100755
> index 0000000..52a7472
> --- /dev/null
> +++ b/t/t4210-log-i18n.sh
> @@ -0,0 +1,58 @@
> +#!/bin/sh
> +
> +test_description='test log with i18n features'
> +. ./test-lib.sh
> +
> +# two forms of é
> +utf8_e=$(printf '\303\251')
> +latin1_e=$(printf '\351')
> +
> +test_expect_success 'create commits in different encodings' '
> +	test_tick &&
> +	cat >msg <<-EOF &&
> +	utf8
> +
> +	t${utf8_e}st
> +	EOF
> +	git add msg &&
> +	git -c i18n.commitencoding=utf8 commit -F msg &&
> +	cat >msg <<-EOF &&
> +	latin1
> +
> +	t${latin1_e}st
> +	EOF
> +	git add msg &&
> +	git -c i18n.commitencoding=ISO-8859-1 commit -F msg
> +'
> +
> +test_expect_success 'log --grep searches in log output encoding (utf8)' '
> +	cat >expect <<-\EOF &&
> +	latin1
> +	utf8
> +	EOF
> +	git log --encoding=utf8 --format=%s --grep=$utf8_e >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'log --grep searches in log output encoding (latin1)' '
> +	cat >expect <<-\EOF &&
> +	latin1
> +	utf8
> +	EOF
> +	git log --encoding=ISO-8859-1 --format=%s --grep=$latin1_e >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'log --grep does not find non-reencoded values (utf8)' '
> +	>expect &&
> +	git log --encoding=utf8 --format=%s --grep=$latin1_e >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
> +	>expect &&
> +	git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_done