Re: [PATCH] grep -I: do not bother to read known-binary files

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Stepan Kasal <kasal@xxxxxx> writes:

> From: Johannes Schindelin <johannes.schindelin@xxxxxx>
> Date: Mon, 8 Nov 2010 16:10:43 +0100
>
> Incidentally, this makes grep -I respect the "binary" attribute (actually,
> the "-text" attribute, but "binary" implies that).
>
> Since the attributes are not thread-safe, we now need to switch off
> threading if -I was passed.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@xxxxxx>
> Signed-off-by: Stepan Kasal <kasal@xxxxxx>
> ---
>
> Hi,
> this patch has been in msysgit for 3.5 years.
> Stepan

I do not think checking 'text' is the right way to do this.  The
attribute controls the eof normalization, and people sometimes want
to keep CRLF terminated files in the repository no matter what the
platform is (an example I heard is a sample protocol exchange over
textual network protocol such as HTTP and SMTP), and the way to do
so is to unset it.  That would still let them look for patterns in
and compare the changes to these paths.

Looking for "Marking files as binary" in gitattributes(5) gives us a
more authoritative alternative, I think.  In short:

 - If 'diff' is Unset, or
 - If 'diff' is Set to X and diff.X.binary is true

then the contents are not suitable for human consumption.  I haven't
thought things through to declare that "grep -I" would want to treat
a PostScript file (which is an example in that section) as a binary
file and refrain from trying to find substrings in it, but my gut
feeling is that we probably should let it look inside such a file,
so replacing 'text' with 'diff' in this patch and doing nothing
about diff.*.binary would be the way to go.  Given that the posted
patch was older than 3.5 years, perhaps it needs updating to adhere
the advice given in ab435611 (docs: explain diff.*.binary option,
2011-01-09).

By the way, I wonder if the patch is about performance, implied by
"do not bother" (to waste time looking inside), though.  Is it an
overall win to avoid looking inside "-diff" files, if that requires
you to use only 1 core and leaving the other 7 idle?

I think "the user tells us it is binary with attribute, and the user
tells us not to look into binary with -I" is a lot better rationale
to do this change , and that would characterise it as a correctness
patch, not a performance patch.

Thanks.


>  builtin/grep.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
>
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 43af5b7..8073fbe 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -18,6 +18,7 @@
>  #include "quote.h"
>  #include "dir.h"
>  #include "pathspec.h"
> +#include "attr.h"
>  
>  static char const * const grep_usage[] = {
>  	N_("git grep [options] [-e] <pattern> [<rev>...] [[--] <path>...]"),
> @@ -163,6 +164,22 @@ static void work_done(struct work_item *w)
>  	grep_unlock();
>  }
>  
> +static int skip_binary(struct grep_opt *opt, const char *filename)
> +{
> +	if ((opt->binary & GREP_BINARY_NOMATCH)) {
> +		static struct git_attr *attr_text;
> +		struct git_attr_check check;
> +
> +		if (!attr_text)
> +			attr_text = git_attr("text");
> +		memset(&check, 0, sizeof(check));
> +		check.attr = attr_text;
> +		return !git_check_attr(filename, 1, &check) &&
> +				ATTR_FALSE(check.value);
> +	}
> +	return 0;
> +}
> +
>  static void *run(void *arg)
>  {
>  	int hit = 0;
> @@ -173,6 +190,9 @@ static void *run(void *arg)
>  		if (!w)
>  			break;
>  
> +		if (skip_binary(opt, (const char *)w->source.identifier))
> +			continue;
> +
>  		opt->output_priv = w;
>  		hit |= grep_source(opt, &w->source);
>  		grep_source_clear_data(&w->source);
> @@ -379,6 +399,9 @@ static int grep_cache(struct grep_opt *opt, const struct pathspec *pathspec, int
>  			continue;
>  		if (!ce_path_match(ce, pathspec, NULL))
>  			continue;
> +		if (skip_binary(opt, ce->name))
> +			continue;
> +
>  		/*
>  		 * If CE_VALID is on, we assume worktree file and its cache entry
>  		 * are identical, even if worktree file has been modified, so use
> @@ -803,6 +826,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>  		string_list_append(&path_list, show_in_pager);
>  		use_threads = 0;
>  	}
> +	if ((opt.binary & GREP_BINARY_NOMATCH))
> +		use_threads = 0;
>  
>  	if (!opt.pattern_list)
>  		die(_("no pattern given."));
> -- 
> 1.9.2.msysgit.0.335.gd2a461f
>
> -- 
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]