Re: [WIP/RFH/RFD/PATCH] grep: allow to use textconv filters

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Jeff King venit, vidit, dixit 05.02.2013 12:13:
> On Mon, Feb 04, 2013 at 04:27:31PM +0100, Michael J Gruber wrote:
> 
>> Recently and not so recently, we made sure that log/grep type operations
>> use textconv filters when a userfacing diff would do the same:
>>
>> ef90ab6 (pickaxe: use textconv for -S counting, 2012-10-28)
>> b1c2f57 (diff_grep: use textconv buffers for add/deleted files, 2012-10-28)
>> 0508fe5 (combine-diff: respect textconv attributes, 2011-05-23)
>>
>> "git grep" currently does not use textconv filters at all, that is
>> neither for displaying the match and context nor for the actual grepping.
>>
>> Introduce a binary mode "--textconv" (in addition to "--text" and "-I")
>> which makes git grep use any configured textconv filters for grepping
>> and output purposes.
> 
> Sounds like a reasonable goal.
> 
>>     The difficulty is in getting the different cases (blob/sha1 vs.
>>     worktree) right, and in making the changes minimally invasive. It seems
>>     that some more refactoring could help: "git show --textconv" does not
>>     use textconv filters when used on blobs either. (It does for diffs, of
>>     course.) Most existing helper functions are tailored for diffs.
> 
> I think "git show" with blobs originally did not because we have no
> filename with which to look up the attributes. IIRC, the patches to
> support "cat-file --textconv" taught get_sha1_with_context to report the
> path at which we found a blob. I suspect it is mostly a matter of
> plumbing that information from the revision parser through to
> show_blob_object.
> 
>>     Nota bene: --textconv does not affect "diff --stat" either...
> 
> Yeah, though I wonder if it should be on by default. The diffstat for a
> binary file, unlike the diff, is already useful. The diffstat of the
> textconv'd data may _also_ be useful, of course.
> 
>> @@ -659,6 +659,9 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>>  		OPT_SET_INT('I', NULL, &opt.binary,
>>  			N_("don't match patterns in binary files"),
>>  			GREP_BINARY_NOMATCH),
>> +		OPT_SET_INT(0, "textconv", &opt.binary,
>> +			N_("process binary files with textconv filters"),
>> +			GREP_BINARY_TEXTCONV),
> 
> Is this really a new form of GREP_BINARY_*? What happens when a file
> does not have a textconv filter?
> 
> I would expect this to be more like diff's "--textconv" flag, which is
> really "allow textconv to be respected". Then you could do:
> 
>   git grep -I --textconv foo
> 
> to grep in the text version of files which support it, and ignore the
> rest.
> 
>> -static int grep_source_load(struct grep_source *gs);
>> -static int grep_source_is_binary(struct grep_source *gs);
>> +static int grep_source_load(struct grep_source *gs, struct grep_opt *opt);
>> +static int grep_source_is_binary(struct grep_source *gs, struct grep_opt *opt);
> 
> Hmm. grep_source_load is more or less the analogue of
> diff_populate_filespec, which does not know about textconv at all. So I
> feel like this might be going in at the wrong layer...
> 
>> @@ -1354,14 +1356,15 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
>>  
>>  	switch (opt->binary) {
>>  	case GREP_BINARY_DEFAULT:
>> -		if (grep_source_is_binary(gs))
>> +		if (grep_source_is_binary(gs, opt))
>>  			binary_match_only = 1;
>>  		break;
>>  	case GREP_BINARY_NOMATCH:
>> -		if (grep_source_is_binary(gs))
>> +		if (grep_source_is_binary(gs, opt))
>>  			return 0; /* Assume unmatch */
>>  		break;
> 
> The is_binary function learned about "opt" so that it could pass it to
> grep_source_load, which might do the textconv for us. But we _know_ that
> we will not here, because we see that we have other GREP_BINARY flags.
> 
> And when we do have _TEXTCONV:
> 
>>  	case GREP_BINARY_TEXT:
>> +	case GREP_BINARY_TEXTCONV:
>>  		break;
> 
> We do not call is_binary. :)
> 
>> -static int grep_source_load_sha1(struct grep_source *gs)
>> +static int grep_source_load_sha1(struct grep_source *gs, struct grep_opt *opt)
>>  {
>>  	enum object_type type;
>> -
>>  	grep_read_lock();
>> -	gs->buf = read_sha1_file(gs->identifier, &type, &gs->size);
>> +	if (opt->binary == GREP_BINARY_TEXTCONV) {
>> +		struct diff_filespec *df = alloc_filespec(gs->name);
>> +		gs->size = fill_textconv(gs->driver, df, &gs->buf);
>> +		free_filespec(df);
>> +	} else {
>> +		gs->buf = read_sha1_file(gs->identifier, &type, &gs->size);
>> +	}
> 
> So here we do the textconv for the sha1 case. But what about file
> sources?
> 
> This is why I think the layer is wrong; you want the fill_textconv
> function to call your abstract _load function (in the diff_filespec
> world, it is diff_filespec_populate, but it is the moral equivalent).
> And you want it to hold off as long as possible in case we can pull the
> value from cache, or feed the working tree version of a file straight to
> the filter.
> 
>> -void grep_source_load_driver(struct grep_source *gs)
>> +void grep_source_load_driver(struct grep_source *gs, struct grep_opt *opt)
>>  {
>>  	if (gs->driver)
>>  		return;
>>  
>> -	grep_attr_lock();
>> +	grep_attr_lock(); //TODO
>> +	printf("Looking up userdiff driver for: %s", gs->path);
>>  	if (gs->path)
>>  		gs->driver = userdiff_find_by_path(gs->path);
>>  	if (!gs->driver)
>>  		gs->driver = userdiff_find_by_name("default");
>> +	if (opt->binary == GREP_BINARY_TEXTCONV)
>> +		gs->driver = userdiff_get_textconv(gs->driver);
>>  	grep_attr_unlock();
>>  }
> 
> This is wrong. The point of userdiff_get_textconv is that it will return
> NULL when we are not doing textconv for this path. So you can use it
> like:
> 
>   struct userdiff_driver *textconv = userdiff_get_textconv(gs->driver);
> 
>   if (textconv) {
>           /* ok, we are doing textconv. Call our fill_textconv
>            * equivalent. */
>   }
>   else {
>           /* nope, plain old file. */
>   }
> 
> But by assigning it on top of gs->driver, you're going to end up with a
> NULL driver sometimes. And the post-condition of the load_driver
> function is that gs->driver always points to a valid driver (even if it
> is the default one). I wouldn't be surprised if this causes segfaults.
> 
> 
> So I would do it more like the patch below. Only lightly tested by me.
> There are some refactoring opportunities if you want to bring
> grep_source and diff_filespec closer together.
> 
> ---
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 8025964..915c8ef 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -659,6 +659,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>  		OPT_SET_INT('I', NULL, &opt.binary,
>  			N_("don't match patterns in binary files"),
>  			GREP_BINARY_NOMATCH),
> +		OPT_BOOL(0, "textconv", &opt.allow_textconv,
> +			 N_("process binary files with textconv filters")),
>  		{ OPTION_INTEGER, 0, "max-depth", &opt.max_depth, N_("depth"),
>  			N_("descend at most <depth> levels"), PARSE_OPT_NONEG,
>  			NULL, 1 },
> diff --git a/grep.c b/grep.c
> index 4bd1b8b..3880d64 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -2,6 +2,8 @@
>  #include "grep.h"
>  #include "userdiff.h"
>  #include "xdiff-interface.h"
> +#include "diff.h"
> +#include "diffcore.h"
>  
>  static int grep_source_load(struct grep_source *gs);
>  static int grep_source_is_binary(struct grep_source *gs);
> @@ -1321,6 +1323,58 @@ static void std_output(struct grep_opt *opt, const void *buf, size_t size)
>  	fwrite(buf, size, 1, stdout);
>  }
>  
> +static int fill_textconv_grep(struct userdiff_driver *driver,
> +			      struct grep_source *gs)
> +{
> +	struct diff_filespec *df;
> +	char *buf;
> +	size_t size;
> +
> +	if (!driver || !driver->textconv)
> +		return grep_source_load(gs);
> +
> +	/*
> +	 * The textconv interface is intimately tied to diff_filespecs, so we
> +	 * have to pretend to be one. If we could unify the grep_source
> +	 * and diff_filespec structs, this mess could just go away.
> +	 */
> +	df = alloc_filespec(gs->path);
> +	switch (gs->type) {
> +	case GREP_SOURCE_SHA1:
> +		fill_filespec(df, gs->identifier, 1, 0100644);
> +		break;
> +	case GREP_SOURCE_FILE:
> +		fill_filespec(df, null_sha1, 0, 0100644);
> +		break;
> +	default:
> +		die("BUG: attempt to textconv something without a path?");
> +	}
> +
> +	/*
> +	 * fill_textconv is not remotely thread-safe; it may load objects
> +	 * behind the scenes, and it modifies the global diff tempfile
> +	 * structure.
> +	 */
> +	grep_read_lock();
> +	size = fill_textconv(driver, df, &buf);
> +	grep_read_unlock();
> +	free_filespec(df);
> +
> +	/*
> +	 * The normal fill_textconv usage by the diff machinery would just keep
> +	 * the textconv'd buf separate from the diff_filespec. But much of the
> +	 * grep code passes around a grep_source and assumes that its "buf"
> +	 * pointer is the beginning of the thing we are searching. So let's
> +	 * install our textconv'd version into the grep_source, taking care not
> +	 * to leak any existing buffer.
> +	 */
> +	grep_source_clear_data(gs);
> +	gs->buf = buf;
> +	gs->size = size;
> +
> +	return 0;
> +}
> +
>  static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int collect_hits)
>  {
>  	char *bol;
> @@ -1331,6 +1385,7 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
>  	unsigned count = 0;
>  	int try_lookahead = 0;
>  	int show_function = 0;
> +	struct userdiff_driver *textconv = NULL;
>  	enum grep_context ctx = GREP_CONTEXT_HEAD;
>  	xdemitconf_t xecfg;
>  
> @@ -1352,19 +1407,36 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
>  	}
>  	opt->last_shown = 0;
>  
> -	switch (opt->binary) {
> -	case GREP_BINARY_DEFAULT:
> -		if (grep_source_is_binary(gs))
> -			binary_match_only = 1;
> -		break;
> -	case GREP_BINARY_NOMATCH:
> -		if (grep_source_is_binary(gs))
> -			return 0; /* Assume unmatch */
> -		break;
> -	case GREP_BINARY_TEXT:
> -		break;
> -	default:
> -		die("bug: unknown binary handling mode");
> +	if (opt->allow_textconv) {
> +		grep_source_load_driver(gs);
> +		/*
> +		 * We might set up the shared textconv cache data here, which
> +		 * is not thread-safe.
> +		 */
> +		grep_attr_lock();
> +		textconv = userdiff_get_textconv(gs->driver);
> +		grep_attr_unlock();
> +	}
> +
> +	/*
> +	 * We know the result of a textconv is text, so we only have to care
> +	 * about binary handling if we are not using it.
> +	 */
> +	if (!textconv) {
> +		switch (opt->binary) {
> +		case GREP_BINARY_DEFAULT:
> +			if (grep_source_is_binary(gs))
> +				binary_match_only = 1;
> +			break;
> +		case GREP_BINARY_NOMATCH:
> +			if (grep_source_is_binary(gs))
> +				return 0; /* Assume unmatch */
> +			break;
> +		case GREP_BINARY_TEXT:
> +			break;
> +		default:
> +			die("bug: unknown binary handling mode");
> +		}
>  	}
>  
>  	memset(&xecfg, 0, sizeof(xecfg));
> @@ -1372,7 +1444,7 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
>  
>  	try_lookahead = should_lookahead(opt);
>  
> -	if (grep_source_load(gs) < 0)
> +	if (fill_textconv_grep(textconv, gs) < 0)
>  		return 0;
>  
>  	bol = gs->buf;
> diff --git a/grep.h b/grep.h
> index 8fc854f..94a7ac2 100644
> --- a/grep.h
> +++ b/grep.h
> @@ -106,6 +106,7 @@ struct grep_opt {
>  #define GREP_BINARY_NOMATCH	1
>  #define GREP_BINARY_TEXT	2
>  	int binary;
> +	int allow_textconv;
>  	int extended;
>  	int use_reflog_filter;
>  	int pcre;
> 

Thanks Jeff, that helps a lot! It covers "grep expr" and "grep expr rev
-- path" just fine. I'll look into "grep expr rev:path" which does not
work yet because of an empty driver.

I also have "show --textconv" covered and a suggestion for "cat-file
--textconv" (to work without a textconv filter).

Expect a mini-series soon :)

Michael
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]