Jeff King venit, vidit, dixit 05.02.2013 12:13: > On Mon, Feb 04, 2013 at 04:27:31PM +0100, Michael J Gruber wrote: > >> Recently and not so recently, we made sure that log/grep type operations >> use textconv filters when a userfacing diff would do the same: >> >> ef90ab6 (pickaxe: use textconv for -S counting, 2012-10-28) >> b1c2f57 (diff_grep: use textconv buffers for add/deleted files, 2012-10-28) >> 0508fe5 (combine-diff: respect textconv attributes, 2011-05-23) >> >> "git grep" currently does not use textconv filters at all, that is >> neither for displaying the match and context nor for the actual grepping. >> >> Introduce a binary mode "--textconv" (in addition to "--text" and "-I") >> which makes git grep use any configured textconv filters for grepping >> and output purposes. > > Sounds like a reasonable goal. > >> The difficulty is in getting the different cases (blob/sha1 vs. >> worktree) right, and in making the changes minimally invasive. It seems >> that some more refactoring could help: "git show --textconv" does not >> use textconv filters when used on blobs either. (It does for diffs, of >> course.) Most existing helper functions are tailored for diffs. > > I think "git show" with blobs originally did not because we have no > filename with which to look up the attributes. IIRC, the patches to > support "cat-file --textconv" taught get_sha1_with_context to report the > path at which we found a blob. I suspect it is mostly a matter of > plumbing that information from the revision parser through to > show_blob_object. > >> Nota bene: --textconv does not affect "diff --stat" either... > > Yeah, though I wonder if it should be on by default. The diffstat for a > binary file, unlike the diff, is already useful. The diffstat of the > textconv'd data may _also_ be useful, of course. > >> @@ -659,6 +659,9 @@ int cmd_grep(int argc, const char **argv, const char *prefix) >> OPT_SET_INT('I', NULL, &opt.binary, >> N_("don't match patterns in binary files"), >> GREP_BINARY_NOMATCH), >> + OPT_SET_INT(0, "textconv", &opt.binary, >> + N_("process binary files with textconv filters"), >> + GREP_BINARY_TEXTCONV), > > Is this really a new form of GREP_BINARY_*? What happens when a file > does not have a textconv filter? > > I would expect this to be more like diff's "--textconv" flag, which is > really "allow textconv to be respected". Then you could do: > > git grep -I --textconv foo > > to grep in the text version of files which support it, and ignore the > rest. > >> -static int grep_source_load(struct grep_source *gs); >> -static int grep_source_is_binary(struct grep_source *gs); >> +static int grep_source_load(struct grep_source *gs, struct grep_opt *opt); >> +static int grep_source_is_binary(struct grep_source *gs, struct grep_opt *opt); > > Hmm. grep_source_load is more or less the analogue of > diff_populate_filespec, which does not know about textconv at all. So I > feel like this might be going in at the wrong layer... > >> @@ -1354,14 +1356,15 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle >> >> switch (opt->binary) { >> case GREP_BINARY_DEFAULT: >> - if (grep_source_is_binary(gs)) >> + if (grep_source_is_binary(gs, opt)) >> binary_match_only = 1; >> break; >> case GREP_BINARY_NOMATCH: >> - if (grep_source_is_binary(gs)) >> + if (grep_source_is_binary(gs, opt)) >> return 0; /* Assume unmatch */ >> break; > > The is_binary function learned about "opt" so that it could pass it to > grep_source_load, which might do the textconv for us. But we _know_ that > we will not here, because we see that we have other GREP_BINARY flags. > > And when we do have _TEXTCONV: > >> case GREP_BINARY_TEXT: >> + case GREP_BINARY_TEXTCONV: >> break; > > We do not call is_binary. :) > >> -static int grep_source_load_sha1(struct grep_source *gs) >> +static int grep_source_load_sha1(struct grep_source *gs, struct grep_opt *opt) >> { >> enum object_type type; >> - >> grep_read_lock(); >> - gs->buf = read_sha1_file(gs->identifier, &type, &gs->size); >> + if (opt->binary == GREP_BINARY_TEXTCONV) { >> + struct diff_filespec *df = alloc_filespec(gs->name); >> + gs->size = fill_textconv(gs->driver, df, &gs->buf); >> + free_filespec(df); >> + } else { >> + gs->buf = read_sha1_file(gs->identifier, &type, &gs->size); >> + } > > So here we do the textconv for the sha1 case. But what about file > sources? > > This is why I think the layer is wrong; you want the fill_textconv > function to call your abstract _load function (in the diff_filespec > world, it is diff_filespec_populate, but it is the moral equivalent). > And you want it to hold off as long as possible in case we can pull the > value from cache, or feed the working tree version of a file straight to > the filter. > >> -void grep_source_load_driver(struct grep_source *gs) >> +void grep_source_load_driver(struct grep_source *gs, struct grep_opt *opt) >> { >> if (gs->driver) >> return; >> >> - grep_attr_lock(); >> + grep_attr_lock(); //TODO >> + printf("Looking up userdiff driver for: %s", gs->path); >> if (gs->path) >> gs->driver = userdiff_find_by_path(gs->path); >> if (!gs->driver) >> gs->driver = userdiff_find_by_name("default"); >> + if (opt->binary == GREP_BINARY_TEXTCONV) >> + gs->driver = userdiff_get_textconv(gs->driver); >> grep_attr_unlock(); >> } > > This is wrong. The point of userdiff_get_textconv is that it will return > NULL when we are not doing textconv for this path. So you can use it > like: > > struct userdiff_driver *textconv = userdiff_get_textconv(gs->driver); > > if (textconv) { > /* ok, we are doing textconv. Call our fill_textconv > * equivalent. */ > } > else { > /* nope, plain old file. */ > } > > But by assigning it on top of gs->driver, you're going to end up with a > NULL driver sometimes. And the post-condition of the load_driver > function is that gs->driver always points to a valid driver (even if it > is the default one). I wouldn't be surprised if this causes segfaults. > > > So I would do it more like the patch below. Only lightly tested by me. > There are some refactoring opportunities if you want to bring > grep_source and diff_filespec closer together. > > --- > diff --git a/builtin/grep.c b/builtin/grep.c > index 8025964..915c8ef 100644 > --- a/builtin/grep.c > +++ b/builtin/grep.c > @@ -659,6 +659,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix) > OPT_SET_INT('I', NULL, &opt.binary, > N_("don't match patterns in binary files"), > GREP_BINARY_NOMATCH), > + OPT_BOOL(0, "textconv", &opt.allow_textconv, > + N_("process binary files with textconv filters")), > { OPTION_INTEGER, 0, "max-depth", &opt.max_depth, N_("depth"), > N_("descend at most <depth> levels"), PARSE_OPT_NONEG, > NULL, 1 }, > diff --git a/grep.c b/grep.c > index 4bd1b8b..3880d64 100644 > --- a/grep.c > +++ b/grep.c > @@ -2,6 +2,8 @@ > #include "grep.h" > #include "userdiff.h" > #include "xdiff-interface.h" > +#include "diff.h" > +#include "diffcore.h" > > static int grep_source_load(struct grep_source *gs); > static int grep_source_is_binary(struct grep_source *gs); > @@ -1321,6 +1323,58 @@ static void std_output(struct grep_opt *opt, const void *buf, size_t size) > fwrite(buf, size, 1, stdout); > } > > +static int fill_textconv_grep(struct userdiff_driver *driver, > + struct grep_source *gs) > +{ > + struct diff_filespec *df; > + char *buf; > + size_t size; > + > + if (!driver || !driver->textconv) > + return grep_source_load(gs); > + > + /* > + * The textconv interface is intimately tied to diff_filespecs, so we > + * have to pretend to be one. If we could unify the grep_source > + * and diff_filespec structs, this mess could just go away. > + */ > + df = alloc_filespec(gs->path); > + switch (gs->type) { > + case GREP_SOURCE_SHA1: > + fill_filespec(df, gs->identifier, 1, 0100644); > + break; > + case GREP_SOURCE_FILE: > + fill_filespec(df, null_sha1, 0, 0100644); > + break; > + default: > + die("BUG: attempt to textconv something without a path?"); > + } > + > + /* > + * fill_textconv is not remotely thread-safe; it may load objects > + * behind the scenes, and it modifies the global diff tempfile > + * structure. > + */ > + grep_read_lock(); > + size = fill_textconv(driver, df, &buf); > + grep_read_unlock(); > + free_filespec(df); > + > + /* > + * The normal fill_textconv usage by the diff machinery would just keep > + * the textconv'd buf separate from the diff_filespec. But much of the > + * grep code passes around a grep_source and assumes that its "buf" > + * pointer is the beginning of the thing we are searching. So let's > + * install our textconv'd version into the grep_source, taking care not > + * to leak any existing buffer. > + */ > + grep_source_clear_data(gs); > + gs->buf = buf; > + gs->size = size; > + > + return 0; > +} > + > static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int collect_hits) > { > char *bol; > @@ -1331,6 +1385,7 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle > unsigned count = 0; > int try_lookahead = 0; > int show_function = 0; > + struct userdiff_driver *textconv = NULL; > enum grep_context ctx = GREP_CONTEXT_HEAD; > xdemitconf_t xecfg; > > @@ -1352,19 +1407,36 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle > } > opt->last_shown = 0; > > - switch (opt->binary) { > - case GREP_BINARY_DEFAULT: > - if (grep_source_is_binary(gs)) > - binary_match_only = 1; > - break; > - case GREP_BINARY_NOMATCH: > - if (grep_source_is_binary(gs)) > - return 0; /* Assume unmatch */ > - break; > - case GREP_BINARY_TEXT: > - break; > - default: > - die("bug: unknown binary handling mode"); > + if (opt->allow_textconv) { > + grep_source_load_driver(gs); > + /* > + * We might set up the shared textconv cache data here, which > + * is not thread-safe. > + */ > + grep_attr_lock(); > + textconv = userdiff_get_textconv(gs->driver); > + grep_attr_unlock(); > + } > + > + /* > + * We know the result of a textconv is text, so we only have to care > + * about binary handling if we are not using it. > + */ > + if (!textconv) { > + switch (opt->binary) { > + case GREP_BINARY_DEFAULT: > + if (grep_source_is_binary(gs)) > + binary_match_only = 1; > + break; > + case GREP_BINARY_NOMATCH: > + if (grep_source_is_binary(gs)) > + return 0; /* Assume unmatch */ > + break; > + case GREP_BINARY_TEXT: > + break; > + default: > + die("bug: unknown binary handling mode"); > + } > } > > memset(&xecfg, 0, sizeof(xecfg)); > @@ -1372,7 +1444,7 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle > > try_lookahead = should_lookahead(opt); > > - if (grep_source_load(gs) < 0) > + if (fill_textconv_grep(textconv, gs) < 0) > return 0; > > bol = gs->buf; > diff --git a/grep.h b/grep.h > index 8fc854f..94a7ac2 100644 > --- a/grep.h > +++ b/grep.h > @@ -106,6 +106,7 @@ struct grep_opt { > #define GREP_BINARY_NOMATCH 1 > #define GREP_BINARY_TEXT 2 > int binary; > + int allow_textconv; > int extended; > int use_reflog_filter; > int pcre; > Thanks Jeff, that helps a lot! It covers "grep expr" and "grep expr rev -- path" just fine. I'll look into "grep expr rev:path" which does not work yet because of an empty driver. I also have "show --textconv" covered and a suggestion for "cat-file --textconv" (to work without a textconv filter). Expect a mini-series soon :) Michael -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html