Re: [PATCH] Don't search files with an unset "grep" attribute

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Conrad Irwin <conrad.irwin@xxxxxxxxx> writes:

>> in order to avoid uselessly spewing garbage at you while reminding you
>> that the command is not showing the whole truth and you _might_ want to
>> look into the elided file if you wanted to with "grep -a world hello.o".
>> Compared to this, it feels that the result your patch goes too far and
>> hides the truth a bit too much to my taste. Maybe it is just me.
>
> I used to use this approach, hooking into the "diff" attribute directly to mark
> a file as binary, however that was clearly a hack.

After thinking about this a bit more, I have to say I disagree that it is
a hack.

I agree that the issue you are trying to address is real, but I think the
approach taken by the patch is flawed on three counts:

 - You cannot do "grep -a" equivalent once you set your "-grep" attribute;

 - The user has flexibility to set "diff" and "grep" independently, which
   is an unnecessary complication [*1*]; and

 - The user does not get any hint that potential hits are elided, like
   normal "grep" would do.

I also think we can fix the above problems, while simplifying the external
interface visible to the end users.

So let's step back a bit and take a look at the handling of files for
which you do not want to see patch output and/or you do not want to see
grep hits, in a fictional but realisitic use scenario.

I am building a small embedded applicance that links with a binary blob
firmware, xyzzy.so, supplied by an upstream vendor, and this blob is
updated from time to time. In order to keep track of what binary blob went
into which product release, this binary blob is stored in the repository
together with the source files. Because it is useless to view binary
changes in "git log -p" output, I would naturally mark this as "binary".

Now, is it realistic to expect that I might want to see "grep" hits from
this binary blob? I do not think so [*2*].

When I say "xyzzy.so is a binary" in my .gitattributes file, according to
the current documentation, I am saying that "do not run diff, and also do
not run EOL conversion on this file" [*3*], but from the end user's point
of view, it is natural that I am entitled to expect much more than that.
Without having to know how many different kinds of textual operations the
SCM I happen to be using supports, I am telling it that I do not want to
see _any_ textual operations done on it. If a new version of "git grep"
learns to honor the attributes system, I want my "this is binary" to be
considered by the improved "git grep".

If you think about it this way from the very high level description of the
problem, aka, end user's point of view, it is fairly clear that tying the
"binary" attribute to "git grep" to allow us to override the built-in
buffer_is_binary() call you see in grep.c gives the most intuitive result,
without forcing the user to do anything more than they are already doing.

Because "binary" decomposes to "-diff" and "-text", we have two choices.
We _could_ say "-text" should be the one that overrides buffer_is_binary()
autodetection, but using "-diff" instead for that purpose opens the door
for us to give our users even more intuitive and consistent experience.

Suppose that this binary blob firmware came with an API manual formatted
in PDF, xyzzy.pdf, also supplied by the vendor. It is also kept in the
repository, but again, running textual diff between updated versions of
the PDF documentation would not be very useful. I however may have a
textconv filter defined for it to grab the text by running pdf2ascii.

Now if my "git show --textconv xyzzy.pdf" has an output line that says a
string "XYZZY API 2.0" was added to the current version, wouldn't it be
natural for me to expect that "git grep --textconv 'XYZZY API' xyzzy.pdf"
to find it [*4*]?

As an added bonus, if you truly want to omit _any_ hit from "git grep" for
your minified.js files, you can easily do so. Just define a textconv
filter that yields an empty string for them, and "grep --textconv" won't
produce any matches, not even "Binary file ... matches".

So I am inclined to think that reusing the infrastructure we already have
for the `diff` attribute as much as possible would be a better design for
this new feature you are proposing.

The name of the attribute `diff` is somewhat unfortunate, but the end user
interface to the feature at the highest level is already called "binary"
attribute (macro), and our low-level documentation can give precise
meaning of the attribute while alluding to the origin of the name so that
intelligent readers would understand it pretty easily (see attached
patch).

Besides, we already tell the users in "Marking files as binary" section to
mark them with "-diff" in the attributes file if they want their contents
treated as binary.


[Footnotes]

*1* An unused flexibility like this does nothing useful, other than
forcing the user to flip two attributes always as a pair, when one should
suffice.

*2* The essence of my previous message was "why it is insufficient to mark
them binary, i.e. uninteresting for the purpose of all textual operations
not just grep but also for diff." which was unanswered.

*3* This is because "binary" is defined as an attribute macro that expands
to "-diff -text".

*4* This also suggests that the infrastructure we already have for driving
"git diff" is a good match for "git grep".  The "funcname" patterns, which
"git grep -p" already can borrow and use from "diff" infrastructure, is
another indication that using `diff` is a good match for "git grep".


 Documentation/gitattributes.txt |   15 ++++++++++++---
 1 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index a85b187..63844c4 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -388,9 +388,18 @@ Generating diff text
 `diff`
 ^^^^^^
 
-The attribute `diff` affects how 'git' generates diffs for particular
-files. It can tell git whether to generate a textual patch for the path
-or to treat the path as a binary file.  It can also affect what line is
+The attribute `diff` affects if `git` treats the contents of file as text
+or binary. Historically, `git diff` and its friends were the first to use
+this attribute, hence its name, but the attribute governs textual
+operations in general, including `git grep`.
+
+For `git diff` and its friends, differences in contents marked as text
+produces a textual patch, while differences in non-text are reported only
+as "Binary files differ". For `git grep`, matches in contents marked as
+text are reported by showing lines that contain them, while matches in
+non-text are reported only as "Binary file ... matches".
+
+This attribute can also affect what line is
 shown on the hunk header `@@ -k,l +n,m @@` line, tell git to use an
 external command to generate the diff, or ask git to convert binary
 files to a text format before generating the diff.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]