On Mon, Jan 23, 2012 at 10:59:45PM -0800, Junio C Hamano wrote: > Conrad Irwin <conrad.irwin@xxxxxxxxx> writes: > > I used to use this approach, hooking into the "diff" attribute directly to mark > > a file as binary, however that was clearly a hack. > > After thinking about this a bit more, I have to say I disagree that it is > a hack. I kind of agree. The biggest problem is that the name is wrong. The "diff.*.command" option really is about generating a diff between two blobs of a certain type. But "diff.*.textconv" and "diff.*.binary" are really just attributes of the file, and may or may not have to do with generating a diff. Ditto for diff.*.funcname, I think. You argue, and I agree, that if we are talking about attributes of the files and not diff-specific things, then other parts of git can and should make use of that information. So if this was all spelled: $ cat .gitattributes *.pdf filetype=pdf $ cat .git/config [filetype "pdf"] binary = true textconv = pdf2txt I think it would be a no-brainer that those type attributes should apply to "git grep". So maybe the first step on this path would be to introduce something like "filetype" as a new attribute, have "diff" respect its settings, and recommend people set up filetypes as appropriate. Or maybe that just makes things more confusing in the long run, and we are better off simply accepting that the name is slightly misleading. But either way, it seems clear that git should be respecting gitattributes at the very least to mark files as binary (and I think we already use funcname patterns; textconv is a slightly stickier subject, so I'll address that below). But what I'm not sure I agree with is that the idea of "I don't want to include path X in my grep" maps to "just mark the file as binary". For example, in git we carry around a lot of code in compat/ that comes from other places. I generally don't want to see grep results from compat/nedmalloc/, because that isn't git code. But should I mark everything in compat/nedmalloc as binary? I don't think so. I _do_ want to see changes in nedmalloc in "git log" or "git diff". They don't bother me because they're infrequent, and we still want to produce regular text patches for the list when they come up. But because nedmalloc contains a lot of lines of code (even though they don't change a lot), it happens to produce a lot of uninteresting matches when grepping. > - The user has flexibility to set "diff" and "grep" independently, which > is an unnecessary complication [*1*]; and In the nedmalloc case, if we are to have "grep" and "diff" attributes that behave similarly, you potentially do want to set them differently. It would be nice to be able to treat them differently in the cases you wanted to, but not _have_ to do so. Attribute macros can almost implement this. You could add "-grep" to binary. But you can't (as far as I know) do "macro=foo" to handle arbitrary diff drivers. I suspect we could extend the rules to allow macros that take an argument and pass it to their constituent attributes. However, I think this is the wrong road to go down. You would want macros like this _if_ you have grep and diff attributes that basically do the same thing (e.g., marking a file as binary for diff versus binary for grep). But I think that's a wrong road to go down. More likely a file is binary or it is not binary, and the problem is conflating "file is binary" with "I do not usually want to grep this file". I'd much rather see grep inherit diff's file attributes unconditionally (whether we still call them "diff" or not), and add a grep attribute that is about "usually this is worth grepping". And then have a tri-state command-line option for grep to either include uninteresting things, exclude them, or to give terse output for them (mention that there were matches in foo.c, but not each one). Probably defaulting to terse output. That makes the case you presented work out of the box: things marked as binary for diffing look binary to grep, and we give the usual terse "binary file matches" output. The user doesn't have to do anything. For more complex cases like nedmalloc, you can still achieve the "this is text, but it's usually boring" behavior. And if you really want to do a thorough grep, you can just "git grep --exclude=none". > So let's step back a bit and take a look at the handling of files for > which you do not want to see patch output and/or you do not want to see > grep hits, in a fictional but realisitic use scenario. > [...] I think this is a nice user story, and it fits in with your suggested git behavior. But I think there are other stories, too, like the nedmalloc one. And it would be nice to make them work without hurting the simplicity of the case you mentioned. > If you think about it this way from the very high level description of the > problem, aka, end user's point of view, it is fairly clear that tying the > "binary" attribute to "git grep" to allow us to override the built-in > buffer_is_binary() call you see in grep.c gives the most intuitive result, > without forcing the user to do anything more than they are already doing. This is not a complaint about the core of your point, but rather an aside that should be considered: how many people are really using the binary macro attribute? Personally, I never use it, because when I mark something with a "diff" attribute, it is because I am telling git about a specific diff driver (usually textconv). Otherwise I don't bother setting attributes at all, because git's binary detection tends to be good. This leaves me without setting "-text", of course, but I don't generally care because I don't do CRLF conversion at all. So I think it is worth considering not just people setting "binary", but how users of just "diff" (both "-diff" and "diff=foo") will want things to work. > Suppose that this binary blob firmware came with an API manual formatted > in PDF, xyzzy.pdf, also supplied by the vendor. It is also kept in the > repository, but again, running textual diff between updated versions of > the PDF documentation would not be very useful. I however may have a > textconv filter defined for it to grab the text by running pdf2ascii. > > Now if my "git show --textconv xyzzy.pdf" has an output line that says a > string "XYZZY API 2.0" was added to the current version, wouldn't it be > natural for me to expect that "git grep --textconv 'XYZZY API' xyzzy.pdf" > to find it [*4*]? This is an interesting concept. As a user of textconv, I already have some specialized grep tools for matching inside binary files (e.g., I have a tool that greps within exif tags of images). But being able to do so with "git grep", and even at an arbitrary revision, is kind of neat. I would worry about turning it on by default, since the results could be misleading. In particular, your pattern "foo" might be in the binary file but not in the textconv'd version, leading you to think you had found all instances of "foo" but had not (or much more subtle, things like line breaks really matter during the conversion if you are going to be using "grep -C"). Making it available by "--textconv" seems reasonable to me, though. The only inconsistency is that it's on by default for "git show", but would not be for "git grep". Perhaps I am being overly paranoid on the "misleading" bit above. It seems to me that grep has the room to be a lot more subtle, because an omission from the output is considered "did not match". But you could construct equally weird scenarios for "git show" (e.g., you changed "foo" to "bar" but that part did not appear in the textconv portion. Which is really a quality-of-implementation issue for your textconv filter). > As an added bonus, if you truly want to omit _any_ hit from "git grep" for > your minified.js files, you can easily do so. Just define a textconv > filter that yields an empty string for them, and "grep --textconv" won't > produce any matches, not even "Binary file ... matches". Clever. But then you will never ever see a diff for that file, either, because we will consider all changes to be empty (actually, I didn't check, but you may get the diff header without any content, similar to the stat-dirty entries). -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html