Re: git-diff on touched files: bug or feature?

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 02 Aug 2007 12:56:19 -0700

Matthieu Moy <Matthieu.Moy@xxxxxxx> writes:

> Junio C Hamano <gitster@xxxxxxxxx> writes:
>
>> Quite honestly, a script that indiscriminately touches everybody
>> but only modifies a few is simply broken.  Think about "make".
>> "git diff" reporting many cache-dirty files is simply reminding
>> you the brokenness of such a script.
>
> I wouldn't call this "broken", but clearly suboptimal, yes. But for an
> occasionnal one-liner (perl -pi -e ... or so), I lose less time
> recompiling extra-files than I would writting a cleaner script. "make"
> has no way to detect the absence of modification, while git has.

The first part of your sentence makes sense.  You are trading
"make" time with the reduction in effort to write a throw-away
script.  That makes sense --- and you pay with a more expensive
make, but that is a valid tradeoff you choose to make.  You can
just choose to make the same tradeoff by running an extra
refresh.

You could do something like the trivial patch attached below to
squelch diff-files "cache-dirty" output, if you wanted to.

After trying to work with this modified git for some time,
however, it has become very clear that doing this and nothing
else is a stupid thing to do (I'll suggest potential
improvements at the end).  For one thing, it obviously loses the
"quick git-diff to see what I touched" benefit.  And that issue
is real, as I first touched diff-lib.c to come up with this
until I realized that doing it in here is much cleaner and
simpler --- this patch is discarding that information, and makes
writing a changelog for this patch, if it were to be included,
more expensive for the user -- that is me.

Another thing is that it loses the coal-mine canary value the
"cache-dirty" output has.  After running a loosely written bulk
regexp script, like this:

	perl -p -i -e 's/foo/bar/' *.c

the cached stat data are destroyed for all paths, and it makes
all the subsequent git worktree operations needlessly more
expensive; existing output makes me realize this situation.

It is not a stupid thing to run such a script [*1*]; in this
case, I am choosing the convenience of using such a suboptimal
(as you said) script.  But after running such a script, I DO
WANT git to tell me that I made the index suboptimal, so that I
can and should refresh it to gain the lost performance back.

Personally, I almost never run "git status".  The command is
there primarily because other systems had a command called
"status", and migrant wondered why we didn't.  We do not need
it, and we do not have to use it.

 * For getting the status so-far, I use "git diff" (and "git
   diff ." when in a subdirectory).  It is "how have I changed
   things?", and that is when it is very useful to know "ah, I
   originally went in that direction but decided against it and
   did it differently" by the cache-dirtiness output.  The
   coal-mine canary value is also felt here;

 * For a quick final status, "git diff --stat" is much simpler
   to read than "git status", and that is what I use.  It is
   "what have I changed overall?".  As this actually counts the
   real changes, you would not see those null changes that come
   from the cache dirtiness you are complaining about;

 * When I am ready to commit, "git status" output comes in the
   commit log editor.  At that point, I already have been
   reminded by the previous "git diff" (which is usually run
   often unless the patch is very small like this), and I do not
   need cache-dirtiness output anymore after making my commit.

You do not have to say, to the above paragraph, that it is
different from your workflow.  I am showing what the opmimum
workflow would be, and it is up to you not to listen to me.

Even if we were willing to lose the "quick git-diff to see what
I touched" benefit and want to go the route of the attached
patch suggests (which I personally do not think we are), there
should be a way to either (1) tell the user that many paths are
found to be cache-dirty and it is a good idea to refresh the
stat information, after squelching diff-files output this way,
or (2) update the stat information automatically when diff-files
finds too many paths are cache-dirty (perhaps without even
telling the user).  The latter requires you to declare that
git-diff is not a read-only operation anymore, though, so you
would need some thought before going in that direction.


[Footnote]

*1* Some people might argue that "perl -i" should have a mode to
leave the original if the script does not modify the contents.
Maybe they are right, but that is an orthogonal issue.

---

 diff.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/diff.c b/diff.c
index a5fc56b..ea1239d 100644
--- a/diff.c
+++ b/diff.c
@@ -3143,6 +3143,45 @@ static void diffcore_apply_filter(const char *filter)
 	*q = outq;
 }
 
+static void diffcore_remove_empty(void)
+{
+	int i;
+	struct diff_queue_struct *q = &diff_queued_diff;
+	struct diff_queue_struct outq;
+	outq.queue = NULL;
+	outq.nr = outq.alloc = 0;
+
+	for (i = 0; i < q->nr; i++) {
+		struct diff_filepair *p = q->queue[i];
+
+		/*
+		 * 1. Keep the ones that cannot be diff-files
+		 *    "false" match that are only queued due to
+		 *    cache dirtyness.
+		 *
+		 * 2. Modified, same size and mode, and the object
+		 *    name of one side is unknown.  If they do not
+		 *    have identical contents, keep them.
+		 *    They are different.
+		 */
+		if ((p->status != DIFF_STATUS_MODIFIED) || /* (1) */
+		    (p->one->sha1_valid && p->two->sha1_valid) ||
+		    (p->one->mode != p->two->mode) ||
+
+		    diff_populate_filespec(p->one, 1) || /* (2) */
+		    diff_populate_filespec(p->two, 1) ||
+		    (p->one->size != p->two->size) ||
+		    diff_populate_filespec(p->one, 0) ||
+		    diff_populate_filespec(p->two, 0) ||
+		    memcmp(p->one->data, p->two->data, p->one->size))
+			diff_q(&outq, p);
+		else
+			diff_free_filepair(p);
+	}
+	free(q->queue);
+	*q = outq;
+}
+
 void diffcore_std(struct diff_options *options)
 {
 	if (options->quiet)
@@ -3160,6 +3199,7 @@ void diffcore_std(struct diff_options *options)
 		diffcore_order(options->orderfile);
 	diff_resolve_rename_copy();
 	diffcore_apply_filter(options->filter);
+	diffcore_remove_empty();
 
 	options->has_changes = !!diff_queued_diff.nr;
 }

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html