[PATCH] Racy git: avoid having to be always too careful

Junio C Hamano <junkio@xxxxxxx> · Sat, 05 Aug 2006 04:20:26 -0700

Immediately after a bulk checkout, most of the paths in the
working tree would have the same timestamp as the index file,
and this would force ce_match_stat() to take slow path for all
of them.  When writing an index file out, if many of the paths
have very new (read: the same timestamp as the index file being
written out) timestamp, we are better off delaying the return
from the command, to make sure that later command to touch the
working tree files will leave newer timestamps than recorded in
the index, thereby avoiding to take the slow path.

Signed-off-by: Junio C Hamano <junkio@xxxxxxx>
---

Linus Torvalds <torvalds@xxxxxxxx> writes:

> Doing an "strace" on "git diff" shows that we close() a file descriptor 
> twice (getting EBADFD on the second one) when we end up in ce_compare_data 
> if the index does not match the checked-out stat information.
>
> The "index_fd()" function will already have closed the fd for us, so we 
> should not close it again.
>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxx>
> ---
>
> The way I found this also showed a potential performance problem: if you 
> do a "git reset --hard" (or similar) after you have changes in your tree, 
> it will write the index file with the same timestamp as the checked out 
> files that it re-wrote.
>
> That will also then forever afterwards (well, until the next "git 
> update-index --refresh") cause the "uncommon" timestamp case in 
> ce_match_stat(), where we check the index-file timestamp against the 
> timestamp of the stat data, to trigger.
>
> Not very good. The "ce_modified_check_fs()" tests can be quite expensive 
> if you have lots of those files because we end up then calling the 
> "ce_compare_data()" function a lot. And suddenly "git diff" doesn't take a 
> tenth of a second any more.
>
> We should really try to have some way to re-generate the index 
> automatically when this case triggers, so that we only need to do it 
> _once_ rather than keep doing it forever while the index is "potentially 
> stale".
>
> Any ideas?

Here is what I came up with.  I am not very happy with its
magic, but should be better than nothing.

 read-cache.c |   44 ++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index f92cdaa..ce76c20 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -5,6 +5,7 @@
  */
 #include "cache.h"
 #include "cache-tree.h"
+#include <time.h>
 
 /* Index extensions.
  *
@@ -923,7 +924,7 @@ static void ce_smudge_racily_clean_entry
 		 * $ echo filfre >nitfol
 		 * $ git-update-index --add nitfol
 		 *
-		 * but it does not.  Whe the second update-index runs,
+		 * but it does not.  When the second update-index runs,
 		 * it notices that the entry "frotz" has the same timestamp
 		 * as index, and if we were to smudge it by resetting its
 		 * size to zero here, then the object name recorded
@@ -945,7 +946,9 @@ int write_cache(int newfd, struct cache_
 {
 	SHA_CTX c;
 	struct cache_header hdr;
-	int i, removed;
+	int i, removed, recent;
+	struct stat st;
+	time_t now;
 
 	for (i = removed = 0; i < entries; i++)
 		if (!cache[i]->ce_mode)
@@ -959,15 +962,19 @@ int write_cache(int newfd, struct cache_
 	if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0)
 		return -1;
 
+	now = fstat(newfd, &st) ? 0 : st.st_mtime;
+	recent = 0;
 	for (i = 0; i < entries; i++) {
 		struct cache_entry *ce = cache[i];
+		time_t entry_time = (time_t) ntohl(ce->ce_mtime.sec);
 		if (!ce->ce_mode)
 			continue;
-		if (index_file_timestamp &&
-		    index_file_timestamp <= ntohl(ce->ce_mtime.sec))
+		if (index_file_timestamp && index_file_timestamp <= entry_time)
 			ce_smudge_racily_clean_entry(ce);
 		if (ce_write(&c, newfd, ce, ce_size(ce)) < 0)
 			return -1;
+		if (now && now <= entry_time)
+			recent++;
 	}
 
 	/* Write extension data here */
@@ -983,5 +990,34 @@ int write_cache(int newfd, struct cache_
 			return -1;
 		}
 	}
+
+	/*
+	 * To prevent later ce_match_stat() from always falling into
+	 * check_fs(), if we have too many entries that can trigger
+	 * racily clean check, we are better off delaying the return.
+	 * We arbitrarily say if more than 20 paths or 25% of total
+	 * paths are very new, we delay the return until the index
+	 * file gets a new timestamp.
+	 *
+	 * NOTE! NOTE! NOTE!
+	 *
+	 * This assumes that nobody is touching the working tree while
+	 * we are updating the index.
+	 */
+	if (20 < recent || entries <= recent * 4) {
+		now = fstat(newfd, &st) ? 0 : st.st_mtime;
+		while (now && !fstat(newfd, &st) && st.st_mtime <= now) {
+			struct timespec rq, rm;
+			off_t where = lseek(newfd, 0, SEEK_CUR);
+			rq.tv_sec = 0;
+			rq.tv_nsec = 250000000;
+			nanosleep(&rq, &rm);
+			if ((where == (off_t) -1) ||
+			    (write(newfd, "", 1) != 1) ||
+			    (lseek(newfd, -1, SEEK_CUR) != where) ||
+			    ftruncate(newfd, where))
+				break;
+		}
+	}
 	return ce_flush(&c, newfd);
 }
-- 
1.4.2.rc3.g19a8a


-
: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html