Re: git commit -a reports untracked files after a clone

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Oct 05, 2011 at 02:26:30PM +0000, Joerg Rosenkranz wrote:

> > On Fri, May 27, 2011 at 02:00:45PM -0400, Jeff King wrote:
> >   1. We load the index, and for each entry, insert it into the index's
> >      name_hash. In addition, if ignorecase is turned on, we make an
> >      entry in the name_hash for the directory (e.g., "contrib/"), which
> >      uses the following code from 5102c61's hash_index_entry_directories:
> 
> Sorry for reactivating this old thread.
> We are running in this problem too. The behavior is the same in msysgit and on 
> Linux. Your patch resolves that problem for us.
> 
> Is there any chance to drive this patch forward?

Thanks for prodding. I wrote a big analysis at the end of that thread,
but didn't get much response. I've been meaning to pick it up again.

So I spent a few minutes revisiting the topic today. I think it's
definitely a bug, and the fix is definitely correct. The patch is below,
with what I hope is a coherent analysis adapted from the previous
thread.

The only question is whether or not it's OK to add a few bytes to
"struct cache_entry" to cater to an uncommon case (see the comments at
the end of the commit message). Junio might have thoughts on that.

-- >8 --
Subject: [PATCH] fix phantom untracked files when core.ignorecase is set

When core.ignorecase is turned on and there are stale index
entries, "git commit" can sometimes report directories as
untracked, even though they contain tracked files.

You can see an example of this with:

    # make a case-insensitive repo
    git init repo && cd repo &&
    git config core.ignorecase true &&

    # with some tracked files in a subdir
    mkdir subdir &&
    > subdir/one &&
    > subdir/two &&
    git add . &&
    git commit -m base &&

    # now make the index entries stale
    touch subdir/* &&

    # and then ask commit to update those entries and show
    # us the status template
    git commit -a

which will report "subdir/"  as untracked, even though it
clearly contains two tracked files. What is happening in the
commit program is this:

  1. We load the index, and for each entry, insert it into the index's
     name_hash. In addition, if ignorecase is turned on, we make an
     entry in the name_hash for the directory (e.g., "contrib/"), which
     uses the following code from 5102c61's hash_index_entry_directories:

        hash = hash_name(ce->name, ptr - ce->name);
        if (!lookup_hash(hash, &istate->name_hash)) {
                pos = insert_hash(hash, &istate->name_hash);
		if (pos) {
			ce->next = *pos;
			*pos = ce;
		}
        }

     Note that we only add the directory entry if there is not already an
     entry.

  2. We run add_files_to_cache, which gets updated information for each
     cache entry. It helpfully inserts this information into the cache,
     which calls replace_index_entry. This in turn calls
     remove_name_hash() on the old entry, and add_name_hash() on the new
     one. But remove_name_hash doesn't actually remove from the hash, it
     only marks it as "no longer interesting" (from cache.h):

      /*
       * We don't actually *remove* it, we can just mark it invalid so that
       * we won't find it in lookups.
       *
       * Not only would we have to search the lists (simple enough), but
       * we'd also have to rehash other hash buckets in case this makes the
       * hash bucket empty (common). So it's much better to just mark
       * it.
       */
      static inline void remove_name_hash(struct cache_entry *ce)
      {
              ce->ce_flags |= CE_UNHASHED;
      }

     This is OK in the specific-file case, since the entries in the hash
     form a linked list, and we can just skip the "not here anymore"
     entries during lookup.

     But for the directory hash entry, we will _not_ write a new entry,
     because there is already one there: the old one that is actually no
     longer interesting!

  3. While traversing the directories, we end up in the
     directory_exists_in_index_icase function to see if a directory is
     interesting. This in turn checks index_name_exists, which will
     look up the directory in the index's name_hash. We see the old,
     deleted record, and assume there is nothing interesting. The
     directory gets marked as untracked, even though there are index
     entries in it.

The problem is in the code I showed above:

        hash = hash_name(ce->name, ptr - ce->name);
        if (!lookup_hash(hash, &istate->name_hash)) {
                pos = insert_hash(hash, &istate->name_hash);
		if (pos) {
			ce->next = *pos;
			*pos = ce;
		}
        }

Having a single cache entry that represents the directory is
not enough; that entry may go away if the index is changed.
It may be tempting to say that the problem is in our removal
method; if we removed the entry entirely instead of simply
marking it as "not here anymore", then we would know we need
to insert a new entry. But that only covers this particular
case of remove-replace. In the more general case, consider
something like this:

  1. We add "foo/bar" and "foo/baz" to the index. Each gets
     their own entry in name_hash, plus we make a "foo/"
     entry that points to "foo/bar".

  2. We remove the "foo/bar" entry from the index, and from
     the name_hash.

  3. We ask if "foo/" exists, and see no entry, even though
     "foo/baz" exists.

So we need that directory entry to have the list of _all_
cache entries that indicate that the directory is tracked.
So that implies making a linked list as we do for other
entries, like:

  hash = hash_name(ce->name, ptr - ce->name);
  pos = insert_hash(hash, &istate->name_hash);
  if (pos) {
	  ce->next = *pos;
	  *pos = ce;
  }

But that's not right either. In fact, it shows a second bug
in the current code, which is that the "ce->next" pointer is
supposed to be linking entries for a specific filename
entry, but here we are overwriting it for the directory
entry. So the same cache entry ends up in two linked lists,
but they share the same "next" pointer.

As it turns out, this second bug can't be triggered in the
current code. The "if (pos)" conditional is totally dead
code; pos will only be non-NULL if there was an existing
hash entry, and we already checked that there wasn't one
through our call to lookup_hash.

But fixing the first bug means taking out that call to
lookup_hash, which is going to activate the buggy dead code,
and we'll end up splicing the two linked lists together.

So we need to have a separate next pointer for the list in
the directory bucket, and we need to traverse that list in
index_name_exists when we are looking up a directory.

This bloats "struct cache_entry" by a few bytes. Which is
annoying, because it's only necessary when core.ignorecase
is enabled. There's not an easy way around it, short of
separating out the "next" pointers from cache_entry entirely
(i.e., having a separate "cache_entry_list" struct that gets
stored in the name_hash). In practice, it probably doesn't
matter; we have thousands of cache entries, compared to the
millions of objects (where adding 4 bytes to the struct
actually does impact performance).

Signed-off-by: Jeff King <peff@xxxxxxxx>
---
 cache.h     |    1 +
 name-hash.c |   15 ++++++++-------
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/cache.h b/cache.h
index 607c2ea..1334fbf 100644
--- a/cache.h
+++ b/cache.h
@@ -168,6 +168,7 @@ struct cache_entry {
 	unsigned int ce_flags;
 	unsigned char sha1[20];
 	struct cache_entry *next;
+	struct cache_entry *dir_next;
 	char name[FLEX_ARRAY]; /* more */
 };
 
diff --git a/name-hash.c b/name-hash.c
index c6b6a3f..225dd76 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -57,12 +57,10 @@ static void hash_index_entry_directories(struct index_state *istate, struct cach
 		if (*ptr == '/') {
 			++ptr;
 			hash = hash_name(ce->name, ptr - ce->name);
-			if (!lookup_hash(hash, &istate->name_hash)) {
-				pos = insert_hash(hash, ce, &istate->name_hash);
-				if (pos) {
-					ce->next = *pos;
-					*pos = ce;
-				}
+			pos = insert_hash(hash, ce, &istate->name_hash);
+			if (pos) {
+				ce->dir_next = *pos;
+				*pos = ce;
 			}
 		}
 	}
@@ -166,7 +164,10 @@ struct cache_entry *index_name_exists(struct index_state *istate, const char *na
 			if (same_name(ce, name, namelen, icase))
 				return ce;
 		}
-		ce = ce->next;
+		if (icase && name[namelen - 1] == '/')
+			ce = ce->dir_next;
+		else
+			ce = ce->next;
 	}
 
 	/*
-- 
1.7.7.37.g0e376

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]