Re: [PATCH v3] merge: fix cache_entry use-after-free

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 19 Oct 2015 19:13:51 -0700

David Turner <dturner@xxxxxxxxxxxxxxxx> writes:

> The problem is not the name_hash, but with the dir_hash.  The dir_hash
> is only even populated on case-insensitive filesystems (which is why you
> and Linus don't see this bug).  The dir_hash is not designed to catch
> d/f conflicts, but rather case conflicts (one side of a merge has
> foo/bar; the other has FOO/bar).  For each directory, it maintains a
> pointer to a cache_entry.  Then add_to_index uses that cache_entry to 
> rewrite incoming entries' parent directories to the expected case.
> Weirdly, that part of add_to_index is, so far as I can tell, never
> called during a merge.  So where's are we using the freed value?
> Probably in dir_entry_cmp, while adding another entry to the hashmap;
> that's where our crash seems to have happened.  And our failure depended
> on the details of the hash function as well; if a certain collision did
> not happen, we would not see it.
>
> There is, I think, another way to handle this: we could *copy* the path
> into the struct dir_entry instead of pointing to a cache_entry.  But
> this seems like a bunch of extra work; the refcounting is simpler.  

This codepath is a mess.  Whoever wrote hash_dir_entry() seems to
have already known that the code is buggy by leaving this comment
there:

 * Note that the cache_entry stored with the dir_entry merely
 * supplies the name of the directory (up to dir_entry.namelen). We
 * track the number of 'active' files in a directory in dir_entry.nr,
 * so we can tell if the directory is still relevant, e.g. for git
 * status. However, if cache_entries are removed, we cannot pinpoint
 * an exact cache_entry that's still active. It is very possible that
 * multiple dir_entries point to the same cache_entry.

If you add "a/frotz" to the index, the code will somehow create a
dir_entry to represent that the canonical case for directory "a/" is
such, so that when you later try to add "A/nitfol", the code can
grab the canonical path "a/" from dir_entry hash and turn it into
"a/nitfol".  That is what happens in add_to_index in read-cache.c

But what happens if, after adding "a/frotz" and "a/nitfol" to the
index like that, you remove "a/frotz"?  I do not see any code that
notices the poitner to "a/frotz" will become stale and replace it
with a pointer to "a/nitfol" that is still in the system.  The next
time you try to add "a/xyzzy", find_dir_entry() will try to see if
there is an existing entry that matches "a/xyzzy"'s directory, and
one of the dir_entries has a stale pointer to ce for "a/frotz" that
is already gone.

I think the root cause of the bug is the external interface to the
index_dir_exists() function.  For the above processing, there is no
reason for dir_entry() to hold a pointer to ce for "a/frotz" or
"a/nitfol".  All it needs to have is an embedded string in the
structure itself, i.e.

        struct dir_entry {
                struct hashmap_entry ent;
                struct dir_entry *parent;
                int nr;
                unsigned int namelen;
                char name[FLEX_ARRAY];
        };

The caller of index_dir_exists() in add_to_index() can be adjusted
to work with a new function in name-hash.c that does this part:

	const char *startPtr = ce->name;
	const char *ptr = startPtr;
	while (*ptr) {
		while (*ptr && *ptr != '/')
			++ptr;
		if (*ptr == '/') {
			struct cache_entry *foundce;
			++ptr;
			foundce = index_dir_exists(istate, ce->name, ptr - ce->name - 1);
			if (foundce) {
				memcpy((void *)startPtr, foundce->name + (startPtr - ce->name), ptr - startPtr);
				startPtr = ptr;
			}
		}
	}

Perhaps name it "adjust_dirname_case(istate, ce->name)" and have it
do the whole while loop above, all inside name-hash.c.  That would
make this caller a lot easier to read and understand what is going
on.

The other one, directory_exists_in_index_icase() in dir.c, expects
that a ce is returned, but the way it uses the returned value does
not really require the function to return a ce.  It does look at the
ce->ce_mode but that is only because the way index_dir_exists()
tells its caller if there is a directory (hence some files inside
it) or if there is a submodule (hence there is nothing below it) by
making a fallback call internally to index_file_exists() and returns
the ce for gitlink only when (1) there is no directory in dir_hash
and (2) file_hash as a submodule in that path.

A more cleaner helper function to give what this caller really wants
to know that name-hash.c can provide would be a function that takes
pathname and len and returns an enum: "there is nothing there",
"there is a directory", or "there is a submodule".  For completeness
of the API, even though this sole caller does not need it, we could
throw in "there is a regular file" there, too, if we wanted to.

If I were to clean this up, I would probably use the above updated
definition of struct dir_entry and:

 * drop the fallback "return gitlink if there is one" from
   index_dir_exists(), and make index_dir_exists() return 0 (false)
   or 1 (true).

 * have directory_exists_in_index() first call the updated
   index_dir_exists() via the cache_dir_exists() convenience macro.
   If it returned true, it should return index_direcgtory.  If it
   returned false, make a call to cache_file_exists() to see if
   there is a gitlink and rturn index_gitdir when that is the case.
   When all of the above fails, return index_nonexistent.

That way, we do not have to worry about "this ce no longer exists in
the main index but we need to keep it around", saving 8-bytes from
each ce and refcounting mess in the code.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html