Re: inotify to minimize stat() calls

Jeff King <peff@xxxxxxxx> · Wed, 13 Feb 2013 13:18:51 -0500

On Wed, Feb 13, 2013 at 07:15:47PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Wed, Feb 13, 2013 at 3:48 AM, Karsten Blees <karsten.blees@xxxxxxxxx> wrote:
> > 2.) 0.135 s is spent in name-hash.c/hash_index_entry_directories, reindexing the same directories over and over again. In the end, the hashtable contains 939k directory entries, even though the WebKit test repo only has 7k directories. Checking if a directory entry already exists could reduce that, i.e.:
> 
> This function is only used when core.ignorecase = true. I probably
> won't be able to test this, so I'll leave this to other people who
> care about ignorecase.
> 
> This function used to have lookup_hash, but it was removed by Jeff in
> 2548183 (fix phantom untracked files when core.ignorecase is set -
> 2011-10-06). There's a looong commit message which I'm too lazy to
> read. Anybody who works on this should though.

Yeah, the problem that commit tried to solve is that linking to a single
cache entry through the hash is not enough, because we may remove cache
items. Imagine you have "dir/one" and "dir/two", and you add them to the
in-memory index in that order. The original code hashed "dir/" and
inserted a link to the "dir/one" cache entry. When it came time to put
in the "dir/two" entry, we noticed that there was already a "dir/" entry
and did nothing. Then later, if we remove "dir/one", we do so by marking
it with CE_UNHASHED. So a later query for "dir/" will see "nope, nothing
here that wasn't CE_UNHASHED", which is wrong. We never recorded that
"dir/two" existed under the hash for "dir/", so we can't know about it.

My patch just stores the cache_entry for both under the "dir/" hash.
As Karsten noticed, that can lead to a large number of hash entries,
because adding "some/deep/hierarchy/with/files" will add 4 directory
entries for just that single file. Moreover, looking at it again, I
don't think my patch produces the right behavior: we have a single
dir_next pointer, even though the same ce_entry may appear under many
directory hashes. So the cache_entries that has to "dir/foo/" and those
that hash to "dir/bar/" may get confused, because they will also both be
found under "dir/", and both try to create a linked list from the
dir_next pointer.

Looking at Karsten's patch, it seems like it will not add a cache entry
if there is one of the same name. But I'm not sure if that is right, as
the old one might be CE_UNHASHED (or it might get removed later). You
actually want to be able to find each cache_entry that has a file under
the directory at the hash of that directory, so you can make sure it is
still valid.

And of course that still leaves the existing correctness problem I
mentioned above.

I think the best way forward is to actually create a separate hash table
for the directory lookups. I note that we only care about these entries
in directory_exists_in_index_icase, which is really about whether
something is there, versus what exactly is there. So could we maybe get
by with a separate hash table that stores a count of entries at each
directory, and increment/decrement the count when we add/remove entries?

The biggest problem I see with that is that we do indeed care a little
bit what is at the directory: we check the mode to see if it is a gitdir
or not. But I think we can maybe sneak around that: gitdirs have actual
entries in the index, whereas the directories do not. So we would find
them via index_name_exists; anything that is not there, but _is_ in the
special directory hash would therefore be a directory.

I realize it got pretty esoteric there in the middle. I'll see if I can
work up a patch that expresses what I'm thinking.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html