On Wed, Feb 13, 2013 at 07:15:47PM +0700, Nguyen Thai Ngoc Duy wrote: > On Wed, Feb 13, 2013 at 3:48 AM, Karsten Blees <karsten.blees@xxxxxxxxx> wrote: > > 2.) 0.135 s is spent in name-hash.c/hash_index_entry_directories, reindexing the same directories over and over again. In the end, the hashtable contains 939k directory entries, even though the WebKit test repo only has 7k directories. Checking if a directory entry already exists could reduce that, i.e.: > > This function is only used when core.ignorecase = true. I probably > won't be able to test this, so I'll leave this to other people who > care about ignorecase. > > This function used to have lookup_hash, but it was removed by Jeff in > 2548183 (fix phantom untracked files when core.ignorecase is set - > 2011-10-06). There's a looong commit message which I'm too lazy to > read. Anybody who works on this should though. Yeah, the problem that commit tried to solve is that linking to a single cache entry through the hash is not enough, because we may remove cache items. Imagine you have "dir/one" and "dir/two", and you add them to the in-memory index in that order. The original code hashed "dir/" and inserted a link to the "dir/one" cache entry. When it came time to put in the "dir/two" entry, we noticed that there was already a "dir/" entry and did nothing. Then later, if we remove "dir/one", we do so by marking it with CE_UNHASHED. So a later query for "dir/" will see "nope, nothing here that wasn't CE_UNHASHED", which is wrong. We never recorded that "dir/two" existed under the hash for "dir/", so we can't know about it. My patch just stores the cache_entry for both under the "dir/" hash. As Karsten noticed, that can lead to a large number of hash entries, because adding "some/deep/hierarchy/with/files" will add 4 directory entries for just that single file. Moreover, looking at it again, I don't think my patch produces the right behavior: we have a single dir_next pointer, even though the same ce_entry may appear under many directory hashes. So the cache_entries that has to "dir/foo/" and those that hash to "dir/bar/" may get confused, because they will also both be found under "dir/", and both try to create a linked list from the dir_next pointer. Looking at Karsten's patch, it seems like it will not add a cache entry if there is one of the same name. But I'm not sure if that is right, as the old one might be CE_UNHASHED (or it might get removed later). You actually want to be able to find each cache_entry that has a file under the directory at the hash of that directory, so you can make sure it is still valid. And of course that still leaves the existing correctness problem I mentioned above. I think the best way forward is to actually create a separate hash table for the directory lookups. I note that we only care about these entries in directory_exists_in_index_icase, which is really about whether something is there, versus what exactly is there. So could we maybe get by with a separate hash table that stores a count of entries at each directory, and increment/decrement the count when we add/remove entries? The biggest problem I see with that is that we do indeed care a little bit what is at the directory: we check the mode to see if it is a gitdir or not. But I think we can maybe sneak around that: gitdirs have actual entries in the index, whereas the directories do not. So we would find them via index_name_exists; anything that is not there, but _is_ in the special directory hash would therefore be a directory. I realize it got pretty esoteric there in the middle. I'll see if I can work up a patch that expresses what I'm thinking. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html