Re: [RFC] Speed up "git status" by caching untracked file info

Duy Nguyen <pclouds@xxxxxxxxx> · Wed, 23 Apr 2014 07:52:46 +0700

On Wed, Apr 23, 2014 at 1:56 AM, Karsten Blees <karsten.blees@xxxxxxxxx> wrote:
> On 22.04.2014 12:35, Duy Nguyen wrote:
>> On Tue, Apr 22, 2014 at 5:13 PM, Duy Nguyen <pclouds@xxxxxxxxx> wrote:
>>>> IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so hopefully you did a dummy 'cache_name_exists("anything")' before starting the measurement of the first run?
>>>
>>> No I didn't. Thanks for pointing it out. I'll see if I can reduce its time.
>>
>> Well name-hash is only used when core.ignorecase is set. So it's
>> optional.
>
> This is only true for the case-insensitive directory hash. The file hash ('cache_file_exists') is always used to skip expensive excluded checks for tracked files.
>
> 'cache_file_exists' basically trades higher setup costs for faster lookups, which makes perfect sense when scanning the entire work tree. However, if most of the directory info is cached and just a few directories need refresh (and core.ignorecase=false), binary search ('cache_name_pos') may be better. The difficulty is to decide when to choose one over the other :-)

Right. The problem is that even if the untracked cache is used, we
don't know in advance how many cache_file_exists calls we need to
make. If .gitignore changes, we could see how many directories are
invalidated recursively, and that could be an indicator for favoring
cache_file_exists over cache_name_pos. It's harder when a dir mtime
changes. I suppose we could be optimistic and stick to cache_name_pos
until the number of calls goes over a limit, then switch to
cache_file_exists. That may backfire though.
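
A rough sketch of that optimistic strategy, in Python rather than git's
C for brevity. AdaptiveNameLookup, limit and the value 64 are made up
for illustration. The sorted list stands in for the index that
cache_name_pos searches, and the set stands in for the name hash
behind cache_file_exists. This is not how git implements either.

```python
import bisect


class AdaptiveNameLookup:
    """Binary search first; build a hash set once lookups exceed a limit."""

    def __init__(self, sorted_names, limit=64):
        self.names = sorted_names   # stand-in for the sorted index entries
        self.limit = limit          # hypothetical switch-over threshold
        self.calls = 0
        self.name_set = None        # built lazily, only when it pays off

    def exists(self, name):
        self.calls += 1
        if self.name_set is None and self.calls > self.limit:
            # One-time setup cost, paid only after many lookups.
            self.name_set = set(self.names)
        if self.name_set is not None:
            return name in self.name_set
        # Cheap path: O(log n) per lookup, no setup cost.
        i = bisect.bisect_left(self.names, name)
        return i < len(self.names) and self.names[i] == name
```

The backfire case is visible here: a run that makes limit + 1 lookups
pays for the binary searches and the full hash setup.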

>
>> Maybe we could save it in a separate index extension, but we
>> need to verify that the reader uses the same hash function as the
>> writer.
>>
>>>> Similarly, the '--directory' option controls early returns from the directory scan (via read_directory_recursive's check_only argument), so you won't be able to get a full untracked files listing if the cache was recorded with '--directory'. Additionally, '--directory' aggregates the state at the topmost untracked directory, so that directory's cached state depends on all sub-directories as well...
>>>
>>> I missed this. We could ignore check_only if caching is enabled, but
>>> that does not sound really good. Let me think about it more..
>>
>> We could save "check_only" to the cache as well. This way we don't
>> have to disable the check_only trick completely.
>>
>> So we process a directory with check_only set, find one untracked
>> entry and stop short. We store check_only value and the status ("found
>> something") in addition to dir mtime. Next time we check the dir's
>> mtime. If it matches and is called with check_only set, we know there
>> is at least one untracked entry, that's enough to stop r_d_r and
>> return early. If dir mtime does not match, or r_d_r is called without
>> check_only, we ignore the cached data and fall back to opendir.
>>
>> Sounds good?
>>
>
> What about untracked files in sub-directories? E.g. you have untracked dirs a/b with untracked file a/b/c, so normal 'git status' would list 'a/' as untracked.
> Now, 'rm a/b/c' would update mtime of b, but not of a, so you'd still list 'a/' as untracked. Same thing for 'echo "c" >a/b/.gitignore'.
>
> Your solution could work if you additionally cache the directories that had to be scanned to find that first untracked file (but you probably had that in mind anyway).

Basically all directories that are touched by r_d_r() will be cached.
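
A minimal sketch of the validation this implies, in Python for brevity.
CachedDir, cache_is_usable and get_mtime are hypothetical names, not
git code. One assumption beyond what is discussed above: an entry
recorded by a full scan (check_only unset) may also answer a check_only
query.

```python
from dataclasses import dataclass, field


@dataclass
class CachedDir:
    mtime: int               # directory mtime when it was scanned
    check_only: bool         # was the scan cut short by check_only?
    found_untracked: bool    # status recorded by that scan
    scanned_subdirs: list = field(default_factory=list)  # dirs the scan entered


def cache_is_usable(cache, path, check_only, get_mtime):
    """Return True if the cached result for path can replace opendir()."""
    entry = cache.get(path)
    if entry is None or entry.mtime != get_mtime(path):
        return False
    # A short-circuited scan can only answer another short-circuit query.
    if entry.check_only and not check_only:
        return False
    # Every sub-directory the scan walked through must be unchanged too.
    return all(cache_is_usable(cache, sub, check_only, get_mtime)
               for sub in entry.scanned_subdirs)
```

With the a/b/c example: the entry for "a" lists "a/b" in
scanned_subdirs, so 'rm a/b/c' changes the mtime of a/b and invalidates
the cached "a/ is untracked" result even though the mtime of a is the
same.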

> If the cache is only used for certain dir_struct.flags combinations, you can probably get around saving the check_only flag: it can only ever be true if both DIR_SHOW_OTHER_DIRECTORIES and DIR_HIDE_EMPTY_DIRECTORIES are set, which is the default for 'git status'.
-- 
Duy