Re: [RFC] Speed up "git status" by caching untracked file info

Karsten Blees <karsten.blees@xxxxxxxxx> · Tue, 22 Apr 2014 20:56:27 +0200

On 22.04.2014 12:35, Duy Nguyen wrote:
> On Tue, Apr 22, 2014 at 5:13 PM, Duy Nguyen <pclouds@xxxxxxxxx> wrote:
>>> IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so hopefully you did a dummy 'cache_name_exists("anything")' before starting the measurement of the first run?
>>
>> No I didn't. Thanks for pointing it out. I'll see if I can reduce its time.
> 
> Well name-hash is only used when core.ignorecase is set. So it's
> optional.

This is only true for the case-insensitive directory hash. The file hash ('cache_file_exists') is always used, so that the expensive excluded checks can be skipped for tracked files.

'cache_file_exists' basically trades higher setup costs for faster lookups, which makes perfect sense when scanning the entire work tree. However, if most of the directory info is cached and just a few directories need a refresh (and core.ignorecase=false), binary search ('cache_name_pos') may be better. The difficulty is to decide when to choose one over the other :-)
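
To make the trade-off concrete, here is a rough sketch in Python (for brevity, not git code). The cost model is purely an assumption for illustration: building the hash costs about n, each binary search about log2(n). 'name_pos' and 'make_lookup' are made-up names; 'name_pos' only mimics the 'cache_name_pos' convention of returning a negative value when the name is not found.

```python
import bisect
import math


def name_pos(sorted_names, name):
    """Binary search over a sorted list of names.

    Returns the index if found, otherwise -(insertion point) - 1.
    """
    i = bisect.bisect_left(sorted_names, name)
    if i < len(sorted_names) and sorted_names[i] == name:
        return i
    return -i - 1


def make_lookup(sorted_names, expected_lookups):
    """Pick a lookup strategy from a crude, assumed cost model.

    Hash: ~n setup plus ~1 per lookup. Binary search: ~log2(n) per lookup.
    Returns (strategy name, function mapping a name to True/False).
    """
    n = len(sorted_names)
    bsearch_cost = expected_lookups * math.log2(max(n, 2))
    hash_cost = n + expected_lookups
    if bsearch_cost > hash_cost:
        table = set(sorted_names)  # one-time setup, paid up front
        return "hash", table.__contains__
    return "bsearch", lambda name: name_pos(sorted_names, name) >= 0
```

With a mostly valid cache, 'expected_lookups' would be roughly the number of entries in the few directories that need a refresh, which is the number we'd have to estimate before deciding.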

> Maybe we could save it in a separate index extension, but we
> need to verify that the reader uses the same hash function as the
> writer.
> 
>>> Similarly, the '--directory' option controls early returns from the directory scan (via read_directory_recursive's check_only argument), so you won't be able to get a full untracked files listing if the cache was recorded with '--directory'. Additionally, '--directory' aggregates the state at the topmost untracked directory, so that directory's cached state depends on all sub-directories as well...
>>
>> I missed this. We could ignore check_only if caching is enabled, but
>> that does not sound really good. Let me think about it more..
> 
> We could save "check_only" to the cache as well. This way we don't
> have to disable the check_only trick completely.
> 
> So we process a directory with check_only set, find one untracked
> entry and stop short. We store check_only value and the status ("found
> something") in addition to dir mtime. Next time we check the dir's
> mtime. If it matches and is called with check_only set, we know there
> is at least one untracked entry, that's enough to stop r_d_r and
> return early. If dir mtime does not match, or r_d_r is called without
> check_only, we ignore the cached data and fall back to opendir.
> 
> Sounds good?
> 

What about untracked files in sub-directories? E.g. you have untracked dirs a/b with untracked file a/b/c, so normal 'git status' would list 'a/' as untracked.
Now, 'rm a/b/c' would update the mtime of b, but not of a, so the cached state of a still looks valid and you'd still list 'a/' as untracked. Same thing for 'echo "c" >a/b/.gitignore'.

Your solution could work if you additionally cache the directories that had to be scanned to find that first untracked file (but you probably had that in mind anyway).
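
A small sketch of the problem and of the fix (Python for brevity; 'mtimes', 'cache_is_valid' and 'demo' are made-up names, and the "cache" is just a dict of directory mtimes, not a proposal for the index format):

```python
import os
import tempfile


def mtimes(paths):
    """Record the current mtime (in nanoseconds) of each directory."""
    return {p: os.stat(p).st_mtime_ns for p in paths}


def cache_is_valid(cached):
    """Valid only if every recorded directory still has its recorded mtime."""
    for path, mtime in cached.items():
        try:
            if os.stat(path).st_mtime_ns != mtime:
                return False
        except FileNotFoundError:
            return False
    return True


def demo():
    with tempfile.TemporaryDirectory() as root:
        a = os.path.join(root, "a")
        b = os.path.join(a, "b")
        c = os.path.join(b, "c")
        os.makedirs(b)
        with open(c, "w") as f:
            f.write("c\n")
        # Backdate both directories so that a later change is visible
        # regardless of the filesystem's timestamp granularity.
        old = 1_000_000_000  # arbitrary time in the past, seconds since epoch
        os.utime(b, (old, old))
        os.utime(a, (old, old))
        only_a = mtimes([a])      # cache that records the top directory only
        a_and_b = mtimes([a, b])  # cache that records every scanned directory
        os.remove(c)              # touches b's mtime, but not a's
        return cache_is_valid(only_a), cache_is_valid(a_and_b)
```

On a POSIX filesystem I'd expect demo() to return (True, False): the cache that only knows about a still looks valid after the 'rm', while the one that also recorded b notices the change.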

If the cache is only used for certain dir_struct.flags combinations, you can probably get around saving the check_only flag: it can only ever be true if both DIR_SHOW_OTHER_DIRECTORIES and DIR_HIDE_EMPTY_DIRECTORIES are set, which is the default for 'git status'.
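
Roughly what I have in mind, again as a Python sketch. The bit values are illustrative only (the real constants are defined in dir.h), and 'check_only_possible' and 'cache_usable' are hypothetical helpers, not existing functions:

```python
# Illustrative bit values only (assumption); see dir.h for the real ones.
DIR_SHOW_OTHER_DIRECTORIES = 1 << 1
DIR_HIDE_EMPTY_DIRECTORIES = 1 << 2

CHECK_ONLY_FLAGS = DIR_SHOW_OTHER_DIRECTORIES | DIR_HIDE_EMPTY_DIRECTORIES


def check_only_possible(flags):
    """check_only can only ever be true if both flags are set."""
    return (flags & CHECK_ONLY_FLAGS) == CHECK_ONLY_FLAGS


def cache_usable(cached_flags, flags):
    """Use the cache only for the exact flags combination it was written with.

    Whether check_only was possible is then implied by the flags,
    so it does not need to be stored separately.
    """
    return cached_flags == flags
```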