Re: [RFC] Speed up "git status" by caching untracked file info

Duy Nguyen <pclouds@xxxxxxxxx> · Tue, 22 Apr 2014 17:13:40 +0700

On Tue, Apr 22, 2014 at 4:56 PM, Karsten Blees <karsten.blees@xxxxxxxxx> wrote:
> On 17.04.2014 07:51, Nguyễn Thái Ngọc Duy wrote:
>> This patch serves as a heads up about a feature I'm working on. I hope
>> that by posting it early, people could double check if I have made
>> some fundamental mistakes that completely ruin the idea. It's about
>> speeding up "git status" by caching untracked file info in the index
>> _if_ your file system supports it (more below).
>>
>> The whole WIP series is at
>>
>> https://github.com/pclouds/git/commits/untracked-cache
>>
>> I only post the real meat here. I'm aware of a few incomplete details
>> in this patch, but nothing fundamentally wrong. So far the numbers are
>> promising.  ls-files is updated to run fill_directory() twice in a
>> row and "ls-files -o --directory --no-empty-directory --exclude-standard"
>> (with gcc -O0) gives me:
>>
>>            first run  second (cached) run
>> gentoo-x86    500 ms             71.6  ms
>> wine          140 ms              9.72 ms
>> webkit        125 ms              6.88 ms
>
> IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so hopefully you did a dummy 'cache_name_exists("anything")' before starting the measurement of the first run?

No, I didn't. Thanks for pointing it out. I'll see if I can reduce its time.

>> The following inputs are sufficient to determine what files in a
>> directory are excluded:
>>
>>  - The list of files and directories of the directory in question
>>  - The $GIT_DIR/index
>>  - The content of $GIT_DIR/info/exclude
>>  - The content of core.excludesfile
>>  - The content (or the lack) of .gitignore of all parent directories
>>    from $GIT_WORK_TREE
>>
>
> The dir_struct.flags also play a big role in evaluation of read_directory.
>
> E.g. it seems untracked files are not properly recorded if the cache is filled with '--ignored' option:

Yeah. dir_struct.flags will be part of the input. I intend to optimize
"git status" case only, so if it matches the recorded
dir_struct.flags, the cache is used. Else the cache is ignored.
Caching --ignored is not so interesting, because the list of ignored
files could be huge, while untracked listing is usually small.
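
A minimal sketch of that check, assuming the flags are stored next to
the cached data. The struct and function names are hypothetical and
not taken from the WIP series:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the cache header (not git's layout). */
struct untracked_cache_sketch {
	unsigned recorded_flags; /* dir_struct.flags in effect when the cache was written */
	bool valid;              /* false until the cache has been populated */
};

/* Use the cache only when the current flags match the recorded ones exactly. */
static bool cache_usable(const struct untracked_cache_sketch *uc,
			 unsigned current_flags)
{
	return uc != NULL && uc->valid && uc->recorded_flags == current_flags;
}
```

Any flag difference (e.g. DIR_SHOW_IGNORED being set) makes the check
fail, and the caller falls back to the normal uncached scan.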

>> @@ -1360,15 +1603,18 @@ static enum path_treatment read_directory_recursive(struct dir_struct *dir,
>>                       break;
>>
>>               case path_untracked:
>> -                     if (!(dir->flags & DIR_SHOW_IGNORED))
>> -                             dir_add_name(dir, path.buf, path.len);
>> +                     if (dir->flags & DIR_SHOW_IGNORED)
>> +                             break;
>> +                     dir_add_name(dir, path.buf, path.len);
>> +                     if (cdir.fdir)
>> +                             add_untracked(untracked, path.buf + baselen);
>>                       break;
>
> Similarly, the '--directory' option controls early returns from the directory scan (via read_directory_recursive's check_only argument), so you won't be able to get a full untracked files listing if the cache was recorded with '--directory'. Additionally, '--directory' aggregates the state at the topmost untracked directory, so that directory's cached state depends on all sub-directories as well...

I missed this. We could ignore check_only if caching is enabled, but
that does not sound really good. Let me think about it more..

>
> I wonder if it makes sense to separate cache recording logic from read_directory_recursive and friends, which are mainly concerned with flags processing.

The core code path is still shared though, or we would duplicate r_d_r
entirely for cache recording, which sounds like a maintenance
nightmare.

>> At the implementation level, the whole directory structure is saved,
>> each directory corresponds to one struct untracked_dir.
>
> With the usual options (e.g. standard 'git status'), untracked directories are mostly skipped, so the cache would mostly store tracked directories. Naming it 'struct untracked_dir' is a bit confusing, IMO.

It's actually just "directories". We may need to store both tracked
and untracked directories. Maybe rename it to cached_dir.
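
To make the idea concrete, here is a rough sketch of what such a
per-directory record could look like. All field names are assumptions
for illustration, not the layout used in the series:

```c
#include <stdbool.h>
#include <string.h>
#include <time.h>

/* Hypothetical layout for illustration only; field names are assumptions. */
struct cached_dir_sketch {
	const char *name;                /* path component of this directory */
	time_t dir_mtime;                /* mtime from stat(<dir>) at record time */
	unsigned char exclude_sha1[20];  /* hash of this directory's .gitignore */
	const char **untracked;          /* recorded untracked entries */
	size_t untracked_nr;
	struct cached_dir_sketch **dirs; /* sub-directories, tracked or untracked */
	size_t dirs_nr;
};

/*
 * The recorded untracked list is reusable only if neither the directory
 * nor its .gitignore changed. Comparing mtime for equality is a
 * simplification; real code also has to deal with racy timestamps.
 */
static bool cached_dir_still_valid(const struct cached_dir_sketch *d,
				   time_t mtime_now,
				   const unsigned char sha1_now[20])
{
	return d->dir_mtime == mtime_now &&
	       memcmp(d->exclude_sha1, sha1_now, 20) == 0;
}
```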

>> So if all is really well, read_directory() becomes a series of
>> open(".gitignore"), read(".gitignore"), close(), hash_sha1_file() and
>> stat(<dir>) _without_ heavyweight exclude filtering. There should be
>> no overhead if this feature is disabled.
>>
>
> Wouldn't mtime of .gitignore files suffice here (so you don't need to open and parse them every time)?

That's a further optimization. With the current code it's simpler to
open .gitignore. Assume you have a path a/b/c. a/.gitignore's stat
info is good, so you skip opening it. Then you find a/b/.gitignore is
modified and you need to recompute untracked files in a/b. To do that
you need a/.gitignore as well. Lazily opening a/.gitignore at this
stage is possible, but trickier (you have to make sure the rules are
in correct order because of negative patterns).
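
To illustrate why the order matters, here is a toy evaluator in which
the last matching rule wins and a leading "!" negates. It is a
deliberately simplified model (plain fnmatch() on a basename, no
directory handling), and is_excluded_sketch is a made-up name, not
git's matcher:

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: rules are ordered from lowest to highest precedence and
 * the last rule that matches decides. Returns true if name is excluded.
 */
static bool is_excluded_sketch(const char *const *rules, size_t n,
			       const char *name)
{
	bool excluded = false;
	size_t i;

	for (i = 0; i < n; i++) {
		const char *pat = rules[i];
		bool negated = pat[0] == '!';

		if (negated)
			pat++; /* skip the '!' before matching */
		if (fnmatch(pat, name, 0) == 0)
			excluded = !negated;
	}
	return excluded;
}
```

With the rules { "*.o", "!keep.o" } the file keep.o is not excluded,
but with the same two rules in the opposite order it is. So rules
loaded lazily from a/.gitignore must still end up before those from
a/b/.gitignore in the evaluation order.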

Anyway, the number of .gitignore files is usually small. We can
already avoid trying to open non-existent .gitignore files, a cost
that is proportional to the number of directories.
-- 
Duy