Re: [PATCH v13 11/20] index-helper: use watchman to avoid refreshing index with lstat()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Jul 1, 2016 at 1:54 AM, Ben Peart <peartben@xxxxxxxxx> wrote:
> I've found (at least on Windows) that as the repo size gets larger, the
> time to read the index becomes a much smaller percentage of the overall
> time.  I just captured some perf traces of git status on a large repo we
> have.  Of that, 92.5% was spent in git!read_directory and only 4.0% was
> spent in git!read_index.  Of that 4%, 2.6% was git!glk_SHA1_Update.

OK.. Windows. I don't have Windows numbers, but "git!read_directory"
(is that read_directory() in "git" process?) does not do a lot of
stat'ing. Although I agree that it dominates git-status. I recorded
some numbers in nd/untracked-cache topic (see 9e59724 (update-index:
manually enable or disable untracked cache - 2015-03-08)).

First step would be enabling that because besides directory
traversing, this code does a lot of string processing that's hopefully
eliminated by untracked cache extension. I cut git-status' time in
half with it. The side effect though, is that it creates a new flow of
stat(), one per directory. It would be great if you could do some
measurements with untracked cache on Windows and see if we get similar
gain.

> Given there were no dirty files, Watchman would have made a huge
> improvement in the overall time but index helper would have had
> relatively little impact.  We've noticed this same pattern in all our
> repos which is what is driving our interest in the Watchman model and
> also shows why index-helper is of less interest.

Assuming that untracked cache cuts git-status time by half on Windows
as well, then index read time now takes a bigger percentage, 8%, where
it starts to make sense to optimize it.

On a quiet repository, having watchman does not help so much because
we already reduce the significant number of filesystem-related system
calls. Yes there is lstat() and eliminating it may gain some more
(with watchman) and it matters on a repo with lots of directories. You
probably can just take these lstat out (I can help point out where)
and see how much the gain is.

Assuming (blindly again) that removing lstat helps like 10% of
read_index(), we would need to profile untracked cache code and see
where what we can do next. There are still a lot of directory
traversing there (except that it traverses the cache instead of
filesystem) and maybe we can do something. But I haven't gotten that
far.

> While the current design hides watchman behind index-helper, if we were
> to change that model so they were independent, we would be interested
> in doing it in such a way that provided some abstraction so that it
> could be replaced with another file watching daemon.

Frankly I'm not that interested in replacing another file watching
daemon. My first attempt at this problem was "file-watcher" which was
tied to Linux inotify system only and it would make sense to make it
replaceable. But watchman supports OS X, Linux, FreeBSD and Windows
now, not just Linux only as before, why another abstraction layer? You
could even replace "watchman.exe" binary. As long as you produce the
same json data, your new daemon should still work.

Tying index caching daemon and file watching daemon (let's avoid
"watchman" for now) gives us a bonus. Because both git and the caching
daemon know that they read the same index, we could answer the
question "what files are dirty?" with "file number 1, 5, 9 in the
index" instead of sending full paths and git has to make some more
lookups to identify them. In this series we send it as a compressed
bit map. Probably not a big deal in terms of real performance, but it
feels nice to do lookups less.

The second reason is because watchman daemon alone does not provide
enough information to reduce untracked cache's lstat() as much as
possible. The current approach in this series is a naive one, which
works for some cases, but not optimal (I'll get to that). We need a
separate long-running daemon to maintain extra info to reduce lstat().
Because our target is watchman, it does not make sense to add yet
another daemon besides index-helper to do this.

OK the optimal lstat() reduction thing. Right now, if any file in a
directory is updated, the directory is invalidated in untracked cache
and we need to traverse it to collect excluded files again. But it
does not have to be that way. We don't care if any file is _updated_
because it will not change untracked cache output. We care about what
files are _added_ or _deleted_. New files will need to be classified
as either tracked, untracked or ignored. Deleted files may invalid
either three file lists. Watchman cannot answer "what files are added
or deleted since the point X in time" and I agree that it's not
watchman's job (watchman issue 65). So we have to maintain some more
info by ourselves, e.g. the list of files at any requested "clock".
With that we can compare the file lists of two "clock"s and tell git
what files are added or deleted.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]