On Fri, Jul 1, 2016 at 1:54 AM, Ben Peart <peartben@xxxxxxxxx> wrote:
> I've found (at least on Windows) that as the repo size gets larger, the
> time to read the index becomes a much smaller percentage of the overall
> time. I just captured some perf traces of git status on a large repo we
> have. Of that, 92.5% was spent in git!read_directory and only 4.0% was
> spent in git!read_index. Of that 4%, 2.6% was git!glk_SHA1_Update.

OK.. Windows. I don't have Windows numbers, but "git!read_directory"
(is that read_directory() in the "git" process?) does not do a lot of
stat'ing, although I agree that it dominates git-status. I recorded
some numbers in the nd/untracked-cache topic (see 9e59724
(update-index: manually enable or disable untracked cache -
2015-03-08)).

The first step would be enabling that, because besides directory
traversal this code does a lot of string processing that is hopefully
eliminated by the untracked cache extension. I cut git-status' time in
half with it. The side effect, though, is that it creates a new flow of
stat(), one per directory. It would be great if you could do some
measurements with untracked cache on Windows and see if we get a
similar gain.

> Given there were no dirty files, Watchman would have made a huge
> improvement in the overall time but index helper would have had
> relatively little impact. We've noticed this same pattern in all our
> repos which is what is driving our interest in the Watchman model and
> also shows why index-helper is of less interest.

Assuming that untracked cache cuts git-status time in half on Windows
as well, index read time then takes a bigger share, 8%, where it starts
to make sense to optimize it.

On a quiet repository, having watchman does not help that much because
we have already eliminated a significant number of filesystem-related
system calls. There is still lstat(), and eliminating it (with
watchman) may gain some more; it matters on a repo with lots of
directories. You could probably just take these lstat() calls out (I
can help point out where) and see how much the gain is.

Assuming (blindly again) that removing lstat() helps by something like
10% of read_index(), we would then need to profile the untracked cache
code and see what we can do next. There is still a lot of directory
traversal there (except that it traverses the cache instead of the
filesystem) and maybe we can do something about it. But I haven't
gotten that far.

> While the current design hides watchman behind index-helper, if we were
> to change that model so they were independent, we would be interested
> in doing it in such a way that provided some abstraction so that it
> could be replaced with another file watching daemon.

Frankly I'm not that interested in making the file watching daemon
replaceable. My first attempt at this problem was "file-watcher", which
was tied to the Linux inotify system only, and there it would have made
sense to make it replaceable. But watchman now supports OS X, Linux,
FreeBSD and Windows, not just Linux as before, so why add another
abstraction layer? You could even replace the "watchman.exe" binary: as
long as you produce the same JSON data, your new daemon should still
work.

Tying the index caching daemon and the file watching daemon (let's
avoid "watchman" for now) together gives us a bonus. Because both git
and the caching daemon know that they read the same index, we can
answer the question "what files are dirty?" with "entries 1, 5 and 9 in
the index" instead of sending full paths that git then has to look up
again to identify the entries. In this series we send it as a
compressed bitmap. Probably not a big deal in terms of real
performance, but it feels nice to do fewer lookups.
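To make that concrete, here is a minimal, self-contained C sketch of
the "answer with index entry numbers" idea. It uses a plain word-array
bitmap instead of the compressed bitmap the series actually sends, and
all names below are invented for illustration, not taken from the
index-helper code:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* One bit per index entry; bit i set means "entry i is dirty".
 * (Illustrative stand-in for the compressed bitmap in the series.) */
struct dirty_bitmap {
	size_t nr_entries;
	uint64_t *words;
};

static struct dirty_bitmap *bitmap_new(size_t nr_entries)
{
	struct dirty_bitmap *b = malloc(sizeof(*b));
	b->nr_entries = nr_entries;
	b->words = calloc((nr_entries + 63) / 64, sizeof(uint64_t));
	return b;
}

/* Daemon side: mark entry nr dirty when the watcher reports a change. */
static void bitmap_set(struct dirty_bitmap *b, size_t nr)
{
	b->words[nr / 64] |= (uint64_t)1 << (nr % 64);
}

/* Git side: walk the bitmap and touch only the flagged entries. No path
 * strings cross the wire and no name lookups are needed, because both
 * sides read the same index and agree on the entry order. */
static void for_each_dirty(struct dirty_bitmap *b,
			   void (*fn)(size_t nr, void *data), void *data)
{
	size_t i;
	for (i = 0; i < b->nr_entries; i++)
		if (b->words[i / 64] & ((uint64_t)1 << (i % 64)))
			fn(i, data);
}

static void report(size_t nr, void *data)
{
	const char **paths = data;
	printf("dirty: entry %zu (%s)\n", nr, paths[nr]);
}

int main(void)
{
	/* Pretend these are the paths in a shared, sorted index. */
	const char *paths[] = { "Makefile", "cache.h", "read-cache.c", "wt-status.c" };
	struct dirty_bitmap *b = bitmap_new(4);

	bitmap_set(b, 1);	/* cache.h changed */
	bitmap_set(b, 3);	/* wt-status.c changed */
	for_each_dirty(b, report, paths);

	free(b->words);
	free(b);
	return 0;
}

The actual series sends a compressed bitmap rather than raw words, so
long runs of clean entries cost almost nothing on the wire.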
The second reason is that the watchman daemon alone does not provide
enough information to reduce untracked cache's lstat() as much as
possible. The current approach in this series is a naive one, which
works for some cases but is not optimal (I'll get to that). We need a
separate long-running daemon to maintain the extra info needed to
reduce lstat(). Because our target is watchman, it does not make sense
to add yet another daemon besides index-helper to do this.

OK, the optimal lstat() reduction thing. Right now, if any file in a
directory is updated, the directory is invalidated in the untracked
cache and we need to traverse it to collect excluded files again. But
it does not have to be that way. We don't care whether a file is
_updated_, because that will not change the untracked cache output. We
care about what files are _added_ or _deleted_. New files need to be
classified as tracked, untracked or ignored. Deleted files may
invalidate any of those three file lists.

Watchman cannot answer "what files were added or deleted since point X
in time", and I agree that it's not watchman's job (watchman issue 65).
So we have to maintain some more info ourselves, e.g. the list of files
at any requested "clock". With that we can compare the file lists of
two "clock"s and tell git what files were added or deleted (see the
sketch below).
--
Duy
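To illustrate that last step, here is a rough, self-contained C sketch
of comparing the file lists of two "clock"s. It assumes the daemon
keeps a sorted snapshot of paths per clock; the names and layout are
invented for illustration, not taken from any existing daemon:

#include <stdio.h>
#include <string.h>

/* Walk two sorted snapshots in lockstep and report paths that exist in
 * only one of them. Files that were merely updated appear in both
 * snapshots and are skipped, which is why this comparison is enough to
 * keep the untracked cache valid without re-traversing the directory. */
static void diff_snapshots(const char **old_paths, int nr_old,
			   const char **new_paths, int nr_new)
{
	int i = 0, j = 0;

	while (i < nr_old || j < nr_new) {
		int cmp;

		if (i >= nr_old)
			cmp = 1;
		else if (j >= nr_new)
			cmp = -1;
		else
			cmp = strcmp(old_paths[i], new_paths[j]);

		if (cmp < 0)
			printf("deleted: %s\n", old_paths[i++]);
		else if (cmp > 0)
			printf("added:   %s\n", new_paths[j++]);
		else {
			i++;	/* present in both, possibly updated: ignore */
			j++;
		}
	}
}

int main(void)
{
	/* Snapshot at clock A and at clock B (both sorted). */
	const char *at_clock_a[] = { "dir/a.c", "dir/b.c", "dir/c.c" };
	const char *at_clock_b[] = { "dir/a.c", "dir/c.c", "dir/d.c" };

	diff_snapshots(at_clock_a, 3, at_clock_b, 3);
	return 0;
}

Running it prints "deleted: dir/b.c" and "added: dir/d.c", which is
exactly the information git needs in order to invalidate only the
affected untracked cache entries.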