On 17.04.2014 07:51, Nguyễn Thái Ngọc Duy wrote:
> This patch serves as a heads-up about a feature I'm working on. I hope
> that by posting it early, people could double-check whether I have made
> some fundamental mistakes that completely ruin the idea. It's about
> speeding up "git status" by caching untracked file info in the index
> _if_ your file system supports it (more below).
>
> The whole WIP series is at
>
>   https://github.com/pclouds/git/commits/untracked-cache
>
> I only post the real meat here. I'm aware of a few incomplete details
> in this patch, but nothing fundamentally wrong. So far the numbers are
> promising. ls-files is updated to run fill_directory() twice in a
> row, and "ls-files -o --directory --no-empty-directory --exclude-standard"
> (with gcc -O0) gives me:
>
>                first run    second (cached) run
>   gentoo-x86    500 ms          71.6 ms
>   wine          140 ms           9.72 ms
>   webkit        125 ms           6.88 ms

IIRC name_hash.c::lazy_init_name_hash took ~100ms on my system, so
hopefully you did a dummy 'cache_name_exists("anything")' before
starting the measurement of the first run?

>   linux-2.6     106 ms          16.2 ms
>
> Basically untracked time is cut to one tenth in the best-case
> scenario. The final numbers would be a bit higher because I haven't
> stored or read the cache from the index yet. Real commit message
> follows...
>
>
> read_directory() plays a big part in the slowness of "git status"
> because it has to read every directory and check for excluded entries,
> which is really expensive. This patch adds an option to cache the
> results so that after the first slow read_directory(), the following
> calls should be cheap and fast.
>
> The following inputs are sufficient to determine what files in a
> directory are excluded:
>
>  - The list of files and directories of the directory in question
>  - The $GIT_DIR/index
>  - The content of $GIT_DIR/info/exclude
>  - The content of core.excludesfile
>  - The content (or the lack) of .gitignore of all parent directories
>    from $GIT_WORK_TREE
>

The dir_struct.flags also play a big role in the evaluation of
read_directory(). E.g. it seems untracked files are not properly
recorded if the cache is filled with the '--ignored' option:

> @@ -1360,15 +1603,18 @@ static enum path_treatment read_directory_recursive(struct dir_struct *dir,
> 			break;
>
> 		case path_untracked:
> -			if (!(dir->flags & DIR_SHOW_IGNORED))
> -				dir_add_name(dir, path.buf, path.len);
> +			if (dir->flags & DIR_SHOW_IGNORED)
> +				break;
> +			dir_add_name(dir, path.buf, path.len);
> +			if (cdir.fdir)
> +				add_untracked(untracked, path.buf + baselen);
> 			break;

Similarly, the '--directory' option controls early returns from the
directory scan (via read_directory_recursive's check_only argument), so
you won't be able to get a full untracked-files listing if the cache was
recorded with '--directory'. Additionally, '--directory' aggregates the
state at the topmost untracked directory, so that directory's cached
state depends on all sub-directories as well...

I wonder if it makes sense to separate the cache recording logic from
read_directory_recursive() and friends, which are mainly concerned with
flags processing.

> If we can cheaply validate all those inputs for a certain directory,
> we are sure that the current code will always produce the same
> results, so we can cache and reuse those results.
>
> This is not a silver bullet approach. When you compile a C file, for
> example, the old .o file is removed and a new one with the same name
> is created, effectively invalidating the containing directory's
> cache. But at least with a large enough work tree, there could be many
> directories you never touch. The cache could help there.
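
Just to check that I understand the validation step: I picture one
record per directory, roughly like the sketch below. This is not taken
from your patch, all names are invented; only struct stat_data,
match_stat_data() and hashcmp() are the existing helpers from cache.h.

	/*
	 * Sketch only. One record per cached directory; if it validates,
	 * the untracked entries recorded for it can be reused without
	 * running the exclude machinery again.
	 */
	struct cached_dir_record {
		struct stat_data stat;          /* directory stat from the last scan */
		unsigned char exclude_sha1[20]; /* SHA-1 of ./.gitignore, null if absent */
		/* ... recorded untracked files and subdirs ... */
	};

	static int cached_dir_valid(const struct cached_dir_record *cd,
				    const char *path,
				    const unsigned char *gitignore_sha1)
	{
		struct stat st;

		if (lstat(path, &st))
			return 0;	/* directory is gone */
		if (match_stat_data(&cd->stat, &st))
			return 0;	/* entries added or removed */
		if (hashcmp(cd->exclude_sha1, gitignore_sha1))
			return 0;	/* .gitignore changed */
		return 1;		/* reuse the recorded results */
	}

(info/exclude, core.excludesfile and the .gitignore chain of the parent
directories would of course need the same kind of check.)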
> The first input can be checked using directory mtime. In many
> filesystems, directory mtime is updated when direct files/dirs are
> added or removed (*). If you do not use such a file system, this
> feature is not for you.
>
> The second one can be hooked from read-cache.c. Whenever a file (or a
> submodule) is added to or removed from a directory, we invalidate that
> directory. This will be done in a later patch.
>
> The remaining inputs are easy; their SHA-1 could be used to verify
> their contents. We do need to read .gitignore files and digest them,
> but they are usually few and small, so the overhead should not be
> much.
>
> At the implementation level, the whole directory structure is saved,
> and each directory corresponds to one struct untracked_dir.

With the usual options (e.g. a standard 'git status'), untracked
directories are mostly skipped, so the cache would mostly store tracked
directories. Naming it 'struct untracked_dir' is a bit confusing, IMO.

> Each directory holds the SHA-1 of the .gitignore underneath (or null
> if it does not exist) and the list of untracked "files" and subdirs
> that need to be recursed into if all is well. Untracked subdirectories
> are saved in the file queue and are the reason for quoting "files" in
> the previous sentence.
>
> On the first run, no untracked_dir is valid, so the default code path
> is run. prep_exclude() is updated to record the SHA-1 of .gitignore
> along the way. read_directory_recursive() is updated to record
> untracked files.
>
> On subsequent runs, read_directory_recursive() reads stat info of the
> directory in question and verifies whether files/dirs have been added
> or removed. With the help of prep_exclude() to verify the .gitignore
> chain, it may decide "all is well" and enable the fast path in
> treat_path(). read_directory_recursive() is still called for
> subdirectories even in the fast path, because a directory's mtime does
> not cover all subdirs recursively.
>
> So if all is really well, read_directory() becomes a series of
> open(".gitignore"), read(".gitignore"), close(), hash_sha1_file() and
> stat(<dir>) _without_ heavyweight exclude filtering. There should be
> no overhead if this feature is disabled.

Wouldn't the mtime of .gitignore files suffice here (so you don't need
to open and hash them every time)?
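
I.e. something roughly like this (again just a sketch with invented
names; fill_stat_data() and match_stat_data() are the helpers already
used for index entries):

	/* cached state of one .gitignore, filled via fill_stat_data() on the slow run */
	struct gitignore_stat {
		struct stat_data stat;
		int exists;
	};

	/* returns 1 if the exclude data loaded last time can be reused as-is */
	static int gitignore_unchanged(const struct gitignore_stat *gs, const char *path)
	{
		struct stat st;

		if (lstat(path, &st))
			return !gs->exists;	/* still absent -> unchanged */
		if (!gs->exists)
			return 0;		/* newly created */
		return !match_stat_data(&gs->stat, &st);
	}

That would trade the open/read/hash per .gitignore for a single
lstat(), with the same mtime-granularity caveats that index entries
already have.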