Hi Patrick, On Wed, Feb 9, 2022 at 2:42 AM Patrick Marlier <patrick.marlier@xxxxxxxxx> wrote: > > Dear git Developers, > > In a big repository with a lot of untracked directories and files > (build in tree), "git clean -ffdx" can be optimized. Indeed, "git > clean" goes recursively into all untracked and nested directories to > look for .git files even if "-ff" is specified. Yeah, seems like a waste of work. Thanks for digging in. > Using breakpoint on stat or "strace -e newfstatat", it is possible to > see the recursing search for ".git" and ".git/HEAD". Also it seems to > traverse the untracked directories a few times, which I am not sure > why. There are two steps -- collect the list of things that are removable, and then a separate step to remove the things that are removable. It can be the case that we recurse into the same directory twice, once for each step. For the first step, see the fill_directory() call. If you left off "-x" or only had one "-f", or your directories had some tracked files, then fill_directory()'s job is finding out which things are removable and it'd likely only be a subset of those directories. For the second step, some of those removable things may be entire directories. So when it hits one of those and later calls remove_dirs() to remove it, it has to recurse again. > Using "-ff" should not check for nested .git and no need to recurse if > the directory is already untracked. Not quite; the -x and lack of -e options is critical here, otherwise we cannot just nuke the whole directory. (Also, -d is important, but that just kind of goes without saying.) > Doing the following, it seems to avoid looking for nested .git and all > tests are passing. > > @@ -1007,6 +1008,12 @@ int cmd_clean(int argc, const char **argv, > const char *prefix) > * the code clearer to exclude it, though. > */ > dir.flags |= DIR_KEEP_UNTRACKED_CONTENTS; > + > + /* > + * No need to go to deeper in directories if already untracked > + */ > + if (rm_flags == 0) > + dir.flags |= DIR_NO_GITLINKS; > } > > if (read_cache() < 0) > > However reading the documentation of DIR_NO_GITLINKS seems to say that > is not the right fix. I think DIR_NO_GITLINKS has an unfortunately poor description. The fill_directory() API has many different callers who use it differently, see commit 8d92fb2927 ("dir: replace exponential algorithm with a linear one", 2020-04-01) for some description of this. I often ran into problems where descriptions of variables and items in the documentation tended to focus on one of those modes in such a way as to make the other modes that also called in be confusing. I think the author of the current text was just thinking of one of those other callers, probably 'git status'. Looking at the code, it's intended purpose is just what you're using it for. (Also, I probably should have documented DIR_SKIP_NESTED_GIT when I added it in commit 09487f2cba ("clean: avoid removing untracked files in a nested git repository", 2019-09-17); the fact that it is similar and undocumented isn't helping matters, although at the time I didn't know there was documentation for any of the flags given that they were in an entirely separate file.) Anyway, documentation issues aside, I think this is the flag you want to use, but this is not the right place to set it. I think you should set it where rm_flags is set to 0 instead. Further, setting this flag is only solving part of the performance problem. We'll still recurse into the directories to look for ignored files, which should be avoided since we don't need to differentiate. That basically means that we need to avoid all the code in the current if (remove_directories && !ignored_only) block, whenever (remove_directories && ignored && !exclude_list.nr && force > 1). > Another thing to note is that it shows "Removing XXX" but it shows it > when the directory is already gone. So we could change to "Removed > XXX" or display the "Removing XXX" before starting to remove the > directory. I commented on Bagas' patch, but I think this is minutiae that won't be relevant to the end user, and would rather either ignore it or just move the print statement earlier rather than increasing work for translators. That is, unless we tend to use past tense elsewhere in the UI and we want to make a concerted effort to convert to using past tense. > Thanks in advance for any fix or help in getting it right. You've clearly taken the time to investigate, and you found the basic solution too. That's pretty good, especially considering that you're dealing with the dragons in dir.[ch]. We could potentially have 1-3 patches here: * avoiding the unnecessary checks for is_nonbare_repository_dir() via setting DIR_NO_GITLINKS. This is basically your change, just moved slightly. * avoiding the unnecessary attempts to differentiate untracked and ignored via avoiding the "if (remove_directories && !ignored_only)" code block when appropriate, as highlighted above * fixing up the documentation for DIR_NO_GITLINKS and adding some for DIR_SKIP_NESTED_GIT. You've already done most the work for the first one, you'd just need to move the line elsewhere and add a commit message (though adding a testcase might be nice too). Are you up for that? Would you also like to tackle either of the other two items? If you combine the first and the second, you might find be able to generate a good testcase by running GIT_TRACE2_PERF="$(pwd)/.git/trace.output" git clean -ffdx and then grepping the trace.output file for "directories-visited" and "paths-visited" and checking that the new flags indeed reduce both of those numbers. Anyway, I'll be happy to review your patch(es) and will take whichever parts you don't want to tackle. Thanks for digging in, reporting, and helping make Git better!