This patch series builds on en/fill-directory-fixes-more. This series should be considered an RFC because of the untracked-cache changes (see the last two commits), for which I'm hoping to get an untracked-cache expert to comment. This series does provide some modest speedups (see second to last commit message), and should allow 'git status --ignored' to complete in a more reasonable timeframe for Martin Melka (see https://lore.kernel.org/git/CANt4O2L_DZnMqVxZzTBMvr=BTWqB6L0uyORkoN_yMHLmUX7yHw@xxxxxxxxxxxxxx/ ) Changes since v1: * Replaced patch 4 with improved version from Stolee (with additional improvement of my own) * Clarifications, wording fixes, and more about linear perf in commit message to patch 5 * More detail in patch 5 about why "whackamole" particularly makes me uneasy for dir.c Stuff clearly still missing from v2: * I didn't make the DIR_KEEP_UNTRACKED_CONTENTS changes I mentioned in https://lore.kernel.org/git/CABPp-BEQ5s=+6Rnb-A+pdEaoPXxfo-hMSegSe1eai=RE74A3Og@xxxxxxxxxxxxxx/ which I think would make the code cleaner & clearer. * I still have not addressed the untracked-cache issue mentioned in the last two commits. I looked at it very, very briefly, but I was really close to doing something similar to [1] and just dropping my patches in this series before even submitting them on Wednesday[2] (dir.c is a really unpleasant to work in). Other than wording fixes, I just need a week or two off from this area before I dig further, unless someone else wants to dive in and needs me to provide pointers on what I've done so far. [1] https://lore.kernel.org/git/pull.676.v3.git.git.1576571586.gitgitgadget@xxxxxxxxx/ [2] I was inches from doing that Wednesday morning. I had done several rounds of "Okay, I fixed all the tests that broke with my changes last time, let's re-run the testsuite -- wow, four totally different tests from testfiles I hadn't looked at before now break", and decided that I would only do one more before dropping it an maybe coming back in a month or two. That time happened to work, minus the untracked-cache, so I decided to put it in front of other eyeballs. Derrick Stolee (1): dir: refactor treat_directory to clarify control flow Elijah Newren (5): dir: consolidate treat_path() and treat_one_path() dir: fix broken comment dir: fix confusion based on variable tense dir: replace exponential algorithm with a linear one t7063: blindly accept diffs dir.c | 331 +++++++++++++++++------------- t/t7063-status-untracked-cache.sh | 50 ++--- 2 files changed, 208 insertions(+), 173 deletions(-) base-commit: 0cbb60574e741e8255ba457606c4c90898cfc755 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-700%2Fnewren%2Ffill-directory-exponential-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-700/newren/fill-directory-exponential-v2 Pull-Request: https://github.com/git/git/pull/700 Range-diff vs v1: 1: 27bc135796 = 1: 27bc135796 dir: consolidate treat_path() and treat_one_path() 2: 2ceb64ae61 = 2: 2ceb64ae61 dir: fix broken comment 3: e6d21228d1 = 3: e6d21228d1 dir: fix confusion based on variable tense 4: 3b2ec5eaf6 ! 4: f73f0d66d1 dir: move setting of nested_repo next to its actual usage @@ -1,26 +1,73 @@ -Author: Elijah Newren <newren@xxxxxxxxx> +Author: Derrick Stolee <dstolee@xxxxxxxxxxxxx> - dir: move setting of nested_repo next to its actual usage + dir: refactor treat_directory to clarify control flow + The logic in treat_directory() is handled by a multi-case + switch statement, but this switch is very asymmetrical, as + the first two cases are simple but the third is more + complicated than the rest of the method. In fact, the third + case includes a "break" statement that leads to the block + of code outside the switch statement. That is the only way + to reach that block, as the switch handles all possible + values from directory_exists_in_index(); + + Extract the switch statement into a series of "if" statements. + This simplifies the trivial cases, while clarifying how to + reach the "show_other_directories" case. This is particularly + important as the "show_other_directories" case will expand + in a later change. + + Helped-by: Elijah Newren <newren@xxxxxxxxx> + Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> Signed-off-by: Elijah Newren <newren@xxxxxxxxx> diff --git a/dir.c b/dir.c --- a/dir.c +++ b/dir.c @@ - const char *dirname, int len, int baselen, int excluded, const struct pathspec *pathspec) { -- int nested_repo = 0; -+ int nested_repo; - + int nested_repo = 0; +- /* The "len-1" is to strip the final '/' */ - switch (directory_exists_in_index(istate, dirname, len-1)) { -@@ +- switch (directory_exists_in_index(istate, dirname, len-1)) { +- case index_directory: +- return path_recurse; ++ enum exist_status status = directory_exists_in_index(istate, dirname, len-1); + +- case index_gitdir: ++ if (status == index_directory) ++ return path_recurse; ++ if (status == index_gitdir) return path_none; ++ if (status != index_nonexistent) ++ BUG("Unhandled value for directory_exists_in_index: %d\n", status); + +- case index_nonexistent: +- if ((dir->flags & DIR_SKIP_NESTED_GIT) || +- !(dir->flags & DIR_NO_GITLINKS)) { +- struct strbuf sb = STRBUF_INIT; +- strbuf_addstr(&sb, dirname); +- nested_repo = is_nonbare_repository_dir(&sb); +- strbuf_release(&sb); +- } +- if (nested_repo) +- return ((dir->flags & DIR_SKIP_NESTED_GIT) ? path_none : +- (excluded ? path_excluded : path_untracked)); ++ if ((dir->flags & DIR_SKIP_NESTED_GIT) || ++ !(dir->flags & DIR_NO_GITLINKS)) { ++ struct strbuf sb = STRBUF_INIT; ++ strbuf_addstr(&sb, dirname); ++ nested_repo = is_nonbare_repository_dir(&sb); ++ strbuf_release(&sb); ++ } ++ if (nested_repo) ++ return ((dir->flags & DIR_SKIP_NESTED_GIT) ? path_none : ++ (excluded ? path_excluded : path_untracked)); - case index_nonexistent: -+ nested_repo = 0; - if ((dir->flags & DIR_SKIP_NESTED_GIT) || - !(dir->flags & DIR_NO_GITLINKS)) { - struct strbuf sb = STRBUF_INIT; +- if (dir->flags & DIR_SHOW_OTHER_DIRECTORIES) +- break; ++ if (!(dir->flags & DIR_SHOW_OTHER_DIRECTORIES)) { + if (excluded && + (dir->flags & DIR_SHOW_IGNORED_TOO) && + (dir->flags & DIR_SHOW_IGNORED_TOO_MODE_MATCHING)) { 5: 40b378e7ad ! 5: d3136ef52f dir: replace exponential algorithm with a linear one @@ -20,26 +20,29 @@ and report all the ignored entries and then report the directory as untracked -- UNLESS all the entries under the directory are ignored, in which case we don't print any of the entries under the - directory and just report the directory itself as ignored. + directory and just report the directory itself as ignored. (Note + that although this forces us to walk all untracked files underneath + the directory as well, we strip them from the output, except for + users like 'git clean' who also set DIR_KEEP_TRACKED_CONTENTS.) * For 'git clean', we may need to recurse into a directory that doesn't match any specified pathspecs, if it's possible that there is an entry underneath the directory that can match one of the pathspecs. In such a case, we need to be careful to omit the - directory itself from the list of paths (see e.g. commit - 404ebceda01c ("dir: also check directories for matching pathspecs", - 2019-09-17)) + directory itself from the list of paths (see commit 404ebceda01c + ("dir: also check directories for matching pathspecs", 2019-09-17)) Part of the tension noted above is that the treatment of a directory can - changed based on the files within it, and based on the various settings + change based on the files within it, and based on the various settings in dir->flags. Trying to keep this in mind while reading over the code, - it is easy to (accidentally?) think in terms of "treat_directory() tells - us what to do with a directory, and read_directory_recursive() is the - thing that recurses". Since we need to look into a directory to know - how to treat it, though, it was quite easy to decide to recurse into the + it is easy to think in terms of "treat_directory() tells us what to do + with a directory, and read_directory_recursive() is the thing that + recurses". Since we need to look into a directory to know how to treat + it, though, it is quite easy to decide to (also) recurse into the directory from treat_directory() by adding a read_directory_recursive() - call. Adding such a call is actually fine, IF we didn't also cause - read_directory_recursive() to recurse into the same directory again. + call. Adding such a call is actually fine, IF we make sure that + read_directory_recursive() does not also recurse into that same + directory. Unfortunately, commit df5bcdf83aeb ("dir: recurse into untracked dirs for ignored files", 2017-05-18), added exactly such a case to the code, @@ -58,10 +61,12 @@ Since dir.c is somewhat complex, extra cruft built up around this over time. While trying to unravel it, I noticed several instances where the first call to read_directory_recursive() would return e.g. - path_untracked for a some directory and a later one would return e.g. - path_none, and the code relied on the side-effect of the first adding - untracked entries to dir->entries in order to get the correct output - despite the supposed override in return value by the later call. + path_untracked for some directory and a later one would return e.g. + path_none, despite the fact that the directory clearly should have been + considered untracked. The code happened to work due to the side-effect + from the first invocation of adding untracked entries to dir->entries; + this allowed it to get the correct output despite the supposed override + in return value by the later call. I am somewhat concerned that there are still bugs and maybe even testcases with the wrong expectation. I have tried to carefully @@ -74,9 +79,40 @@ but the rules of existing behavior had so many special cases that I had a hard time coming up with some overarching rules about what correct behavior is for all cases, forcing me to hope that the regression tests - are correct and sufficient. (I'll note that this turmoil makes working - with dir.c extremely unpleasant for me; I keep hoping it'll get better, - but it never seems to.) + are correct and sufficient. Such a hope seems likely to be ill-founded, + given my experience with dir.c-related testcases in the last few months: + + Examples where the documentation was hard to parse or even just wrong: + * 3aca58045f4f (git-clean.txt: do not claim we will delete files with + -n/--dry-run, 2019-09-17) + * 09487f2cbad3 (clean: avoid removing untracked files in a nested git + repository, 2019-09-17) + * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17) + Examples where testcases were declared wrong and changed: + * 09487f2cbad3 (clean: avoid removing untracked files in a nested git + repository, 2019-09-17) + * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17) + * a2b13367fe55 (Revert "dir.c: make 'git-status --ignored' work within + leading directories", 2019-12-10) + Examples where testcases were clearly inadequate: + * 502c386ff944 (t7300-clean: demonstrate deleting nested repo with an + ignored file breakage, 2019-08-25) + * 7541cc530239 (t7300: add testcases showing failure to clean specified + pathspecs, 2019-09-17) + * a5e916c7453b (dir: fix off-by-one error in match_pathspec_item, + 2019-09-17) + * 404ebceda01c (dir: also check directories for matching pathspecs, + 2019-09-17) + * 09487f2cbad3 (clean: avoid removing untracked files in a nested git + repository, 2019-09-17) + * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17) + * 452efd11fbf6 (t3011: demonstrate directory traversal failures, + 2019-12-10) + * b9670c1f5e6b (dir: fix checks on common prefix directory, 2019-12-19) + Examples where "correct behavior" was unclear to everyone: + https://lore.kernel.org/git/20190905154735.29784-1-newren@xxxxxxxxx/ + Other commits of note: + * 902b90cf42bc (clean: fix theoretical path corruption, 2019-09-17) However, on the positive side, it does make the code much faster. For the following simple shell loop in an empty repository: @@ -111,27 +147,14 @@ 24: 274.45 25: 551.15 - After this fix, those drop to: - - 10: 0.00 - 11: 0.00 - 12: 0.00 - 13: 0.00 - 14: 0.00 - 15: 0.00 - 16: 0.00 - 17: 0.00 - 18: 0.00 - 19: 0.00 - 20: 0.00 - 21: 0.00 - 22: 0.00 - 23: 0.00 - 24: 0.00 - 25: 0.00 + For the above run, using strace I can look for the number of untracked + directories opened and can verify that it matches the expected + 2^($depth+1)-2 (the sum of 2^1 + 2^2 + 2^3 + ... + 2^$depth). - In fact, it isn't until a depth of 190 nested directories that it - sometimes starts reporting a time of 0.01 seconds and doesn't + After this fix, with strace I can verify that the number of untracked + directories that are opened drops to just $depth, and the timings all + drop to 0.00. In fact, it isn't until a depth of 190 nested directories + that it sometimes starts reporting a time of 0.01 seconds and doesn't consistently report 0.01 seconds until there are 240 nested directories. The previous code would have taken 17.55 * 2^220 / (60*60*24*365) = 9.4 * 10^59 YEARS @@ -152,17 +175,17 @@ const char *dirname, int len, int baselen, int excluded, const struct pathspec *pathspec) { -- int nested_repo; +- int nested_repo = 0; + /* + * WARNING: From this function, you can return path_recurse or you + * can call read_directory_recursive() (or neither), but + * you CAN'T DO BOTH. + */ + enum path_treatment state; -+ int nested_repo, old_ignored_nr, stop_early; - ++ int nested_repo = 0, old_ignored_nr, stop_early; /* The "len-1" is to strip the final '/' */ - switch (directory_exists_in_index(istate, dirname, len-1)) { + enum exist_status status = directory_exists_in_index(istate, dirname, len-1); + @@ /* This is the "show_other_directories" case */ @@ -292,9 +315,9 @@ - * updates in treat_leading_path(). See the commit message for the - * commit adding this warning as well as the commit preceding it - * for details. -+ * WARNING: Do NOT call recurse unless path_recurse is returned -+ * from treat_path(). Recursing on any other return value -+ * results in exponential slowdown. ++ * WARNING: Do NOT recurse unless path_recurse is returned from ++ * treat_path(). Recursing on any other return value ++ * can result in exponential slowdown. */ - struct cached_dir cdir; 6: 7fb8063541 = 6: 9a3f20656e t7063: blindly accept diffs -- gitgitgadget