[PATCH v2 0/6] Avoid multiple recursive calls for same path in read_directory_recursive()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



This patch series builds on en/fill-directory-fixes-more. This series should
be considered an RFC because of the untracked-cache changes (see the last
two commits), for which I'm hoping to get an untracked-cache expert to
comment. This series does provide some modest speedups (see second to last
commit message), and should allow 'git status --ignored' to complete in a
more reasonable timeframe for Martin Melka (see 
https://lore.kernel.org/git/CANt4O2L_DZnMqVxZzTBMvr=BTWqB6L0uyORkoN_yMHLmUX7yHw@xxxxxxxxxxxxxx/
)

Changes since v1:

 * Replaced patch 4 with improved version from Stolee (with additional
   improvement of my own)
 * Clarifications, wording fixes, and more about linear perf in commit
   message to patch 5
 * More detail in patch 5 about why "whackamole" particularly makes me
   uneasy for dir.c

Stuff clearly still missing from v2:

 * I didn't make the DIR_KEEP_UNTRACKED_CONTENTS changes I mentioned in 
   https://lore.kernel.org/git/CABPp-BEQ5s=+6Rnb-A+pdEaoPXxfo-hMSegSe1eai=RE74A3Og@xxxxxxxxxxxxxx/ 
   which I think would make the code cleaner & clearer.
 * I still have not addressed the untracked-cache issue mentioned in the
   last two commits. I looked at it very, very briefly, but I was really
   close to doing something similar to [1] and just dropping my patches in
   this series before even submitting them on Wednesday[2] (dir.c is a
   really unpleasant to work in). Other than wording fixes, I just need a
   week or two off from this area before I dig further, unless someone else
   wants to dive in and needs me to provide pointers on what I've done so
   far.

[1] 
https://lore.kernel.org/git/pull.676.v3.git.git.1576571586.gitgitgadget@xxxxxxxxx/
[2] I was inches from doing that Wednesday morning. I had done several
rounds of "Okay, I fixed all the tests that broke with my changes last time,
let's re-run the testsuite -- wow, four totally different tests from
testfiles I hadn't looked at before now break", and decided that I would
only do one more before dropping it an maybe coming back in a month or two.
That time happened to work, minus the untracked-cache, so I decided to put
it in front of other eyeballs.

Derrick Stolee (1):
  dir: refactor treat_directory to clarify control flow

Elijah Newren (5):
  dir: consolidate treat_path() and treat_one_path()
  dir: fix broken comment
  dir: fix confusion based on variable tense
  dir: replace exponential algorithm with a linear one
  t7063: blindly accept diffs

 dir.c                             | 331 +++++++++++++++++-------------
 t/t7063-status-untracked-cache.sh |  50 ++---
 2 files changed, 208 insertions(+), 173 deletions(-)


base-commit: 0cbb60574e741e8255ba457606c4c90898cfc755
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-700%2Fnewren%2Ffill-directory-exponential-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-700/newren/fill-directory-exponential-v2
Pull-Request: https://github.com/git/git/pull/700

Range-diff vs v1:

 1:  27bc135796 = 1:  27bc135796 dir: consolidate treat_path() and treat_one_path()
 2:  2ceb64ae61 = 2:  2ceb64ae61 dir: fix broken comment
 3:  e6d21228d1 = 3:  e6d21228d1 dir: fix confusion based on variable tense
 4:  3b2ec5eaf6 ! 4:  f73f0d66d1 dir: move setting of nested_repo next to its actual usage
     @@ -1,26 +1,73 @@
     -Author: Elijah Newren <newren@xxxxxxxxx>
     +Author: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
      
     -    dir: move setting of nested_repo next to its actual usage
     +    dir: refactor treat_directory to clarify control flow
      
     +    The logic in treat_directory() is handled by a multi-case
     +    switch statement, but this switch is very asymmetrical, as
     +    the first two cases are simple but the third is more
     +    complicated than the rest of the method. In fact, the third
     +    case includes a "break" statement that leads to the block
     +    of code outside the switch statement. That is the only way
     +    to reach that block, as the switch handles all possible
     +    values from directory_exists_in_index();
     +
     +    Extract the switch statement into a series of "if" statements.
     +    This simplifies the trivial cases, while clarifying how to
     +    reach the "show_other_directories" case. This is particularly
     +    important as the "show_other_directories" case will expand
     +    in a later change.
     +
     +    Helped-by: Elijah Newren <newren@xxxxxxxxx>
     +    Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
          Signed-off-by: Elijah Newren <newren@xxxxxxxxx>
      
       diff --git a/dir.c b/dir.c
       --- a/dir.c
       +++ b/dir.c
      @@
     - 	const char *dirname, int len, int baselen, int excluded,
       	const struct pathspec *pathspec)
       {
     --	int nested_repo = 0;
     -+	int nested_repo;
     - 
     + 	int nested_repo = 0;
     +-
       	/* The "len-1" is to strip the final '/' */
     - 	switch (directory_exists_in_index(istate, dirname, len-1)) {
     -@@
     +-	switch (directory_exists_in_index(istate, dirname, len-1)) {
     +-	case index_directory:
     +-		return path_recurse;
     ++	enum exist_status status = directory_exists_in_index(istate, dirname, len-1);
     + 
     +-	case index_gitdir:
     ++	if (status == index_directory)
     ++		return path_recurse;
     ++	if (status == index_gitdir)
       		return path_none;
     ++	if (status != index_nonexistent)
     ++		BUG("Unhandled value for directory_exists_in_index: %d\n", status);
     + 
     +-	case index_nonexistent:
     +-		if ((dir->flags & DIR_SKIP_NESTED_GIT) ||
     +-		    !(dir->flags & DIR_NO_GITLINKS)) {
     +-			struct strbuf sb = STRBUF_INIT;
     +-			strbuf_addstr(&sb, dirname);
     +-			nested_repo = is_nonbare_repository_dir(&sb);
     +-			strbuf_release(&sb);
     +-		}
     +-		if (nested_repo)
     +-			return ((dir->flags & DIR_SKIP_NESTED_GIT) ? path_none :
     +-				(excluded ? path_excluded : path_untracked));
     ++	if ((dir->flags & DIR_SKIP_NESTED_GIT) ||
     ++		!(dir->flags & DIR_NO_GITLINKS)) {
     ++		struct strbuf sb = STRBUF_INIT;
     ++		strbuf_addstr(&sb, dirname);
     ++		nested_repo = is_nonbare_repository_dir(&sb);
     ++		strbuf_release(&sb);
     ++	}
     ++	if (nested_repo)
     ++		return ((dir->flags & DIR_SKIP_NESTED_GIT) ? path_none :
     ++			(excluded ? path_excluded : path_untracked));
       
     - 	case index_nonexistent:
     -+		nested_repo = 0;
     - 		if ((dir->flags & DIR_SKIP_NESTED_GIT) ||
     - 		    !(dir->flags & DIR_NO_GITLINKS)) {
     - 			struct strbuf sb = STRBUF_INIT;
     +-		if (dir->flags & DIR_SHOW_OTHER_DIRECTORIES)
     +-			break;
     ++	if (!(dir->flags & DIR_SHOW_OTHER_DIRECTORIES)) {
     + 		if (excluded &&
     + 		    (dir->flags & DIR_SHOW_IGNORED_TOO) &&
     + 		    (dir->flags & DIR_SHOW_IGNORED_TOO_MODE_MATCHING)) {
 5:  40b378e7ad ! 5:  d3136ef52f dir: replace exponential algorithm with a linear one
     @@ -20,26 +20,29 @@
               and report all the ignored entries and then report the directory as
               untracked -- UNLESS all the entries under the directory are
               ignored, in which case we don't print any of the entries under the
     -         directory and just report the directory itself as ignored.
     +         directory and just report the directory itself as ignored.  (Note
     +         that although this forces us to walk all untracked files underneath
     +         the directory as well, we strip them from the output, except for
     +         users like 'git clean' who also set DIR_KEEP_TRACKED_CONTENTS.)
      
             * For 'git clean', we may need to recurse into a directory that
               doesn't match any specified pathspecs, if it's possible that there
               is an entry underneath the directory that can match one of the
               pathspecs.  In such a case, we need to be careful to omit the
     -         directory itself from the list of paths (see e.g. commit
     -         404ebceda01c ("dir: also check directories for matching pathspecs",
     -         2019-09-17))
     +         directory itself from the list of paths (see commit 404ebceda01c
     +         ("dir: also check directories for matching pathspecs", 2019-09-17))
      
          Part of the tension noted above is that the treatment of a directory can
     -    changed based on the files within it, and based on the various settings
     +    change based on the files within it, and based on the various settings
          in dir->flags.  Trying to keep this in mind while reading over the code,
     -    it is easy to (accidentally?) think in terms of "treat_directory() tells
     -    us what to do with a directory, and read_directory_recursive() is the
     -    thing that recurses".  Since we need to look into a directory to know
     -    how to treat it, though, it was quite easy to decide to recurse into the
     +    it is easy to think in terms of "treat_directory() tells us what to do
     +    with a directory, and read_directory_recursive() is the thing that
     +    recurses".  Since we need to look into a directory to know how to treat
     +    it, though, it is quite easy to decide to (also) recurse into the
          directory from treat_directory() by adding a read_directory_recursive()
     -    call.  Adding such a call is actually fine, IF we didn't also cause
     -    read_directory_recursive() to recurse into the same directory again.
     +    call.  Adding such a call is actually fine, IF we make sure that
     +    read_directory_recursive() does not also recurse into that same
     +    directory.
      
          Unfortunately, commit df5bcdf83aeb ("dir: recurse into untracked dirs
          for ignored files", 2017-05-18), added exactly such a case to the code,
     @@ -58,10 +61,12 @@
          Since dir.c is somewhat complex, extra cruft built up around this over
          time.  While trying to unravel it, I noticed several instances where the
          first call to read_directory_recursive() would return e.g.
     -    path_untracked for a some directory and a later one would return e.g.
     -    path_none, and the code relied on the side-effect of the first adding
     -    untracked entries to dir->entries in order to get the correct output
     -    despite the supposed override in return value by the later call.
     +    path_untracked for some directory and a later one would return e.g.
     +    path_none, despite the fact that the directory clearly should have been
     +    considered untracked.  The code happened to work due to the side-effect
     +    from the first invocation of adding untracked entries to dir->entries;
     +    this allowed it to get the correct output despite the supposed override
     +    in return value by the later call.
      
          I am somewhat concerned that there are still bugs and maybe even
          testcases with the wrong expectation.  I have tried to carefully
     @@ -74,9 +79,40 @@
          but the rules of existing behavior had so many special cases that I had
          a hard time coming up with some overarching rules about what correct
          behavior is for all cases, forcing me to hope that the regression tests
     -    are correct and sufficient.  (I'll note that this turmoil makes working
     -    with dir.c extremely unpleasant for me; I keep hoping it'll get better,
     -    but it never seems to.)
     +    are correct and sufficient.  Such a hope seems likely to be ill-founded,
     +    given my experience with dir.c-related testcases in the last few months:
     +
     +      Examples where the documentation was hard to parse or even just wrong:
     +       * 3aca58045f4f (git-clean.txt: do not claim we will delete files with
     +                       -n/--dry-run, 2019-09-17)
     +       * 09487f2cbad3 (clean: avoid removing untracked files in a nested git
     +                       repository, 2019-09-17)
     +       * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17)
     +      Examples where testcases were declared wrong and changed:
     +       * 09487f2cbad3 (clean: avoid removing untracked files in a nested git
     +                       repository, 2019-09-17)
     +       * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17)
     +       * a2b13367fe55 (Revert "dir.c: make 'git-status --ignored' work within
     +                       leading directories", 2019-12-10)
     +      Examples where testcases were clearly inadequate:
     +       * 502c386ff944 (t7300-clean: demonstrate deleting nested repo with an
     +                       ignored file breakage, 2019-08-25)
     +       * 7541cc530239 (t7300: add testcases showing failure to clean specified
     +                       pathspecs, 2019-09-17)
     +       * a5e916c7453b (dir: fix off-by-one error in match_pathspec_item,
     +                       2019-09-17)
     +       * 404ebceda01c (dir: also check directories for matching pathspecs,
     +                       2019-09-17)
     +       * 09487f2cbad3 (clean: avoid removing untracked files in a nested git
     +                       repository, 2019-09-17)
     +       * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17)
     +       * 452efd11fbf6 (t3011: demonstrate directory traversal failures,
     +                       2019-12-10)
     +       * b9670c1f5e6b (dir: fix checks on common prefix directory, 2019-12-19)
     +      Examples where "correct behavior" was unclear to everyone:
     +        https://lore.kernel.org/git/20190905154735.29784-1-newren@xxxxxxxxx/
     +      Other commits of note:
     +       * 902b90cf42bc (clean: fix theoretical path corruption, 2019-09-17)
      
          However, on the positive side, it does make the code much faster.  For
          the following simple shell loop in an empty repository:
     @@ -111,27 +147,14 @@
              24: 274.45
              25: 551.15
      
     -    After this fix, those drop to:
     -
     -        10: 0.00
     -        11: 0.00
     -        12: 0.00
     -        13: 0.00
     -        14: 0.00
     -        15: 0.00
     -        16: 0.00
     -        17: 0.00
     -        18: 0.00
     -        19: 0.00
     -        20: 0.00
     -        21: 0.00
     -        22: 0.00
     -        23: 0.00
     -        24: 0.00
     -        25: 0.00
     +    For the above run, using strace I can look for the number of untracked
     +    directories opened and can verify that it matches the expected
     +    2^($depth+1)-2 (the sum of 2^1 + 2^2 + 2^3 + ... + 2^$depth).
      
     -    In fact, it isn't until a depth of 190 nested directories that it
     -    sometimes starts reporting a time of 0.01 seconds and doesn't
     +    After this fix, with strace I can verify that the number of untracked
     +    directories that are opened drops to just $depth, and the timings all
     +    drop to 0.00.  In fact, it isn't until a depth of 190 nested directories
     +    that it sometimes starts reporting a time of 0.01 seconds and doesn't
          consistently report 0.01 seconds until there are 240 nested directories.
          The previous code would have taken
            17.55 * 2^220 / (60*60*24*365) = 9.4 * 10^59 YEARS
     @@ -152,17 +175,17 @@
       	const char *dirname, int len, int baselen, int excluded,
       	const struct pathspec *pathspec)
       {
     --	int nested_repo;
     +-	int nested_repo = 0;
      +	/*
      +	 * WARNING: From this function, you can return path_recurse or you
      +	 *          can call read_directory_recursive() (or neither), but
      +	 *          you CAN'T DO BOTH.
      +	 */
      +	enum path_treatment state;
     -+	int nested_repo, old_ignored_nr, stop_early;
     - 
     ++	int nested_repo = 0, old_ignored_nr, stop_early;
       	/* The "len-1" is to strip the final '/' */
     - 	switch (directory_exists_in_index(istate, dirname, len-1)) {
     + 	enum exist_status status = directory_exists_in_index(istate, dirname, len-1);
     + 
      @@
       
       	/* This is the "show_other_directories" case */
     @@ -292,9 +315,9 @@
      -	 * updates in treat_leading_path().  See the commit message for the
      -	 * commit adding this warning as well as the commit preceding it
      -	 * for details.
     -+	 * WARNING: Do NOT call recurse unless path_recurse is returned
     -+	 *          from treat_path().  Recursing on any other return value
     -+	 *          results in exponential slowdown.
     ++	 * WARNING: Do NOT recurse unless path_recurse is returned from
     ++	 *          treat_path().  Recursing on any other return value
     ++	 *          can result in exponential slowdown.
       	 */
      -
       	struct cached_dir cdir;
 6:  7fb8063541 = 6:  9a3f20656e t7063: blindly accept diffs

-- 
gitgitgadget



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux