On Thu, Oct 26, 2023 at 9:18 PM Zach FettersMoore via GitGitGadget <gitgitgadget@xxxxxxxxx> wrote: > > From: Zach FettersMoore <zach.fetters@xxxxxxxxxxxxxxxxx> > > When there are multiple subtrees present in a repository and they are > all using 'git subtree split', the 'split' command can take a > significant (and constantly growing) amount of time to run even when > using the '--rejoin' flag. This is due to the fact that when processing > commits to determine the last known split to start from when looking > for changes, if there has been a split/merge done from another subtree > there will be 2 split commits, one mainline and one subtree, for the > second subtree that are part of the processing. The non-mainline > subtree split commit will cause the processing to always need to search > the entire history of the given subtree as part of its processing even > though those commits are totally irrelevant to the current subtree > split being run. Thanks for your continued work on this! I am not familiar with git subtree so I might miss obvious things. On the other hand, my comments might help increase a bit the number of people who could review this patch. > In the diagram below, 'M' represents the mainline repo branch, 'A' > represents one subtree, and 'B' represents another. M3 and B1 represent > a split commit for subtree B that was created from commit M4. M2 and A1 > represent a split commit made from subtree A that was also created > based on changes back to and including M4. M1 represents new changes to > the repo, in this scenario if you try to run a 'git subtree split > --rejoin' for subtree B, commits M1, M2, and A1, will be included in > the processing of changes for the new split commit since the last > split/rejoin for subtree B was at M3. The issue is that by having A1 > included in this processing the command ends up needing to processing > every commit down tree A even though none of that is needed or relevant > to the current command and result. > > M1 > | \ \ > M2 | | > | A1 | > M3 | | > | | B1 > M4 | | About the above, Junio already commented the following: -> The above paragraph explains which different things you drew in the -> diagram are representing, but it is not clear how they relate to -> each other. Do they for example depict parent-child commit -> relationship? What are the wide gaps between these three tracks and -> what are the short angled lines leaning to the left near the tip? -> Is the time/topology flowing from bottom to top? and it doesn't look like you have addressed that comment. When you say "M3 and B1 represent a split commit for subtree B that was created from commit M4." I am not sure what it means exactly. Could you give example commands that could have created the M3 and B1 commits? > So this commit makes a change to the processing of commits for the split > command in order to ignore non-mainline commits from other subtrees such > as A1 in the diagram by adding a new function > 'should_ignore_subtree_commit' which is called during > 'process_split_commit'. This allows the split/rejoin processing to still > function as expected but removes all of the unnecessary processing that > takes place currently which greatly inflates the processing time. Could you tell a bit more what kind of processing time reduction is or would be possible on what kind of repo? Have you benchmark-ed or just timed this somehow on one of your repos or better on an open source repo (so that we could reproduce if we wanted)? > Added a test to validate that the proposed fix > solves the issue. > > The test accomplishes this by checking the output > of the split command to ensure the output from > the progress of 'process_split_commit' function > that represents the 'extracount' of commits > processed does not increment. Does not increment compared to what? > This was tested against the original functionality > to show the test failed, and then with this fix > to show the test passes. > > This illustrated that when using multiple subtrees, > A and B, when doing a split on subtree B, the > processing does not traverse the entire history > of subtree A which is unnecessary and would cause > the 'extracount' of processed commits to climb > based on the number of commits in the history of > subtree A. Does this mean that the test checks that the extracount is the same when subtree A exists as when it doesn't exist? [...] > contrib/subtree/git-subtree.sh | 29 ++++++++++++++++++++- > contrib/subtree/t/t7900-subtree.sh | 42 ++++++++++++++++++++++++++++++ > 2 files changed, 70 insertions(+), 1 deletion(-) > > diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh > index e0c5d3b0de6..e69991a9d80 100755 > --- a/contrib/subtree/git-subtree.sh > +++ b/contrib/subtree/git-subtree.sh > @@ -778,6 +778,20 @@ ensure_valid_ref_format () { > die "fatal: '$1' does not look like a ref" > } > > +# Usage: check if a commit from another subtree should be > +# ignored from processing for splits > +should_ignore_subtree_split_commit () { Maybe adding: assert test $# = 1 local rev="$1" here, and using $rev instead of $1 in this function could make things a bit clearer and similar to what is done in other functions. > + if test -n "$(git log -1 --grep="git-subtree-dir:" $1)" > + then > + if test -z "$(git log -1 --grep="git-subtree-mainline:" $1)" && > + test -z "$(git log -1 --grep="git-subtree-dir: $arg_prefix$" $1)" > + then > + return 0 > + fi > + fi > + return 1 > +} The above doesn't seem to be properly indented. We use tabs not spaces. > # Usage: process_split_commit REV PARENTS > process_split_commit () { > assert test $# = 2 > @@ -963,7 +977,20 @@ cmd_split () { > eval "$grl" | > while read rev parents > do > - process_split_commit "$rev" "$parents" > + if should_ignore_subtree_split_commit "$rev" > + then > + continue > + fi > + parsedParents='' It seems to me that we name variables "parsed_parents" (or sometimes "parsedparents") rather than "parsedParents". > + for parent in $parents > + do > + should_ignore_subtree_split_commit "$parent" > + if test $? -eq 1 I think the 2 lines above could be replaced by: + if ! should_ignore_subtree_split_commit "$parent" > + then > + parsedParents+="$parent " It doesn't seem to me that we use "+=" much in our shell scripts. https://www.shellcheck.net/ emits the following: (warning): In POSIX sh, += is undefined. so I guess we don't use it because it's not available in some usual shells. (I haven't checked the script with https://www.shellcheck.net/ before and after your patch, but it might help avoid bash-isms and such issues.) > + fi > + done > + process_split_commit "$rev" "$parsedParents" > done || exit $? It looks like we use "exit $?" a lot in git-subtree.sh while we use just "exit" most often elsewhere. Not sure why. > latest_new=$(cache_get latest_new) || exit $? > diff --git a/contrib/subtree/t/t7900-subtree.sh b/contrib/subtree/t/t7900-subtree.sh > index 49a21dd7c9c..87d59afd761 100755 > --- a/contrib/subtree/t/t7900-subtree.sh > +++ b/contrib/subtree/t/t7900-subtree.sh > @@ -385,6 +385,48 @@ test_expect_success 'split sub dir/ with --rejoin' ' > ) > ' > > +test_expect_success 'split with multiple subtrees' ' > + subtree_test_create_repo "$test_count" && > + subtree_test_create_repo "$test_count/subA" && > + subtree_test_create_repo "$test_count/subB" && > + test_create_commit "$test_count" main1 && > + test_create_commit "$test_count/subA" subA1 && > + test_create_commit "$test_count/subA" subA2 && > + test_create_commit "$test_count/subA" subA3 && > + test_create_commit "$test_count/subB" subB1 && > + ( > + cd "$test_count" && > + git fetch ./subA HEAD && > + git subtree add --prefix=subADir FETCH_HEAD > + ) && > + ( > + cd "$test_count" && > + git fetch ./subB HEAD && > + git subtree add --prefix=subBDir FETCH_HEAD > + ) && > + test_create_commit "$test_count" subADir/main-subA1 && > + test_create_commit "$test_count" subBDir/main-subB1 && > + ( > + cd "$test_count" && > + git subtree split --prefix=subADir --squash --rejoin -m "Sub A Split 1" > + ) && Not sure why there are so many sub-shells used, and why the -C option is not used instead to tell Git to work in a subdirectory. I guess you copied what most existing (old) tests in this test script do. For example perhaps the 4 line above could be replaced by just: + git -C "$test_count" subtree split --prefix=subADir --squash --rejoin -m "Sub A Split 1" && > + ( > + cd "$test_count" && > + git subtree split --prefix=subBDir --squash --rejoin -m "Sub B Split 1" > + ) && > + test_create_commit "$test_count" subADir/main-subA2 && > + test_create_commit "$test_count" subBDir/main-subB2 && > + ( > + cd "$test_count" && > + git subtree split --prefix=subADir --squash --rejoin -m "Sub A Split 2" > + ) && > + ( > + cd "$test_count" && > + test "$(git subtree split --prefix=subBDir --squash --rejoin \ > + -d -m "Sub B Split 1" 2>&1 | grep -w "\[1\]")" = "" > + ) > +' It's not clear to me what the test is doing. Maybe you could split it into 2 tests. Perhaps one setting up a repo with multiple subtrees and one checking that a new split ignores other subtree split commits. Perhaps adding a few comments would help too. Best, Christian.