Re: [PATCH v2 0/7] subtree: Fix handling of complex history

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Wed, 7 Oct 2020 21:46:36 +0200 (CEST)

Hi Tom,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> Fixes several issues that could occur when running subtree split on large
> repos with more complex history.
>
>  1. A merge commit could bypass the known start point of the subtree, which
>     would cause the entire history to be processed recursively, leading to a
>     stack overflow / segfault after reading a few hundred commits. Older
>     commits are now explicitly recorded as irrelevant so that the recursive
>     process can terminate on any mainline commit rather than only on subtree
>     joins and initial commits.
>
>
>  2. It is possible for a repo to contain subtrees that lack the metadata
>     that is usually present in add/join commit messages (git-svn at least
>     can produce such a structure). The new use/ignore/map commands allow the
>     user to provide that information for any problematic commits.
>
>
>  3. A mainline commit that does not contain the subtree folder could be
>     erroneously identified as a subtree commit, which would add the entire
>     mainline history to the subtree. Commits will now only be used as is if
>     all their parents are already identified as subtree commits. While the
>     new code can still be tripped up by unusual folder structures, the
>     completely unambiguous solution turned out to involve a significant
>     performance penalty, and the new ignore / use commands provide a
>     workaround for that scenario.
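
For what it's worth, this is how I read point 1, as a rough sketch in
Python. It is not the actual shell code in `git subtree`; the function
name, the `parents` mapping and the `irrelevant` set are all made up for
illustration. The idea is that the walk stops at any commit already
recorded as irrelevant, and an explicit stack avoids deep recursion:

```python
def collect_relevant(parents, start, irrelevant):
    """Walk history from `start`, never descending into `irrelevant` commits.

    `parents` maps a commit id to the list of its parent ids (hypothetical
    stand-in for what `git rev-list --parents` would report).
    """
    seen = set()
    order = []
    stack = [start]  # explicit stack instead of recursive calls
    while stack:
        commit = stack.pop()
        if commit in seen or commit in irrelevant:
            # Already handled, or known to predate the subtree: stop here.
            continue
        seen.add(commit)
        order.append(commit)
        stack.extend(parents.get(commit, []))
    return order
```

With a merge whose mainline parent leads back into old history, only the
commits newer than the recorded cut-off are visited.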

I gave this as thorough a review as I can (which is not saying too much,
as I am not exactly familiar with `git subtree`'s inner workings).

Hopefully some of my comments and suggestions are helpful.

At some stage, I think it would make a lot of sense to go ahead and turn
this into a built-in Git command, implemented in C, with a more robust
file system layout for its cache. The problems I pointed out with the
current implementation detail, a flat directory with a potentially insane
number of files in it, make that all the more worthwhile.
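
To illustrate what I mean by a more robust layout: Git's own loose object
store fans out by the first two hex digits of the object ID, and the
cache could do something similar. A minimal sketch in Python (the helper
names and the `fanout` parameter are hypothetical, and this is not how
`git subtree` currently stores its cache):

```python
import os

def cache_path(cache_dir, commit_id, fanout=2):
    """Map a commit id to <cache_dir>/<first fanout chars>/<rest>."""
    if len(commit_id) <= fanout:
        raise ValueError("commit id too short to fan out")
    return os.path.join(cache_dir, commit_id[:fanout], commit_id[fanout:])

def cache_set(cache_dir, old_id, new_id):
    """Record that mainline commit `old_id` maps to subtree commit `new_id`."""
    path = cache_path(cache_dir, old_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(new_id + "\n")

def cache_get(cache_dir, old_id):
    """Return the cached mapping for `old_id`, or None if there is none."""
    try:
        with open(cache_path(cache_dir, old_id)) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None
```

With two hex digits of fan-out, the entries are spread over at most 256
subdirectories instead of piling up in a single one.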

Ciao,
Dscho