Re: git-subtree split misbehaviour with a commit having empty ls-tree for the specified subdir

Tom Clarkson <tqclarkson@xxxxxxxxxx> · Thu, 18 Jun 2020 11:13:02 +1000

> On 18 Jun 2020, at 12:46 am, Ed Maste <emaste@xxxxxxxxxxx> wrote:
> 
> On Fri, 20 Dec 2019 at 10:56, Ed Maste <emaste@xxxxxxxxxxx> wrote:
>> 
>> On Wed, 18 Dec 2019 at 19:57, Tom Clarkson <tqclarkson@xxxxxxxxxx> wrote:
>>> 
>>>> Overall I think your proposed algorithm is reasonable (even though I
>>>> think it won't address some of the cases in our repo). Will your
>>>> algorithm allow us to pass $dir to git rev-list, for the initial
>>>> split?
>>> 
>>> Is this just for performance reasons? As I understand it that was left out because it would exclude relevant commits on an existing subtree, but it could make sense as an optimization for the first split of a large repo.
>> 
>> Yes, it's for performance reasons on a first split that I'd like to
>> see it. On the FreeBSD repo the difference is some 40 minutes vs. a
>> few seconds.
> 
> Following up on this old thread, I plan to revisit the optimization,
> implementing something on top of your work in
> https://github.com/gitgitgadget/git/pull/493. I might look at adding a
> --initial flag to subtree split, having it essentially auto-detect a
> revision to use as the value for --onto. For the common case of an
> initial merge commit with two parents I think we can relatively easily
> determine which is the subtree parent. If that's not sufficiently
> general (or broadly useful outside of our context) we could just
> create a helper script wrapping `subtree split` tailored to the
> FreeBSD cases. We have something like 100 projects we're looking to
> split, as part of our svn to git migration.

The new `use` command might be a better fit than `--onto` in this case - it does the same thing as `--onto`, except that it also marks the commit as processed and therefore excludes it from the initial rev-list.

Actually, on reading the code, I’m not sure `--onto` does quite what the documentation suggests - by updating the cache it will shortcut processing of subtree commits that have already been merged into mainline, but it has no mechanism for building onto an existing unrelated history.

Reliably differentiating subtree and mainline commits has always been tricky, but should be OK as part of an advanced flag or new command. Perhaps use `rev-list --merges <path>` to find potential unmarked subtree merges, then take the one where a parent's root tree matches the post-merge subdir tree. No doubt it won’t catch everything, but I’d say missing some merges is less of a risk than false positives.
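
Roughly what I have in mind, as an untested sketch rather than anything from the PR. The names `run_git` and `find_subtree_merges` are made up for illustration, and I'm assuming that the merge's parents should be read without the path limit, since path-limited `rev-list --parents` rewrites parents:

```python
import subprocess
from typing import Callable, List, Tuple


def run_git(*args: str) -> str:
    """Run git in the current repository and return its stripped stdout."""
    result = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def find_subtree_merges(
    subdir: str, rev: str = "HEAD", git: Callable[..., str] = run_git
) -> List[Tuple[str, str]]:
    """Return (merge, subtree_parent) pairs for candidate subtree merges.

    Hypothetical helper: a merge is a candidate when one parent's root
    tree is identical to the tree of subdir in the merge commit itself.
    """
    matches = []
    # Merge commits reachable from rev, limited to history affecting subdir.
    merges = git("rev-list", "--merges", rev, "--", subdir).split()
    for merge in merges:
        try:
            # Tree object for subdir as it exists right after the merge.
            subdir_tree = git(
                "rev-parse", "--verify", "--quiet", f"{merge}:{subdir}"
            )
        except subprocess.CalledProcessError:
            continue  # subdir does not exist in this merge commit
        # Real (unsimplified) parents; the first field is the merge itself.
        parents = git("rev-list", "--parents", "-n", "1", merge).split()[1:]
        for parent in parents:
            # Compare the parent's root tree with the post-merge subdir tree.
            if git("rev-parse", f"{parent}^{{tree}}") == subdir_tree:
                matches.append((merge, parent))
    return matches
```

Run from inside the repository, `find_subtree_merges("contrib/foo")` (path is only an example) would list each candidate merge together with the parent that looks like the subtree side. Zero results or more than one would be the signal to stop and check by hand before starting a long split.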

In the context of a helper script, a new command or an `--auto` flag on `use` might be better than adding a flag to `split` - that way you could easily tell whether the expected initial state was found, rather than having to wait for the full process to produce something weird.

That would also let you mark the other side of the merge as ignored mainline history - a significant optimization when you’re excluding 200k commits, but risky to include more generally.