> On 19 Dec 2019, at 4:58 am, Ed Maste <emaste@xxxxxxxxxxx> wrote:
>
> On Tue, 17 Dec 2019 at 19:17, Tom Clarkson <tqclarkson@xxxxxxxxxx> wrote:
>>
>> The algorithm I am looking at to replace the file based mainline detection is
>>
>> - If subtree root is unknown (as on the initial split), everything is mainline.
>>
>> - If subtree root is reachable and mainline root is not, it’s a subtree commit
>>
>> - Otherwise, treat as mainline. This will also pick up commits from other subtrees but they hopefully won’t contain the subtree folder. I don’t think there is an unambiguous way to distinguish a subtree merge from a regular merge - the message produced is pretty generic. It may be possible to check reachability of all known subtrees, but that adds a fair bit of complexity.
>>
>> That leaves us with the question of how to record the empty mainline commits. The most correct result for your repro is probably four commits (add/delete everything/restore/modify), but I can see that falling over in a scenario where deleting a subtree is more like unlinking a library than editing that library to do nothing.
>>
>> Is it sufficiently correct for your scenario to treat ‘restore file1’ as the initial subtree commit?
>
> My reproduction scenario is really a demonstration of the real issue I
> encountered. Running the initial "subtree split" on the real repo
> takes about 40 minutes so I wanted something trivial that shows the
> same issue. In the demonstration case (i.e., actually removing and
> readding the subtree) I think it's reasonable to start with the commit
> that added it back.
>
> Overall I think your proposed algorithm is reasonable (even though I
> think it won't address some of the cases in our repo). Will your
> algorithm allow us to pass $dir to git rev-list, for the initial
> split?

Is this just for performance reasons?
As I understand it that was left out because it would exclude relevant commits on an existing subtree, but it could make sense as an optimization for the first split of a large repo.

> My actual issue stems from the way svn2git converted some odd svn
> history, and is described in more detail on the freebsd-git mailing
> list at https://lists.freebsd.org/pipermail/freebsd-git/2019-November/000218.html.
>
> Perhaps we can have some command-line options to provide metadata for
> cases that cannot be inferred? The cases in our repo come from svn2git
> creating subtree merges to represent updates from vendor code. AFAIK
> these should be basically identical to what subtree creates, except
> that we don't have any of the metadata it adds.

The existing --onto option comes pretty close - it marks everything in the rev-list of $onto as a subtree commit to be used as-is.

For more flexibility, I think allowing more manipulation of the cache is the way to go - $cachedir is currently based on process id, but I don’t see any reason it can’t be based on prefix instead. So the process becomes something like

# clear the cache - shouldn't usually be necessary, but it's a universal debugging step.
git subtree clear-cache --prefix=dir

# ref and all its parents are before subtree add. Treat any children as initial commits.
git subtree ignore --prefix=dir ref

# ref and all its parents are known subtree commits to be included without transformation.
git subtree existing --prefix=dir ref

# Override an arbitrary mapping, either for performance or because that commit is problematic
git subtree map --prefix=dir mainline-ref subtree-ref

# Run the existing algorithm, but skipping anything defined manually
git subtree split --prefix=dir

> For a concrete example (from the repo at
> https://github.com/freebsd/freebsd), 7f3a50b3b9f8 is a mainline commit
> that added a new subtree, from 9ee787636908. I think that if I could
> inform subtree split that 9ee787636908 is the root it would work for
> me.

Aside from the metadata, that one is a bit different from a standard subtree add in that it copies three folders from the subtree repo rather than the root - so the contents of contrib/elftoolchain will never exactly match the actual elftoolchain repo, and 9ee787636908 is neither mainline nor subtree as subtree split understands it.

If you ignore 9ee787636908, the resulting subtree will be fairly clean, but won’t have much of a relationship to the external repo.

If you treat 9ee787636908 as an existing subtree, the second commit on your subtree will be based on 7f3a50b3b9f8, which deletes most of the contents of the subtree. You should still be able to merge in updates from the external repo, but if you try to push changes upstream the deletion will break things.
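
For reference, the three classification rules quoted at the top of this message could be sketched in shell roughly as follows. This is only an illustration under stated assumptions, not the actual patch: classify_commit and its arguments are hypothetical names that do not exist in git-subtree.sh, "reachable" is read as "is an ancestor of the commit" and tested with git merge-base --is-ancestor, and a single mainline root is assumed.

```shell
# classify_commit REV SUBTREE_ROOT MAINLINE_ROOT
# Prints "subtree" or "mainline" for REV.
# Hypothetical helper for illustration; not part of git-subtree.sh.
classify_commit () {
	rev=$1 subtree_root=$2 mainline_root=$3

	# Rule 1: subtree root unknown (as on the initial split),
	# so everything is mainline.
	if test -z "$subtree_root"
	then
		echo mainline
		return
	fi

	# Rule 2: subtree root is reachable from REV and the mainline
	# root is not, so REV is a subtree commit.
	if git merge-base --is-ancestor "$subtree_root" "$rev" &&
		! git merge-base --is-ancestor "$mainline_root" "$rev"
	then
		echo subtree
		return
	fi

	# Rule 3: otherwise treat as mainline. This includes merges that
	# can reach both roots, and commits from other subtrees.
	echo mainline
}
```

With an empty second argument, e.g. classify_commit "$rev" "" "$mainline_root", the function prints mainline for any $rev, which matches the initial-split rule.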