Re: RFD: best way to automatically rewrite a git DAG as a linear history?

Jon Seymour <jon.seymour@xxxxxxxxx> · Sat, 20 Feb 2010 13:25:11 +1100

On Sat, Feb 20, 2010 at 7:20 AM, Avery Pennarun <apenwarr@xxxxxxxxx> wrote:
> On Fri, Feb 19, 2010 at 2:29 AM, Jon Seymour <jon.seymour@xxxxxxxxx> wrote:
>> If there are no merge conflicts in the original history, then there
>> will be no merge conflicts in the rewritten history, and therefore no
>> error deltas.
>
> I'm just worried that this is a bit misleading.  Just because there
> are no conflicts at the end doesn't mean these generated interim
> versions ever compiled or worked, does it?
>
> ... example elided
>
> Now I linearize it in the way you propose, removing the "unnecessary"
> merges but keeping the developer's conflict resolutions.  What I end
> up with is the last code segment above - but I *don't* have the rest
> of the patch that added the extra parameter to g.  So my conflict
> resolution is wrong for the code that remains.  And the delta fixup
> doesn't show that there was anything weird.
>

The thing is, if I linearised in the way I proposed, there would be a
conflict during the rebase of one branch of the merge onto the other -
so the conflict would still be there.

However, what happens is that rather than stopping to manually correct
the conflict, I compensate for the conflict by introducing a patch
that restores the file to the state it was in when the conflicting
commit was originally committed. This guarantees that all future picks
for that file will apply correctly.
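
A minimal sketch of that compensation step, in Python rather than git
commands. It assumes a toy model that is not part of the proposal
itself: a tree is a dict from path to content, a patch records the
(before, after) content per file, and a pick conflicts on a file
whenever the current content differs from the patch's "before". The
names make_patch, conflicts, compensate and pick_with_compensation, and
the file names and contents, are made up for illustration; partial-file
merges and deletions are ignored.

```python
from typing import Dict, List, Optional, Tuple

Tree = Dict[str, str]                           # path -> file content
Patch = Dict[str, Tuple[Optional[str], str]]    # path -> (before, after)


def make_patch(parent: Tree, child: Tree) -> Patch:
    """Per-file change a commit made relative to its original parent."""
    return {p: (parent.get(p), c) for p, c in child.items()
            if parent.get(p) != c}


def conflicts(tree: Tree, patch: Patch) -> List[str]:
    """Files whose current content is not what the patch expects."""
    return sorted(p for p, (before, _) in patch.items()
                  if tree.get(p) != before)


def compensate(tree: Tree, patch: Patch, paths: List[str]) -> Tree:
    """Compensating commit: put `paths` back to their pre-patch state."""
    out = dict(tree)
    for p in paths:
        before = patch[p][0]
        if before is None:
            out.pop(p, None)      # file did not exist before the patch
        else:
            out[p] = before
    return out


def pick_with_compensation(tree: Tree, patch: Patch) -> List[Tuple[str, Tree]]:
    """Apply `patch`, inserting a compensation first if it would conflict."""
    steps = []
    bad = conflicts(tree, patch)
    if bad:
        tree = compensate(tree, patch, bad)
        steps.append(("compensation", tree))
    picked = dict(tree)
    picked.update({p: after for p, (_, after) in patch.items()})
    steps.append(("pick", picked))
    return steps


# Hypothetical data: upstream changed g.c, and so did the commit being picked.
original_parent = {"g.c": "g0", "dev.c": "d0"}
original_commit = {"g.c": "gN", "dev.c": "d0"}
rebased_tip = {"g.c": "gC", "dev.c": "d0"}

steps = pick_with_compensation(rebased_tip,
                               make_patch(original_parent, original_commit))
```

In this toy case the pick is preceded by one compensation that rewinds
g.c, after which the pick applies cleanly by construction.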

You are entirely correct that the compensated file is no longer
consistent with the rest of the tree at that point in the history.
However, once the rewritten history passes the original merge, the
consistency of the tree will be restored (reason: by design the
rewritten history always restores the state of the tree at the point
in the rewritten history that matches a merge in the original history).
The inconsistencies are thus limited in scope [ and, as you will see,
well delimited ].

> Unless I've misunderstood something, you've thrown away the
> *advantage* that was autodetection of conflicts, in favour of having
> to eyeball it.  I'm not sure there's an advantage there.
>

The conflicts are detected and clearly marked in the history. What I
have done is simply defer the resolution of the conflicts to a point
of my later choosing so that I can continue with the linearisation
process automatically.

One nice consequence of my linearisation is that (even in the presence
of compensations) any diff between two points in the linearised
history will only show files touched by the history - not files that
changed in the upstream branch. True, some of these diffs will produce
nonsensical results (in particular diffs within a pair of compensating
deltas are not necessarily useful, nor are diffs that include one
compensation but not the other). However, diffs across histories that
do include both compensations will be sensible.

An example:

A-B-C-D-E-F-G-H
 \       \     \
  \-M-N-P-Q-R-S-T

(Merge at Q between P and E, merge at T between S and H)

Suppose a change made in C conflicts with a change made in N.

This was eventually resolved in the merge commit Q. There were no
conflicts in subsequent histories after E and Q (e.g. the merge at T
was clean).

My linearisation would first rebase on E.

    A B C D E M' e(^N) N' P' e(Q^) R' S' T'

e(^N) returns the conflicted files to the state they were in before N
was applied.
e(Q^) returns the conflicted files to the state they were in after the
merge at Q.

The conflict between C and N forced me to backtrack the conflicted
files to the state they were in before N. This is done by the
compensation e(^N).
The commits N and P are then repicked and are guaranteed to succeed
(by definition - the conflicted files are now in the same state as
they originally were).
A reverse compensation e(Q^) is applied to ensure that R' sees the
tree as it was at Q.
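
The first rebase (onto E) can be sketched in the same spirit. This is
a toy simulation, not the actual tool: trees are dicts of path ->
content, a conflict is a file-level mismatch, and the function
linearise, the helper delta, and the files g.c, up.c and dev.c with
their contents are all invented for the example. It assumes every
conflicted file already exists in the commit's original parent.

```python
def delta(parent, child):
    """Per-file (before, after) changes between two snapshots (no deletions)."""
    return {p: (parent.get(p), c) for p, c in child.items()
            if parent.get(p) != c}


def linearise(base, first_parent, commits, merge_name, merge_tree):
    """Replay `commits` (name, snapshot) onto `base`, compensating on conflict."""
    history, tree, prev, rewound = [], dict(base), first_parent, []
    for name, snap in commits:
        d = delta(prev, snap)
        bad = sorted(p for p, (before, _) in d.items()
                     if tree.get(p) != before)
        if bad:
            # Forward compensation: rewind conflicted files to the state
            # they had in the commit's original parent.
            tree = {**tree, **{p: prev[p] for p in bad}}
            history.append((f"e(^{name})", tree))
            rewound += bad
        # The pick itself now applies cleanly.
        tree = {**tree, **{p: after for p, (_, after) in d.items()}}
        history.append((name + "'", tree))
        prev = snap
    if rewound:
        # Reverse compensation: conflicted files take their post-merge state.
        tree = {**tree, **{p: merge_tree[p] for p in rewound}}
        history.append((f"e({merge_name}^)", tree))
    return history


# Hypothetical snapshots for the DAG above (only the ones needed here).
A = {"g.c": "g0", "up.c": "u0", "dev.c": "d0"}
E = {"g.c": "gC", "up.c": "uE", "dev.c": "d0"}   # C touched g.c upstream
M = {**A, "dev.c": "dM"}
N = {**M, "g.c": "gN"}                           # conflicts with C
P = {**N, "dev.c": "dP"}
Q = {"g.c": "gQ", "up.c": "uE", "dev.c": "dP"}   # merge with resolution gQ

history = linearise(E, A, [("M", M), ("N", N), ("P", P)], "Q", Q)
names = [name for name, _ in history]
trees = dict(history)
```

With this data the rewritten sequence comes out as M' e(^N) N' P'
e(Q^), the tree at e(Q^) equals the original tree at Q, and the tree at
N' mixes the old g.c with the new up.c, which is the "slipped backwards
in time" inconsistency.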

The algorithm then rebases this history onto H.

    A B C D E F G H M'' e(^e(^N)) N'' P'' e(e(Q^)^) R'' S''

The following is true:

* the tree at e(e(Q^)^) is identical to the state of the tree at Q
* the tree at S'' is identical to the state of the tree at T
* the trees at M'' R'' S'' are consistent with the states that would
have resulted if rebases had been performed at Q and T with the same
conflict resolution outcomes.
* the trees at N'' P'' are internally inconsistent because some (but
not all) files have 'slipped backwards in time'
* the diff between M'' and R'' will be equivalent to the diff
between M and R
* diffs between any two points from M'' to S'' will only show files
touched by edits at M, N, P, R, S or the merge Q (the intermediate
rebasing on E and H dissolves the merge at T)
* the series e(^e(^N)) N'' P'' e(e(Q^)^) could be squashed into a
single commit that would correspond to the edits done in N and P and
the conflict resolution done in Q.
* the merge history ^A T  has been automatically rewritten as a rebase
history ^H S''
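
The squashing claim in the list above can be illustrated with a small
sketch. It assumes patches are modelled as per-file (before, after)
pairs, which is a simplification and not how git stores commits; the
function squash and the file names and contents are hypothetical.

```python
def squash(patches):
    """Compose per-file (before, after) patches applied in order into one."""
    net = {}
    for patch in patches:
        for path, (before, after) in patch.items():
            # Keep the earliest "before" and the latest "after" per file.
            first_before = net[path][0] if path in net else before
            net[path] = (first_before, after)
    # Drop files whose net effect is no change at all.
    return {p: (b, a) for p, (b, a) in net.items() if b != a}


# Hypothetical contents: gC is upstream's g.c, gQ the resolution made at Q.
series = [
    {"g.c": ("gC", "g0")},      # e(^e(^N)): rewind g.c to its pre-N state
    {"g.c": ("g0", "gN")},      # N''
    {"dev.c": ("dM", "dP")},    # P''
    {"g.c": ("gN", "gQ")},      # e(e(Q^)^): install the resolution from Q
]
net = squash(series)
```

The net patch takes g.c straight from upstream's version to the
resolved version and carries the dev.c edit, so the intermediate
rewind no longer appears in the squashed commit.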

An interesting possibility would be to squash  e(^e(^N)) N'' e(e(Q^)^)
as N''' and then edit P'' to be consistent with a new base, N'''.  (In
fact, in this case P'' would simply be the diff between N''' and Q -
it gets more complicated if there is a series of intervening commits
between N and P).

If this worked (and there is no guarantee it would), then it would
have the effect of folding the conflict resolution with C performed at
Q into a re-edited commit N''' which is where it would have occurred
had ^A N been rebased on C in the first place.

>> In the no  conflict case, it is not clear to me that the history
>> resulting from your script is immediately rebaseable, precisely
>> because of the presence of the merge commits [ feel free to correct me
>> if I am wrong about that ] . With my approach, the merge commits
>> dissolve away - there is nothing to edit.
>
> I'm pretty sure that in the absence of conflicts, you could rebase -i
> my linearization and just remove the merge commits by hand, and things
> should go pretty smoothly.  Or in the simplest cases (ie. the merged
> code is identical), rebase would notice that the merge patches have
> already been applied, and simply throw them away.
>
> In any case, I guess if what you're doing works for you, then go for
> it.  But in that case I'm not sure why you asked your original
> question; what about your current method *doesn't* do what you want?
>

Part of the reason I asked is that I was wondering whether someone had
hit upon this solution already, and if they had, I'd take advantage of
it. My discussions with you have certainly helped me think about and
articulate the nature of the inconsistencies that are introduced, so
that has helped too.

> If it's just a question of always auto-resolving conflicts using the
> local merge resolution, you might be interested in -Xours and
> -Xtheirs:
> http://n2.nabble.com/PATCH-0-8-The-return-of-Xours-Xtheirs-Xsubtree-dir-td4069081.html
>
> If you're looking for more general suggestions about what to do when
> untangling a developer's horribly over-merged tree, you may want to
> consider a simple but inelegant solution that I've used myself
> sometimes: just squash the entire diff from upstream to developer's
> version into a single commit, then rip it apart again by hand.  In my
> experience, developers who make messes of merges also don't divide
> their commits into sensible fragments in the first place, so
> re-dividing them yourself afterward is actually the fastest route to
> sanity.

Yep, I agree that's not a bad option in many cases. It's pragmatic,
though rebuilding a history where the tree is internally consistent at
each point is somewhat tedious. Linearisation at least holds out the
possibility of restoring consistency if you really want it for some
reason but allows you to defer the costs of doing so if you don't
really need it.

>
> Hope this helps.
>

It has.

> Have fun,

I will - I have created a github project called hammer where I will
make a realisation of these ideas available for evaluation at some
point.

git@xxxxxxxxxx:jonseymour/hammer.git

(A hammer is, of course, a reasonably good tool for making things flat
although it does tend to break things along the way).

jon.