Beyond Merge and Rebase: The Upstream Import Approach in Git

Aleksander Korzyński <ak@xxxxxxxxxx> · Tue, 11 Jul 2023 10:24:42 +0200

Hello,

Git users often have to make a choice: to merge or rebase. I'm going
to describe a third way that has the characteristics of both and is
very well suited for tracking an open-source project or any other
upstream branch. I'm looking for feedback on the approach.

MERGE OR REBASE?

Let's assume that you have forked an upstream open-source repository
and keep the fork in your own repo. The default branch of the upstream
repository is called "main", and so is the default branch of your fork.
You have made a few changes to the source code and committed them to
the "main" branch of your fork. In the meantime, new changes have been
committed to the upstream "main" branch of the project. How do you
import the upstream changes to your fork?

Let's assume that your local fork also contains a branch called
"upstream/main", which reflects the state of the upstream's "main"
branch. So the "main" branch contains your own changes and the
"upstream/main" branch contains the community's changes:

  time -->

  o---o---o---o---o  upstream/main
       \
        o---o---o  main

So a different way to ask the question is: how do you bring
upstream/main's changes into main?

One solution is to merge "upstream/main" into "main":

  o---o---o---o---o  upstream/main
       \           \
        o---o---o---M  main

The merge above would certainly work, but it becomes problematic as
time passes and you get a lot of these merges in your "main" branch.
You then no longer have visibility into the differences between
"upstream/main" and "main", because your commits get lost deep in the
history of the branch, as illustrated below:

  o---o---o---o---o---o---o---o---o---o---o  upstream/main
       \           \       \       \       \
        o---o---o---M---o---M---o---M---o---M  main

So the alternative solution is to rebase your "main" branch on top of
"upstream/main":

  o---o---o---o---o  upstream/main
                   \
                    o'---o'---o'  main

You now have the advantage of having greater visibility into the
differences between "upstream/main" and "main". However, a rebase
comes with a different problem: if any user of your fork has the
"main" branch checked out in their local repository and runs "git
pull", they are going to get an error stating that their local branch
and its remote counterpart have diverged. They will have to take
special steps to recover from the rebase of the "main" branch.

So how do you solve that problem?

THE THIRD WAY - UPSTREAM IMPORT

The proposed third way is a special operation that (in the described
use case) has the advantages of both a merge and a rebase, without the
disadvantages. The approach is illustrated below:

  o---o---o---o---o  upstream/main
       \           \
        \           o'---o'---o'
         \                     \
          o---o---o-------------S  main

First, the divergent commits from "main" are rebased on top of
"upstream/main", but then they are combined back with "main" using a
special merge commit, which has a custom strategy: it replaces the old
content of "main" with the new rebased content. This last commit is
the secret sauce of this solution: the commit has two parents, like an
ordinary merge, but has the semantics of a rebase.
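
To make the shape of that commit concrete, here is a toy model in
Python. It is only a sketch under loud assumptions: it does not call
git and it is not how git-upstream is implemented. A commit's "tree" is
modelled as the set of patches it contains, and the names Commit,
child, is_ancestor and upstream_import are made up for this
illustration. It only handles the first import of a fork.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    name: str                 # unique label, stands in for the commit hash
    tree: FrozenSet[str]      # patches contained, stands in for file content
    parents: Tuple["Commit", ...] = ()


def child(parent: Optional[Commit], name: str, patch: str) -> Commit:
    """Create a commit that applies one patch on top of its parent."""
    base = parent.tree if parent else frozenset()
    return Commit(name, base | {patch}, (parent,) if parent else ())


def chain(start: Optional[Commit], names: List[str]) -> Commit:
    """Build a linear run of commits, one patch per commit."""
    tip = start
    for n in names:
        tip = child(tip, n, n)
    return tip


def is_ancestor(old: Commit, new: Commit) -> bool:
    """True if `old` is reachable from `new` through parent links."""
    stack, seen = [new], set()
    while stack:
        c = stack.pop()
        if c.name == old.name:
            return True
        if c.name not in seen:
            seen.add(c.name)
            stack.extend(c.parents)
    return False


def upstream_import(main: Commit, upstream: Commit) -> Commit:
    # 1. Collect the local patches: walk back from main until we reach
    #    a commit that upstream already contains (the fork point).
    local: List[str] = []
    c = main
    while not is_ancestor(c, upstream):
        local.append(c.name)
        c = c.parents[0]
    # 2. "Rebase": replay them, oldest first, on top of upstream.
    rebased = upstream
    for patch in reversed(local):
        rebased = child(rebased, patch + "'", patch)
    # 3. Special merge commit S: two parents like a merge, but its
    #    content is taken verbatim from the rebased side.
    return Commit("S", rebased.tree, (main, rebased))


# The graph from the diagram above: fork after the second upstream commit.
fork = chain(None, ["u1", "u2"])
upstream = chain(fork, ["u3", "u4", "u5"])
old_main = chain(fork, ["m1", "m2", "m3"])
new_main = upstream_import(old_main, upstream)

# The old tip of main is still in the history of the new tip.
print(is_ancestor(old_main, new_main))
# The local patches are exactly what separates main from upstream/main.
print(sorted(new_main.tree - upstream.tree))
```

In this model the old tip of "main" is an ancestor of S, which is the
property that lets a downstream "git pull" fast-forward, while the
content of S is identical to the rebased tip.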

The structure above has the advantages of both a merge and a rebase.
On the one hand, just like with an ordinary merge, a user who runs
"git pull" on their local copy of "main" is not going to see the error
about divergent branches. On the other hand, just like with an
ordinary rebase, there is visibility into the last imported commit
from "upstream/main" and the differences between that commit and the
tip of "main".

DROPPING PATCHES

What is supposed to happen if one of the commits from "main" is ported
to "upstream/main", as illustrated below?

  o---o---o---A'---o  upstream/main
       \
        \
         \
          A---B---C  main

In that case, the upstream importing operation should drop that patch,
as illustrated below:

  o---o---o---A'---o  upstream/main
       \            \
        \            B'---C'
         \                 \
          A---B---C---------S  main

But how would the upstream importing operation know which patches to
drop? There are two ways.

Firstly, it can look at Git's patch-id, which is a SHA-1 hash of the
diff with line numbers and whitespace ignored. This is the same
strategy that rebase uses to drop commits that are already upstream.
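
The idea behind a patch-id can be sketched in a few lines of Python.
This is a simplified approximation, not the algorithm used by the "git
patch-id" command, and it will not produce the same hashes; the
functions simple_patch_id and drop_landed and the sample diffs are
hypothetical. It only shows why the same change applied at a different
line offset still compares as equal.

```python
import hashlib


def simple_patch_id(diff_text: str) -> str:
    """Hash a unified diff, skipping the parts that depend on position."""
    h = hashlib.sha1()
    for line in diff_text.splitlines():
        if line.startswith("@@"):       # hunk header: carries line numbers
            continue
        if line.startswith("index "):   # blob hashes: depend on the base
            continue
        h.update("".join(line.split()).encode("utf-8"))  # ignore whitespace
        h.update(b"\n")
    return h.hexdigest()


def drop_landed(local_diffs, upstream_diffs):
    """Keep only the local diffs whose id does not appear upstream."""
    landed = {simple_patch_id(d) for d in upstream_diffs}
    return [d for d in local_diffs if simple_patch_id(d) not in landed]


# Patch A as it exists in the fork.
A_LOCAL = """\
--- a/app.py
+++ b/app.py
@@ -10,3 +10,4 @@
 def run():
+    setup()
     start()
"""
# The same change upstream, where the file has grown above the hunk.
A_UPSTREAM = A_LOCAL.replace("@@ -10,3 +10,4 @@", "@@ -42,3 +42,4 @@")
# A different local patch that has not landed upstream.
B_LOCAL = A_LOCAL.replace("setup()", "configure()")

remaining = drop_landed([A_LOCAL, B_LOCAL], [A_UPSTREAM])
print(len(remaining))
```

With these sample diffs, A is recognised as already present upstream
and only B would be replayed.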

Secondly, it can use an arbitrary change-id associated with a commit
(for example, for projects that use Gerrit, it can be Gerrit's
Change-Id, which is saved in the commit message). This is useful when
a given patch lands upstream in a slightly changed form, but is meant
to replace the version in "main".
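
Matching on a change-id could look roughly like the Python sketch
below. It assumes the Gerrit convention of a "Change-Id: I<40 hex
digits>" line in the commit message; the functions change_id and
patches_to_keep and the sample messages are made up for illustration
and are not taken from git-upstream.

```python
import re

# Gerrit-style footer line: "Change-Id: I" followed by 40 hex digits.
CHANGE_ID = re.compile(r"^Change-Id:\s*(I[0-9a-f]{40})\s*$", re.MULTILINE)


def change_id(message: str):
    """Return the last Change-Id found in a commit message, or None."""
    matches = CHANGE_ID.findall(message)
    return matches[-1] if matches else None


def patches_to_keep(local_messages, upstream_messages):
    """Drop local commits whose Change-Id has already landed upstream."""
    landed = {change_id(m) for m in upstream_messages} - {None}
    return [m for m in local_messages if change_id(m) not in landed]


ID_A = "I" + "a" * 40   # placeholder ids, not real Gerrit changes
ID_B = "I" + "b" * 40

local = [
    "Add retry logic\n\nChange-Id: " + ID_A + "\n",
    "Tune timeouts\n\nChange-Id: " + ID_B + "\n",
]
# Upstream accepted patch A in a reworded, slightly different form.
upstream = [
    "Add retry logic (with backoff)\n\nChange-Id: " + ID_A + "\n",
    "Unrelated upstream fix\n",
]

kept = patches_to_keep(local, upstream)
print([m.splitlines()[0] for m in kept])
```

Unlike the patch-id comparison, this still matches when the upstream
version of the patch differs in content, because only the identifier in
the commit message is compared.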

IMPLEMENTATION

The solution above has already been implemented in an open-source
Python script called git-upstream[1], published 10 years ago. It was
originally implemented for the OpenStack project, but the solution is
generic and applicable to any open-source project. It would be
easier for users to benefit from the ideas behind git-upstream if the
functionality were integrated directly into git.

Would you like to see the above functionality integrated directly into git?

Best regards,
Aleksander Korzynski

www.linkedin.com/in/akorzy
www.devopsera.com/blog

P.S.

For completeness, I'm providing links to alternative solutions for
tracking patches:

* git-upstream[1] uses the strategy described above
* quilt[2] uses patch files saved in a source code repository
* StGit[3] is inspired by quilt and uses git commits to store patches
* MQ[4] is also inspired by quilt and implements a patch queue in Mercurial

[1] https://opendev.org/x/git-upstream
[2] https://savannah.nongnu.org/projects/quilt
[3] https://stacked-git.github.io
[4] https://wiki.mercurial-scm.org/MqExtension