[RFC] Replace rebase with filtering

Blanket request, so I don't have to keep repeating it: Please correct me if I'm wrong about anything below. I'm more familiar with git than I was a month ago, but still not an expert, so I could be totally off (re)base here. With that note of confidence...

We have a setup like this:

   (external)
       |
  local master
       |
  integration
 /     |     \
dev1   dev2  dev3

We pull changes from the external repository (actually a Subversion repo) into a local master. The integration repo is a clone of that. That's our local setup, but the particulars don't matter here -- I'm just using it as an example.

Ideally, we'd rebase the integration area against changes pulled from master, then each dev repository would rebase against the changes from the integration area. That would keep our histories nice and clean as we pull changes down from the external repository.

But of course that falls apart as soon as the integration repo has any changes of its own when we pull from master. Rebase turns those existing changes into new revisions that don't match any previously known ones in the dev repositories, so the dev sandboxes end up re-applying them.
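To make the "new revisions" point concrete, here is a rough Python sketch of how a commit ID is derived, assuming the standard SHA-1 object format (a `commit <size>` header, a NUL byte, then the commit body). The tree hash, parent hashes, name, and timestamp are made-up placeholders. Since the parent ID is part of what gets hashed, replaying the same change onto a different parent yields a different ID, and a downstream repo holding the old ID has no way to recognize the new one as the same change.

```python
import hashlib

def commit_id(tree, parent, author, committer, message):
    """Compute a git-style commit ID from its fields (SHA-1 object format)."""
    body = (
        f"tree {tree}\n"
        f"parent {parent}\n"
        f"author {author}\n"
        f"committer {committer}\n"
        f"\n{message}\n"
    ).encode()
    header = f"commit {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

# Placeholder values, not taken from any real repository.
TREE = "a" * 40
OLD_BASE = "b" * 40
NEW_BASE = "c" * 40
WHO = "Dev One <dev1@example.com> 1136246400 +0000"

original = commit_id(TREE, OLD_BASE, WHO, WHO, "Fix widget")
rebased = commit_id(TREE, NEW_BASE, WHO, WHO, "Fix widget")

# Same author, same message, same tree: only the parent differs.
print(original != rebased)  # True
```

In a real rebase the tree and the committer line would usually change as well; the sketch holds them fixed only to show that the parent alone is enough to change the ID.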

So at the moment, as far as I can see, the only option is to use merge rather than rebase everywhere but the leaf nodes of our repository tree, and just live with the cluttered history. The developers will at least have clean *local* histories, but they'll be rebasing onto a cluttered history from the integration repo.

However, they may not want to, even if they can: as soon as I rebase, unless everyone is very careful, I have just prevented other developers from pulling my local commits into their local repositories before I've pushed my stuff up to the integration area. Sibling-to-sibling pulls -- which I hope we all agree are a very useful feature of systems like git -- have exactly the same rebase problem as parent-to-child pulls: you'll end up re-applying the same changes if the target repo had an earlier version of a newly-rebased chain of commits. So even in our development repos, I suspect we'll want to avoid rebasing unless we're certain we won't ever need to share changes directly with each other, and just live with the clutter.

All of which made me think, gee, it'd sure be nice if there were a way to filter out those excess merges when we view our branch history. I think all it would take would be to mark a merge commit as a rebase-ish update (rather than an actual integration where the merge itself is an event that's important to us), and then, if the user chose, those merges could be discarded from views of the branch history.
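A minimal sketch of what such a filtered view might look like, in Python over a toy commit graph. The `update_merge` flag is purely hypothetical (git has no such marker today), and walking only first parents is my simplifying assumption about what "the branch history" means here.

```python
from dataclasses import dataclass, field

@dataclass
class Commit:
    id: str
    message: str
    parents: list = field(default_factory=list)  # first parent = the branch's own line
    update_merge: bool = False  # hypothetical marker for "rebase-ish" merges

def branch_history(tip, commits, hide_update_merges=True):
    """Walk first parents from tip, optionally skipping marked update merges."""
    shown = []
    current = tip
    while current is not None:
        c = commits[current]
        if not (hide_update_merges and c.update_merge):
            shown.append(c.id)
        current = c.parents[0] if c.parents else None
    return shown

# Toy history: local commits B, C, D with two update merges from upstream.
commits = {c.id: c for c in [
    Commit("A", "Initial import"),
    Commit("B", "Local work 1", ["A"]),
    Commit("U1", "Upstream change 1", ["A"]),
    Commit("M1", "Update from upstream", ["B", "U1"], update_merge=True),
    Commit("C", "Local work 2", ["M1"]),
    Commit("U2", "Upstream change 2", ["U1"]),
    Commit("M2", "Update from upstream", ["C", "U2"], update_merge=True),
    Commit("D", "Local work 3", ["M2"]),
]}

print(branch_history("D", commits))
# ['D', 'C', 'B', 'A']
print(branch_history("D", commits, hide_update_merges=False))
# ['D', 'M2', 'C', 'M1', 'B', 'A']
```

Nothing is rewritten: the merges are still in the graph, so anyone who pulled the unfiltered history keeps a consistent view.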

And then it occurred to me: if we had that, would we actually need rebase at all? As far as I know, rebase is all about aesthetics, not functionality; the reason you rebase instead of merge is that you don't want to wade through zillions of irrelevant merges when you browse your project's history. But if those merges are simply not shown to you, it shouldn't matter that they exist. Yes, you will have more objects in your repository, but with git's delta compression, you might not even notice the difference.

Rebase has another undesirable property aside from messing up downstream clones when they try to pull your latest changes. Since rebase preserves the original author timestamps on your local commits, you almost always end up with a history that says most of your local commits happened before the commit they claim to have been branched from. That's not too big a problem in most cases, but it sure isn't very clean. The underlying problem is that when you rebase you lose not only the history of your intermediate updates, but also the history of your original branch creation.
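A toy illustration of the timestamp oddity, in Python. The commit messages and dates are invented for the example; the only assumption carried over from the text is that the rebased commits keep their original dates.

```python
from datetime import datetime

def out_of_order(base_date, commits):
    """Return rebased commits whose preserved date precedes their new base."""
    return [msg for msg, authored in commits if authored < base_date]

# Hypothetical local commits, written before the upstream change landed.
local_commits = [
    ("Add parser", datetime(2006, 1, 4)),
    ("Fix parser", datetime(2006, 1, 5)),
]
# Hypothetical upstream commit that the local work was rebased onto.
new_base_date = datetime(2006, 1, 18)

print(out_of_order(new_base_date, local_commits))
# ['Add parser', 'Fix parser']
```

Both commits now sit on top of a base that is dated two weeks after them.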

Filtering rather than rewriting history would fix all of that. We could easily report that branch XYZ was forked from branch ABC on 01/03/2006 and is up to date with all of ABC's changes up to 01/18/2006 (i.e., display information about the initial branch and the most recent update merge, but none of the ones in between.) The timestamps all make sense because you haven't lost the history. And if someone has cloned your repository, they can keep pulling updates without anything breaking.
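As a sketch of that report, assuming we can read the fork date and the dates of the update merges straight out of the unrewritten history (the function name and the intermediate 01/09/2006 merge are made up for illustration):

```python
from datetime import date

def branch_summary(branch, upstream, forked_on, update_merges):
    """Summarize a branch from its fork date and the dates of its update merges."""
    # Only the most recent update merge matters for the report.
    latest = max(update_merges) if update_merges else forked_on
    return (
        f"{branch} was forked from {upstream} on {forked_on:%m/%d/%Y} "
        f"and is up to date with {upstream} through {latest:%m/%d/%Y}"
    )

summary = branch_summary(
    "XYZ", "ABC",
    forked_on=date(2006, 1, 3),
    update_merges=[date(2006, 1, 9), date(2006, 1, 18)],
)
print(summary)
# XYZ was forked from ABC on 01/03/2006 and is up to date with ABC through 01/18/2006
```

The intermediate merges stay in the repository; they are just left out of the summary.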

It might also be possible to implement an after-the-fact rebase to reduce the number of excess commits: a command that rebases all the update merges older than a certain age, on the theory that you can usually put an upper bound on how out-of-date someone's clone of your repo is allowed to get. That would rewrite just the ancient history, not touching anything recent, and would mark the newly-created revisions such that pull could skip fetching them if the target already has the more recent revisions. (The revision at which you stop rebasing should end up with the same revision ID as the one in the actual history, since the contents should match.) If that's not clear I can draw a picture. I haven't thought that bit through too much, so it might not be feasible in the end, but it seems like in theory we have all the information we need to resolve conflicts, etc.

Comments? Am I fundamentally misunderstanding how rebase works and/or why the documentation warns people away from using it in repos that might be pulled from?

-Steve
