On Thu, Feb 10, 2011 at 04:13:10PM -0500, John Wiegley wrote: > For the purposes of this discussion, I'd like to define the term "aggregate > identity" (insert better name here) as a set including: a commit's sha, and > zero or more shas stored in a new field named "Origin-Ids". > > If, when cherry-picking, the originating's commit id is stored in the > Origin-Ids field of the cherry-picked commit, then rebase could know whether a > given commit's changes had already been applied. The logic would look like > this: > > 1. When rebasing a branch A onto B, find the common ancestor of A and B. > 2. Examine every commit on B since that common ancestor, collecting a > set of their aggregate identities. > 3. For each commit on A, ignore it if its aggregate identity occurs in > that set. > > This would cause commit 3 to be ignored during the rebase above, since 3' > would have an origin id referring to 3. This can work in some cases, but there are other cases where it might not. For example, consider: 1. I cherry-pick commit X from some branch "topic" onto master as X'. We record "Origin-ID: X" in X'. 2. I rebase "topic" (either onto some other branch, or perhaps I use rebase -i to rewrite some earlier commit). X now becomes some X''. 3. I now rebase "topic" onto master. But we fail to note that X'' matches X, so we try to rebase it. Now, in step 2 we could record "X" as an origin ID of X'' and during the rebase in step 3, calculate the intersection of the origins of X'' and X', and see that they are both just X. And I think maybe you already realize that, since you talk about Origin-IDs as sets. But now you have an interesting question: during which operations does a commit retain its Origin-ID on a source? I think it is pretty clear that a cherry-pick that cleanly applies is probably a good candidate. But what if there is a conflict, and I fix up the conflict? Should I still skip the original commit during the rebase? Maybe, but there are cases where you wouldn't want to. For example, consider this sequence: 1. On my master branch, I have a function foo() which takes one argument. 2. I branch a topic from master. The first commit adds a new caller to foo(). 3. The second commit changes foo() to take two arguments. I fix up the function itself, any old callers, and the new caller. 4. I cherry-pick the second commit onto master. There is a conflict, since one of the callers it updates doesn't exist in master. So I drop that part of the patch. 5. On master I update the implementation of foo(). 6. Now I try to rebase my topic on top of master. We could get a conflict because the second commit from (3) will conflict with the updated implementation in (5). This is more or less the case you described in your initial email, and we'd like git to automatically realize that the conflict is uninteresting. So let's imagine we recorded Origin-IDs as you describe, and we skip it. But that means we are _also_ skipping the part where we update the new caller from the commit in (2), the part that was dropped during conflict resolution. So our end result is broken, because the new caller is still calling with one argument. And there are lots of other cases. What about "git cherry-pick -n"? What about rebasing? If there are no conflicts, is it OK to copy the origin field? How about if there are conflicts? How about in a "git rebase -i", where we may stop and the user can arbitrarily split, amend, or add new commits. How do the old commits map to the new ones with respect to origin fields? So there are lots of corner cases where it won't work, because git is more than happy to give you lots of ways to tweak tree state and history, and it fundamentally doesn't care as much about process as it does about the end states that you reach. That's part of what makes git so flexible, but it also makes niceties like "did I already apply this commit on this branch" much harder to make sense of. Now, I don't want to discourage you from working on this. Because while there are lots of cases where it won't work, there are plenty of cases where it _will_, and it will save rebasers time and effort. So it is worth pursuing, but I think it is also worth keeping things simple and conservative, and not affecting the people who have cases where this won't help. > - Extend commit objects to have an Origin field, which can be zero, one or a > list of hashes. It probably shouldn't be a new header field, but rather a text-style pseudo-header at the end of the commit. But consider for a moment whether you actually want this field in the resulting commit at all, or whether it should be an external annotation. For example, let's say I cherry-pick from a private branch that is going to end up rebased anyway. Now the history for all time will have a commit that refers to some totally useless sha1 that nobody even knows about. We already went through this with cherry-pick. It used to always put "cherry-picked from X..." in the commit message. And then we realized that in many cases, that information is not interesting, because X is not something people actually know about. So now we don't do it by default, but for cases where you are cherry-picking from one long-running branch to another, you can use "cherry-pick -x". So consider instead putting this information into a commit-note for the new commit. Possibly even reversing the direction of the mapping (so that the old commit says "I was cherry-picked to X"). And then when the old, rebased commit goes away, the note will automagically get pruned by the notes-pruning mechanism. There may be reasons why that isn't a good idea, and I haven't thought it through. But I think you should consider it as an alternate implementation and tell me why I'm dumb in that case. ;) > - Add an option to git commit so that one or more origin ids can be specified > at the time any commit is made. There may be occasions when it's useful to > explicitly state that a new commit should somehow 'override' the contents > of another during a rebase. > > - git cherry-pick and git am should add this Origin field, showing the commit > their contents originated from. We already have this to some degree, in the form of "cherry-pick -x". You could do it with "git am", but you would need "git format-patch" to actually generate the information (well, technically speaking it is in the mbox "From " header, but that usually doesn't make it through mail transports for obvious reasons). So I wonder if your proposal can be restructured as: 1. Change rebase to look for cherry-picked-from headers on the --onto side, and skip source commits that appear to exist already. That will start helping people immediately using existing history. You can also deal with uncertainty by leaving this decision to the last minute, or even leaving it up to the user. The usual patch-id detection works in a lot of cases. Let it work when it does. When it fails, check if the conflicted commit exists in a cherry-picked-from line. If it does, either do the skip then, or when we barf with the "there was a conflict; fix it up and rebase --continue" message, mention the cherry-picked-from line and let the user inspect the commits themselves and make a decision. They can always do "git rebase --skip" even now, so all we are really doing is saying "By the way, you might want the extra information that this was cherry-picked earlier". And that makes this a very low-risk change, since we are just giving the user extra information for a decision they are already making. 2. (Optional) Start adding the "Cherry picked from" message in a more machine-readable format, like an "Origin-ID: ..." header. This has already been discussed before. People were generally positive, but it didn't seem especially useful. This is a use. And obviously make the corresponding change in rebase to also parse these kinds of headers (but don't drop parsing the original format, obviously, for compatibility). 3. For people who don't want the "cherry picked from" (or "origin-id") in their commit, because they are cherry-picking from a private source, start recording "cherry picked from" in a git-note. You could even do this by default, since you are not impacting the commits themselves in any way. And then make the corresponding change in rebase to start using these notes as a source. 4. We already have some functionality to copy notes about commit A to commit B during certain operations (like rebasing and cherry-picking). Check out how these interact with the notes introduced in (3) to see if transitive stuff works (like cherry-picking A to A', and then A' to A''; you should still be able to figure out that A'' came from A). And I think at that point we have more or less the functionality you were asking for, though we arrived in several non-controversial steps. And there are lots of enhancements you could add on top, like skipping without bothering the user about it, or better heuristics for when to record an origin-id or not to. But we can do those once we see how the basic dumb part performs. I.e., how useful it is in practice, and how often it is wrong about when to skip. > - git merge --squash would store the commit ids, and the origin ids, of every > commit involved in the merge into the resulting commit's Origin field. I hadn't thought about merge --squash as a commit copying operation, but I think it is. I wonder if squash merges (or squash rebases) should also be copying notes (or if they do already, I haven't checked). > - git log could be extended to show the "parentage" (really, the aunt/uncle) > of commits with origin info, assuming those origin commits are not dangling > (which is OK, and likely to occur after the originating branch is deleted, > or if the originating branch is in another repository). If you do it with a combination of text in the commit message and git-notes, then this is all done for you. The commit message you obviously see by default, and you could explicitly ask for it to show the refs/notes/origin-id notes tree. Whew, that turned out long. I hope it's helpful. I think the problem you're trying to solve is a real one, and I think your approach is the right direction. I just think we can leverage existing git features to do most of it, and because it is sort of a heuristic, we should be conservative in how it's introduced. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html