On Mon, Feb 02, 2009 at 07:48:53PM +0100, Jakub Narebski wrote:
> In my opinion the most important issue is concentrating on "container
> identity" instead of on the underlying issue of renames in version
> control, which includes intelligent, rename-aware merge; talk about
> issues and not about possible solution. I will concentrate on this
> issue for now, and leave for example issue of workflows, and of VCS
> history for possible later posts (it is long enough as is).

This was discussed in no small amount of detail on the mailing list uvc-reviewers, which used to be hosted here:

http://thyrsus.com/mailman/listinfo/uvc-reviewers

Unfortunately, it looks like Eric has taken down his mailman instance on thyrsus.com. I have personal archives of the list, and the list used to have public archives, so I don't feel any hesitation sharing it with interested parties.

> Below you can find my comments; quoted fragments of "Understanding
> Version Control' essay are prefixed with 'UVC> '. 'TODO' refers to
> http://www.catb.org/esr/writings/version-control/TODO.html
>
> Please do participate in this discussion, especially if you have
> something to say with respect to rename detection versus rename
> tracking issue. Thanks in advance.

Heh. A lot of this has been said already. I think one of the reasons why Eric kept things short in his paper, and did *not* say a lot about whether container identity tracking was fundamentally needed, was that we didn't come to any real consensus on the uvc-reviewers mailing list. I believe it is extremely difficult to do so, given that it's very hard to avoid the slippery slope of advocating for one SCM system versus another.

I'll include some of my writings on the subject from the uvc-reviewers mailing list so folks can see where some of this discussion went last time... (All of this dates from January, 2008, when Eric was last aggressively updating the paper in question.)
BTW, when I referred to SCM's being a horrible hack, "guessing", and "fit only to be used by amateurs" if they didn't do function-level identity tracking, I was echoing people who were seriously arguing that any SCM (e.g., git) that didn't track container identity was fundamentally a "hack". Yes, there are people who seriously take that view, some of whom were very bitter that their DSCM didn't win the market/popularity wars and that their pet projects were overtaken by SCM's such as git, describing $THEIR_PET_SCM_WITH_PROVABLY_CORRECT_SEMANTICS as Betamax, and git as VHS. The argument that without rename tracking, if git were used to develop software for an Air Traffic Control application, airplanes could be dropping out of the sky was also made by these advocates, no kidding. (So was the argument that using a DSCM that didn't do container identity tracking might be considered Programming Malpractice.)

So be careful about wanting to reopen this discussion; if some of the wrong people join in, you may be very sorry! :-)

						- Ted

> Here then are some types of identity
> tracking one might imagine:
>
> * File identity tracking: tracks the identity of a file through
>   renames and moves.
> * Simple file content tracking: tracks the identity of content
>   using adds and deletes within a single file. (Note, there is a
>   question that could be asked here about the resolution of the
>   tracking. Most current systems that track do so on a line-by-line
>   basis, but one could imagine tracking bytes. I wont say any more
>   about this in this email.)
> * Movement of content within a file: tracks the identity of
>   content within a file when lines are moved.
> * Movement of content between files: tracks the identity of
>   content when lines are moved between files.
One obvious one which isn't in this list is "Directory Identity Tracking"; that is, when you move a directory, new files which are created in one branch at the original directory location will be moved when you merge with another branch where the directory has been moved.

In private conversations with Tom Lord, he tells me that he had also played with the concept of "Function/variable (more generally, programming structure identifier) identity tracking". That is, suppose you had an editor like Eclipse which has as a primitive, "Rename Java identifier (class/method function/variable)", and this information was passed into the SCM so it could be tracked. Then in one branch, a Java identifier could be renamed, and then in another branch, the use of that same Java identifier could be added in 20 different places --- and since the SCM knows, at a deep semantic level, that a rename had taken place in Branch A, when it is merged with Branch B, it could DTRT and change the newly added uses of the renamed identifier to its new name.

And as with Directory Identity Tracking, it's not hard to come up with scenarios where, without this level of tracking, something horribly wrong could take place as a result of the SCM not tracking function identities and using them when doing merges. At the very least, the program would fail to compile, and if the example involved an Air Traffic Control system and multiple function renames taking place, you could even come up with a contrived horror scenario where planes would be falling out of the sky --- that is, if you ignore regression testing, and simple coding practices that would prevent something like this from happening.
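To make the directory case concrete, here is a minimal sketch in Python of what a merge could do with a *recorded* directory rename. It is only an illustration under stated assumptions: the function name `apply_directory_renames`, the path lists, and the `{old_dir: new_dir}` mapping are all hypothetical, and this is not how any particular SCM implements it.

```python
def apply_directory_renames(new_paths, dir_renames):
    """Relocate files added on one branch according to directory
    renames recorded on the other branch.

    new_paths   -- paths of files newly created on branch A
    dir_renames -- {old_dir: new_dir} renames recorded on branch B
                   (hypothetical metadata; an SCM that only records
                   content would not have this mapping available)
    """
    merged = []
    for path in new_paths:
        for old_dir, new_dir in dir_renames.items():
            prefix = old_dir.rstrip("/") + "/"
            if path.startswith(prefix):
                # The file was created in a directory that the other
                # branch moved, so it follows the directory.
                path = new_dir.rstrip("/") + "/" + path[len(prefix):]
                break
        merged.append(path)
    return merged


# Branch A adds src/new.c; branch B renamed src/ to lib/.
print(apply_directory_renames(["src/new.c", "doc/README"], {"src": "lib"}))
```

Without the recorded mapping, the merge would leave `src/new.c` where branch A created it, which is the behaviour the horror scenarios are built on.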
Of course, the flip argument by people who are trying to promote their brand-spanking new SCM that did function identity tracking (FIT) is that FIT is critical, since SCM's are all about ACCOUNTING, and without FIT, systems that try to do merges are just GUESSING, and *obviously* a system which did FIT is far superior to a SCM that didn't; in fact, a SCM that didn't do FIT is just a Horrible Hack Done By Amateurs. Furthermore, using an SCM without this feature would (according to promoters of this hypothetical new SCM) be Programming Malpractice.

And if this sounds silly, I'm just repeating the exact same arguments that proponents of systems like arch, Bk, et al., which store the user intention information of file and directory renames, have recently advanced against git since it doesn't store this sort of information. (It may reconstruct rename information in a lazy fashion when it is needed, but it doesn't store it.) But if file and directory renames are a type of user intention which MUST be stored in order for an SCM not to be a hack, why not function, variable, and class renames? That too would be another type of user intention.

						- Ted

---------------------------

> The second could be called "location". Which file should this patch
> be applied to? Which lines within a file should this hunk be applied
> to? I argue in [2] that Darcs does strictly better at the task of
> location than do SVN or GNU diff3. (I think that SCCS, BK and CDV do
> as well, but I don't understand them well enough to be sure.) I
> argue that Darcs does strictly better in the sense that its answers
> to the location question are often better and never worse, and that
> it does so *not* by having a more sophisticated heuristic or by
> getting lucky more often, but by a simple, provably-correct algorithm
> which uses valuable information that other algorithms overlook.

How are you defining "provably correct"? In order to show correctness, you need to define what correctness means.
One approach is that you force the user to tell you --- and if you are in the middle of applying a series of 500 patches, you throw up the Annoying GUI Dialog Box which stops the application of patches dead in its tracks, and force the user to confirm whether this is a rename, or a delete followed by an add of remarkably similar content. Or, if a patch removes all the files from one directory and creates them all in another directory, you ask the user to confirm that the intention was a directory rename, and the SCM records it as such.

Here, you are *assuming* that what the user tells you is correct, and that's part of the lemma you use for proving correctness. If the user, who is seriously annoyed at the popup boxes, says, "Yeah, yeah, yeah", and dismisses the dialog box without changing the defaults (which were selected via a heuristic and which were wrong), well, it's not the SCM's fault since the user told it what it wanted, and the user was wrong. GIGO.

In your case, you're saying that Darcs is using "valuable information" that other algorithms overlook. OK, so the Darcs people were more clever about designing a heuristic which tries to approximate user intention, and having designed the heuristic which uses said "valuable information", you can prove whether or not Darcs' algorithm correctly implemented said heuristic. But at the end of the day, it's still a heuristic, and the use of the words "provably correct" is just a semantic trick.

Even svn's lack of directory rename support could be considered "provably correct", if the definition of "correct" is an algorithm which deterministically creates new files in their original location, even if all other files in that directory were deleted and new files with the same name and same content were created somewhere else.
It's still an algorithm, and you can prove whether or not it meets its design specs, but to the extent that it is less likely to approximate user intentions, people would say that svn might be less useful in such cases than some other SCM.

> Let's be careful not to lump these three things together, say "All
> merging involves guessing.", and thus overlook the interesting fact
> that some merge algorithms involve strictly more guessing than do
> others.

Part of the problem is that words like "guessing" and claims of some algorithm being "provably correct" are basically marketing words. They are generally used to denigrate one SCM, and promote someone's favorite SCM as being **better**.

Fundamentally, the goal of merging is to Do The Right Thing --- from a semantic point of view, which means that the user's intentions are what's important at the end of the day. The question then is whether you record the user's intentions, or try to determine them from a heuristics point of view.

The people who claim that recording the user's intentions is superior will claim that you can never know for sure what the user meant, so you have to ask him or her to provide that information. In some cases that's relatively easy; you require the user to use commands like "bk mv" and "bk cp" and "bk rm" which not only perform the function, but also record the user's intention. Unfortunately, if you are applying a patch, and the patch file hasn't been enhanced to carry this kind of information, you have to use heuristics and then somehow get the user to confirm them --- hence the use of the Annoying Popup Dialog Box. In other cases, you can't determine it easily, such as the "rename a Java method function" case, unless you have a specialized editor which has this as a primitive, or, alternatively, even more Annoying Dialog Boxes that pop up as you try to commit a change.
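For readers who want to see what "determine it from a heuristics point of view" can look like, here is a minimal sketch in Python: pair each deleted file with the most similar added file and call it a rename if the similarity clears a threshold. The function name `detect_renames`, the dictionary inputs, and the 0.5 threshold are assumptions made for illustration; this is not the algorithm of git or of any other particular SCM.

```python
from difflib import SequenceMatcher


def detect_renames(deleted, added, threshold=0.5):
    """Guess renames from content alone.

    deleted, added -- {path: file content} for files that disappeared
                      and appeared between two trees
    threshold      -- minimum similarity ratio (an arbitrary choice
                      for this sketch) to treat a pair as a rename
    Returns a list of (old_path, new_path, score) tuples.
    """
    renames = []
    unmatched = dict(added)  # added files not yet claimed by a rename
    for old_path, old_text in deleted.items():
        best_path, best_score = None, 0.0
        for new_path, new_text in unmatched.items():
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score > best_score:
                best_path, best_score = new_path, score
        if best_path is not None and best_score >= threshold:
            renames.append((old_path, best_path, best_score))
            del unmatched[best_path]
    return renames


source = "int main(void)\n{\n    return 0;\n}\n"
print(detect_renames({"old.c": source},
                     {"new.c": source, "notes.txt": "unrelated text\n"}))
```

Nothing is recorded at commit time here; the guess is recomputed whenever it is needed, so a better heuristic can be swapped in later without touching the stored history.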
Once you have recorded this information, using it in the merge is relatively easy --- or if not easy, at least relatively easy to specify and then show whether or not the information was used correctly.

What then tends to happen is that people whose SCM does one kind of user intention recording (such as file renaming) will use this as a huge club to say their system is better than another system, and that a system which doesn't record this information is "Guessing". They will also say that their system is "Provably Correct". So both of these terms are really Red Meat Marketing words, which get used when people try to say that Their SCM Is Superior.

On the flip side, the people who don't do any recording of this information will point out that trying to record the information is hard, especially when changesets go through a lossy medium, such as patch files which are e-mailed around, which doesn't record this kind of user intention. This may or may not be applicable for a particular project, but for some projects, it is extremely important, especially for those where e-mail is the primary communication channel and how patches get reviewed and passed around.

The other point which people on the "we don't record user intentions; we just record content" side tend to make is that you can always add more heuristics later, but in practice, if you didn't record the user intention when you made the commit, it's almost impossible to add it later. So for example, suppose git, which doesn't have function rename support today, has a new merge engine added later which works specifically for Java files and correctly intuits and deals with method function renames. (Maybe some crazy Java programming methodology does this all the time, so people get motivated to write such a thing.)
A SCM which works on recorded user intentions will need to add that support in a way which doesn't break backwards compatibility of its distributed repository, *OR* accept the fact that for function renaming, it will have to use heuristics that are run at merge time.

So this seems to be fundamentally a tradeoff. The two *objective* things that can be said are how many user intentions are recorded at commit time (file rename? directory rename? function/variable rename?), and what sort of information is used at merge time via heuristics to determine user intention. And if you want to call those heuristics an "algorithm" because it sounds more mathy and provable, sure, whatever. The fact is, the algorithm is still an approximation of user intent, with the goal of making the merge do the right thing from a Semantic Point of View.

* * *

From a project point of view, how often you actually *do* merges is an interesting question. If merges that require complicated user intention tracking don't happen very often, maybe focusing on this issue to the extent that SCM geeks tend to do isn't very productive.

In my opinion, the overall usability of the system is *far* more important. In the OSS world, every project is competing for programmers, and the easier the tool is to use, the more likely it will be that you can get people to contribute to your project. Even if they aren't bright enough to work on the core algorithms for your project, they could at least improve the documentation. That was what drove my decision to switch to Mercurial back in 2005. Last year, I moved my projects to git because of various features that worked well with my development workflow and which generally improved programmer productivity.
Whether or not file or directory renames were being tracked in the SCM had *no* bearing on my decision to switch, because merges happen rarely, and I have an extensive regression test suite which I run after almost every commit, and *definitely* after every merge. So if a merge doesn't do the right thing, that's OK; I'll fix it up, and use "git commit --amend" to correct the merge commit.

And yet, people seem to focus on recording of user intention because it reflects some holy grail of Perfection and Correctness. And maybe, because it is easily measurable, whereas usability and improving programmer productivity are inherently more subjective measures.

What's very sad are the people who feel profoundly hurt that they spent a huge amount of their life working on SCM Correctness, only to find that people chose other SCM's based on metrics and issues other than the one that they felt was most important. Unfortunately there's not a lot that can be done about that.

						- Ted