I've spent a month thinking about the design of git. This is the result of that work. My understand of git is that, at the lower level, it stores and communicates snapshots of a file system. At the upper level, git manipulates "edit"s: the change between two snapshots. I'm proposing: 1) Creating git commands that mimic Unix's file system commands and operate on those snapshots. 2) A language for describing git's manipulations of edits 3) Creating aliases that will allow the working tree and state stored in the index to be treated like file systems I want to thank Jeff King ("Peff") for being my sounding board for the theory of edits. 1) Snapshots 2) A Theory of Edits 3) File Systems 3.1) Merge Conflict 4) Conclusion 1) Snapshots Commits consist of: 1) A snapshot of the file system 2) Some meta-information 3) link(s) to past commit(s) I'm only concerned about #1 here. The way to make something both easy to learn and easy to remember is to imitate something the user is already familiar with. Thus my proposal is: PROPOSAL 1: git should imitate the Unix file system commands for accessing the snapshot of a commit. For these commands to work, the git command will have to include an argument that specifies which commit it operates on. So some basic ones might be: "git ls <commit> -- <path>" "git cat <commit> -- <path>" (There exists "git ls-files", "git ls-tree", and "git cat-file" but they are not quite the same.) "git find <commit> -- ..." "git grep <commit> -- <path>" (Exists) The Unix command "diff" compares two files/directories. So, the "git" version requires two commits to be specified. "git diff <commit> <commit> -- <path>" (Exists) I'd love to see something to apply a command to every file in a commit or every file found by "git find". "git xargs <commit> ..." (Is this possible?) Since snapshots are a read-only version of a file system, git can't implement the commands "rm", "mv", or "cp". 2) A Theory of Edits Snapshots in the object store are easy to understand. But to go further, we need to be able to look at the index file and the working tree. During a merge conflict, the index file and working tree go into an unusual state. To understand that state, I spent a month coming up with a mathematical theory to describe what git does. Git, at the upper level, manipulates edits. An "edit" here is a very specific term. It is the changes between two specific snapshots. I'm going to introduce some notation: if two snapshots are A and B, then the edit is written A:B. ============== WARNING: Great boredom ahead. Mathematicians and theoriticians: enjoy! Everyone else: read as much as you can and when you start to fall asleep, skip to the next subsection. ============== Edits have mathematical properties. It's easy to see that B:A is the "inverse edit" of A:B. The edit A:A is the "empty edit" for snapshot A. An edit A:B can be "split" using a snapshot C to make edits A:C and C:B or, written another way, A:C:B. Likewise, edits A:B and B:C can be "joined" to form A:C. * The inverse edit is generated by "git revert". * The empty edit can be written by "git commit --allow-empty". * Splitting an edit will be demonstrated later using "git add". * Joining an edit is done "git commit --amend" An edit has a specific start and end snapshot. If you want to do a similar change with a different starting snapshot, you need to "patch"; patch() is a function that takes an edit and a new starting state and returns the ending snapshot of a new edit. So, patch(A:B, C) may return D, where C:D is a new edit containing a change similar to A:B. I say "may return" because a patch starting at snapshot C might not exist. For example, if the edit A:B moves file "foo.txt" to "bar.txt" and snapshot C does not have a file "foo.txt" nor "bar.txt", then the patch cannot exist. [Note, there can be many definitions of a patch() function. I'm not picking one; I'm just saying one exists.] * patch() is most easily seen in "git cherry-pick" The last definition concerns reordering edits A:B and B:C. The edits are reorderable if a patch of B:C can put in front of a patch for A:B and the resulting edit still ends up at the same final snapshot C. Formally, A:B:C is "reorderable" if there exists A:D:C such that patch(B:C, A) = D and patch(A:B, D) = C. "git rebase --interactive" can reorder (and do anything else!) PROPOSAL 2: adopt a term like edit and rigorous terms like split, join, and reorder to describe the operations of git commands. We should also use exacting vocabulary to describe git commands. It's not unusual to use the word "commit" when referring to: * a snapshot (stored in the commit's tree object) * an edit (the difference between this commit's snapshot and its parent's (if it has only one parent...)) * a complete history of edits going back to the initial snapshot * the commit object itself (e.g., when tagged) While often the appropriate definition can be picked up from context, we should be precise if possible. It would be good to define a term like "snapshot tree" that refers to a tree object that is the root of a snapshot, to differentiate it from other tree objects that store subdirectories. 3) File Systems : There exist snapshots outside the object store! This statement may surprise you: The current state of the working tree is a snapshot. The working tree is a file system so its state at any one point, is a snapshot of a file system. For brevity, I'm going to call that snapshot WTREE. We can talk about the edit between any other snapshot (like ones stored in a commit) and WTREE. Usually, we'll talk about the edit from HEAD to WTREE. If the edit from HEAD to WTREE contains more than one feature and we want to package each feature into its own edit, we need to split the HEAD to WTREE edit. And, to split an edit, we need another snapshot... Another surprising statement: A snapshot can be computed from the index file and HEAD (when not in merge-conflict state). My validation for this statement is that at any point, we can type "commit --allow-empty" and a new commit will be written to the current branch. That commit contains a snapshot generated from the index file and HEAD. Since the snapshot computed from the index file and HEAD will become the next commit written, I'll refer to it as NEXT. I want to be clear that NEXT is not the index file. NEXT is a snapshot. The index file is a file. The index file (with HEAD) is just one way to store a snapshot. Since we can modify the files that will go in the next commit, NEXT is actually a file system like WTREE. Although the man page for "git add" says it "add[s] file contents to the index", I think a better way to say it is that it copies the files into the NEXT file system. To recap: a common operation in git is to split the HEAD to WTREE edit by "git add"ing files to the NEXT file system and then using "git commit" to write a snapshot of NEXT into the object store, making the edit permanent. (You may want to reread that sentence a few times until it becomes clear.) Now, the concepts of WTREE and NEXT work most of the time. However, when there is a merge conflict, the index file takes on a special state. This is why I developed a theory around edits: I need it to describe what happens then and why. 4) Merge Conflict I'm going to use "git cherry-pick" for my example. It involves merging a single edit, so it's the easiest case. A cherry-pick is almost a direct application of patch(). We have an edit A:B and we want to move it onto snapshot C. But we said earlier, the result of a patch() function may or may not exist. If patch(A:B, C) exists and equals D, then git just writes the snapshot D as the new commit. But what if patch(A:B, C) does not exist? Git does something amazing: it splits A:B! So, we'll introduce a new state S to get A:S:B. Now, the first edit, A:S, contains all the parts of A:B that can be patch()ed onto state C, and the second edit, S:B, contains all the parts of A:B that cannot be patch()ed onto state C. Obviously, patch(A:S, C) exists and the resulting snapshot is copied into NEXT. But what happens to the unpatchable part in edit S:B? We don't want this change thrown away - it could be important. We want it presented to the user and let the user fix or dismiss it. If we had a GUI, the window's border might turn red and tabs for each affected file might open, but a command-line interface doesn't have that. So, git writes something reflecting the unpatachable part into files in the working tree and marks the files as "needs review" in the index file. (It also caches some files SHAs in the index, but we can ignore that.) Now that we know what happens during a conflicted merge, the question is: do there exist any snapshots here? We defined NEXT using the snapshot that would go in the next commit. But if you run "git commit --allow-empty" during a merge conflict, you get an error! We said the current working tree state was a snapshot, but git just wrote "<<<<""===="">>>>" into the files - if they did obey any syntax before that, they certainly don't now! The answer is unclear. My opinion is that there's little harm in viewing the result of patch(A:S,C) as NEXT. (This would be "the snapshot generated using stage 0 of the index file" to some.) I also think that, during a merge conflict, the working tree is in an unsyntactic, unnatural state. While I think there is value to always treating the working tree as a file system, I can understand with those who might argue that git should treat it differently during a merge conflict. PROPOSAL 3: Add aliases NEXT and WTREE that work in place of a snapshot in any commands. e.g., "git diff HEAD NEXT" e.g., "git ls NEXT etc/" During a conflicted merge state, we _may_ want commands to treat WTREE differently. ADDITION TO PROPOSAL 2: Since NEXT and WTREE are writeable file systems, the Unix filesystem commands that write should be implemented as part of git to work with them. "git cp <snapshot> <writeable_filesys> -- <src_path> <dest_file>" "git mv <snapshot> <writeable_filesys> -- <src_path> <dest_file>" "git rm <writeable_filesys> -- <file>" I believe "git cp" would be similar to the proposed "git put". The current "git mv" and "git rm" does operation on both NEXT and WTREE by default. (Which I think is a sensible default in those cases.) We may want to consider "mkdir", "rmdir", "chmod". 4) Conclusion I've proposed that git give snapshots the same interface that a Unix command line would provide to a read-only file system. I've presented a mathematical language for defining edits and proposed using it and clearer words for describing git command's operations. I've proposed creating the aliases WTREE and NEXT and allowing them to be used anywhere a snapshot is used and in Unix commands that operate on a writeable filesystem. Mike Nahas Appendix: Features and Edits I said above "If the edit from HEAD to WTREE contains more than one feature and we want to package each feature into its own edit, ...". I believe this concept of one-feature=one-edit is at the heart of good git usage. We want to manipulate features - add a feature to a branch, remove a feature, etc. - but we can only manipulate edits. So, as long as each feature is in its own edit, we can easily manipulate it. Unfortunately, features and edits are not the same. We can merge two edits, but that doesn't mean the result has both features. For example, consider a project that branches and one branch's feature is a new Makefile target and the edit explicitly lists the source files. The other branch's feature is support for a new protocol and that edit adds a new source file. The merge of these two branches may succeed and contain both edits, but it doesn't have both features. The git glossary calls a merge "evil" if the commit contains a change that is not present in either parent. I say that's a bad definition. In the example above, I think it's a good thing to edit the merge so that the commit contains both features. This is why I think this theory of edits has value: we want to manipulate features and if we have one-feature=one-edit and we know how to manipulate edits, then we can manipulate features. I don't think it is a new concept; I think it has been implied any number of places; I just hope with clear terms to describe manipulation of edits like split/join/inverse/patch/reorder that we have clearer description of what we do. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html