File Systems and a Theory of Edits

Michael Nahas <mike.nahas@xxxxxxxxx> · Sat, 30 Jul 2011 10:29:36 -0400

I've spent a month thinking about the design of git.  This is the
result of that work.

My understand of git is that, at the lower level, it stores and
communicates snapshots of a file system.  At the upper level, git
manipulates "edit"s: the change between two snapshots.

I'm proposing:
    1) Creating git commands that mimic Unix's file system commands
       and operate on those snapshots.
    2) A language for describing git's manipulations of edits
    3) Creating aliases that will allow the working tree and state
       stored in the index to be treated like file systems

I want to thank Jeff King ("Peff") for being my sounding board for the
theory of edits.

1) Snapshots
2) A Theory of Edits
3) File Systems
3.1) Merge Conflict
4) Conclusion

1) Snapshots

Commits consist of:
    1) A snapshot of the file system
    2) Some meta-information
    3) link(s) to past commit(s)

I'm only concerned about #1 here.

The way to make something both easy to learn and easy to remember is
to imitate something the user is already familiar with.  Thus my
proposal is:

PROPOSAL 1: git should imitate the Unix file system commands for
accessing the snapshot of a commit.

For these commands to work, the git command will have to include an
argument that specifies which commit it operates on.  So some basic
ones might be:
    "git ls <commit> -- <path>"
    "git cat <commit> -- <path>"
(There exists "git ls-files", "git ls-tree", and "git cat-file" but
they are not quite the same.)
    "git find <commit> -- ..."
    "git grep <commit> -- <path>" (Exists)
The Unix command "diff" compares two files/directories.  So, the "git"
version requires two commits to be specified.
    "git diff <commit> <commit> -- <path>"   (Exists)
I'd love to see something to apply a command to every file in a commit
or every file found by "git find".
    "git xargs <commit> ..."  (Is this possible?)
Since snapshots are a read-only version of a file system, git can't
implement the commands "rm", "mv", or "cp".

2) A Theory of Edits

Snapshots in the object store are easy to understand.  But to go
further, we need to be able to look at the index file and the working
tree.  During a merge conflict, the index file and working tree go
into an unusual state.  To understand that state, I spent a month
coming up with a mathematical theory to describe what git does.

Git, at the upper level, manipulates edits.  An "edit" here is a very
specific term.  It is the changes between two specific snapshots.  I'm
going to introduce some notation: if two snapshots are A and B, then
the edit is written A:B.

==============
WARNING: Great boredom ahead.  Mathematicians and theoriticians:
enjoy!  Everyone else: read as much as you can and when you start to
fall asleep, skip to the next subsection.
==============

Edits have mathematical properties.  It's easy to see that B:A is the
"inverse edit" of A:B.  The edit A:A is the "empty edit" for snapshot
A.  An edit A:B can be "split" using a snapshot C to make edits
A:C and C:B or, written another way, A:C:B.  Likewise, edits A:B and
B:C can be "joined" to form A:C.

    * The inverse edit is generated by "git revert".
    * The empty edit can be written by "git commit --allow-empty".
    * Splitting an edit will be demonstrated later using "git add".
    * Joining an edit is done "git commit --amend"

An edit has a specific start and end snapshot.  If you want to do a
similar change with a different starting snapshot, you need to
"patch"; patch() is a function that takes an edit and a new starting
state and returns the ending snapshot of a new edit.  So, patch(A:B,
C) may return D, where C:D is a new edit containing a change similar
to A:B.  I say "may return" because a patch starting at snapshot C
might not exist.  For example, if the edit A:B moves file "foo.txt" to
"bar.txt" and snapshot C does not have a file "foo.txt" nor "bar.txt",
then the patch cannot exist.  [Note, there can be many definitions of
a patch() function. I'm not picking one; I'm just saying one exists.]

    * patch() is most easily seen in "git cherry-pick"

The last definition concerns reordering edits A:B and B:C.  The edits
are reorderable if a patch of B:C can put in front of a patch for A:B
and the resulting edit still ends up at the same final snapshot
C. Formally, A:B:C is "reorderable" if there exists A:D:C such that
patch(B:C, A) = D and patch(A:B, D) = C.

    "git rebase --interactive" can reorder (and do anything else!)

PROPOSAL 2: adopt a term like edit and rigorous terms
like split, join, and reorder to describe the operations of git
commands.
We should also use exacting vocabulary to describe git commands.  It's
not unusual to use the word "commit" when referring to:
    * a snapshot  (stored in the commit's tree object)
    * an edit   (the difference between this commit's snapshot and its
                   parent's (if it has only one parent...))
    * a complete history of edits going back to the initial snapshot
    * the commit object itself (e.g., when tagged)
While often the appropriate definition can be picked up from context,
we should be precise if possible.
It would be good to define a term like "snapshot tree" that refers to
a tree object that is the root of a snapshot, to differentiate it from
other tree objects that store subdirectories.

3) File Systems : There exist snapshots outside the object store!

This statement may surprise you: The current state of the working tree
is a snapshot.  The working tree is a file system so its state at any
one point, is a snapshot of a file system.  For brevity, I'm going to
call that snapshot WTREE.  We can talk about the edit between any
other snapshot (like ones stored in a commit) and WTREE.  Usually,
we'll talk about the edit from HEAD to WTREE.

If the edit from HEAD to WTREE contains more than one feature and we
want to package each feature into its own edit, we need to split the
HEAD to WTREE edit.  And, to split an edit, we need another
snapshot...

Another surprising statement: A snapshot can be computed from the
index file and HEAD (when not in merge-conflict state).  My validation
for this statement is that at any point, we can type "commit
--allow-empty" and a new commit will be written to the current branch.
That commit contains a snapshot generated from the index file and
HEAD.  Since the snapshot computed from the index file and HEAD will
become the next commit written, I'll refer to it as NEXT.

I want to be clear that NEXT is not the index file.  NEXT is a
snapshot.  The index file is a file.  The index file (with HEAD) is
just one way to store a snapshot.  Since we can modify the files that
will go in the next commit, NEXT is actually a file system like WTREE.
Although the man page for "git add" says it "add[s] file contents to
the index", I think a better way to say it is that it copies the files
into the NEXT file system.

To recap: a common operation in git is to split the HEAD to WTREE edit
by "git add"ing files to the NEXT file system and then using "git
commit" to write a snapshot of NEXT into the object store, making the
edit permanent.  (You may want to reread that sentence a few times
until it becomes clear.)

Now, the concepts of WTREE and NEXT work most of the time.  However,
when there is a merge conflict, the index file takes on a special
state.  This is why I developed a theory around edits: I need it to
describe what happens then and why.

4) Merge Conflict

I'm going to use "git cherry-pick" for my example.  It involves
merging a single edit, so it's the easiest case.

A cherry-pick is almost a direct application of patch().  We have an
edit A:B and we want to move it onto snapshot C.  But we said earlier,
the result of a patch() function may or may not exist.

If patch(A:B, C) exists and equals D, then git just writes the
snapshot D as the new commit.

But what if patch(A:B, C) does not exist?  Git does something amazing:
it splits A:B!  So, we'll introduce a new state S to get A:S:B.  Now,
the first edit, A:S, contains all the parts of A:B that can be
patch()ed onto state C, and the second edit, S:B, contains all the
parts of A:B that cannot be patch()ed onto state C.  Obviously,
patch(A:S, C) exists and the resulting snapshot is copied into NEXT.

But what happens to the unpatchable part in edit S:B?  We don't want
this change thrown away - it could be important.  We want it presented
to the user and let the user fix or dismiss it.  If we had a GUI, the
window's border might turn red and tabs for each affected file might
open, but a command-line interface doesn't have that.  So, git writes
something reflecting the unpatachable part into files in the working
tree and marks the files as "needs review" in the index file.  (It
also caches some files SHAs in the index, but we can ignore that.)

Now that we know what happens during a conflicted merge, the question
is: do there exist any snapshots here?  We defined NEXT using the
snapshot that would go in the next commit.  But if you run "git commit
--allow-empty" during a merge conflict, you get an error!  We said the
current working tree state was a snapshot, but git just wrote
"<<<<""===="">>>>" into the files - if they did obey any syntax before
that, they certainly don't now!

The answer is unclear.  My opinion is that there's little harm in
viewing the result of patch(A:S,C) as NEXT.  (This would be "the
snapshot generated using stage 0 of the index file" to some.)  I also
think that, during a merge conflict, the working tree is in an
unsyntactic, unnatural state.  While I think there is value to always
treating the working tree as a file system, I can understand with
those who might argue that git should treat it differently during a
merge conflict.

PROPOSAL 3: Add aliases NEXT and WTREE that work in place of a
snapshot in any commands.
     e.g., "git diff HEAD NEXT"
     e.g., "git ls NEXT etc/"
During a conflicted merge state, we _may_ want commands to treat WTREE
differently.

ADDITION TO PROPOSAL 2: Since NEXT and WTREE are writeable file
systems, the Unix filesystem commands that write should be implemented
as part of git to work with them.
    "git cp <snapshot> <writeable_filesys> -- <src_path> <dest_file>"
    "git mv <snapshot> <writeable_filesys> -- <src_path> <dest_file>"
    "git rm <writeable_filesys> -- <file>"
I believe "git cp" would be similar to the proposed "git put".  The current
"git mv" and "git rm" does operation on both NEXT and WTREE by default.
(Which I think is a sensible default in those cases.)
We may want to consider "mkdir", "rmdir", "chmod".

4) Conclusion

I've proposed that git give snapshots the same interface that a Unix
command line would provide to a read-only file system.

I've presented a mathematical language for defining edits and
proposed using it and clearer words for describing git command's
operations.

I've proposed creating the aliases WTREE and NEXT and allowing them to
be used anywhere a snapshot is used and in Unix commands that operate
on a writeable filesystem.

Mike Nahas

Appendix: Features and Edits

I said above "If the edit from HEAD to WTREE contains more than one
feature and we want to package each feature into its own edit, ...".
I believe this concept of one-feature=one-edit is at the heart of good
git usage.  We want to manipulate features - add a feature to a
branch, remove a feature, etc. - but we can only manipulate edits.
So, as long as each feature is in its own edit, we can easily manipulate
it.

Unfortunately, features and edits are not the same.  We can merge two
edits, but that doesn't mean the result has both features.  For
example, consider a project that branches and one branch's feature is
a new Makefile target and the edit explicitly lists the source files.
The other branch's feature is support for a new protocol and that edit
adds a new source file.  The merge of these two branches may succeed
and contain both edits, but it doesn't have both features.

The git glossary calls a merge "evil" if the commit contains a change
that is not present in either parent.  I say that's a bad
definition.  In the example above, I think it's a good thing to edit
the merge so that the commit contains both features.

This is why I think this theory of edits has value: we want to
manipulate features and if we have one-feature=one-edit and we know
how to manipulate edits, then we can manipulate features.  I don't
think it is a new concept; I think it has been implied any number of
places; I just hope with clear terms to describe manipulation of edits
like split/join/inverse/patch/reorder that we have clearer description
of what we do.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html