Git terminology: remote, add, track, stage, etc.

Thore Husfeldt <thore.husfeldt@xxxxxxxxx> · Mon, 18 Oct 2010 22:45:50 +0200

I’ve just learned Git. What a wonderful system, thanks for building
it. 

And what an annoying learning experience. 

I promised myself to try to remember what made it all so hard, and to
write it down in a comprehensive and possibly even constructive
fashion. Here it is, for what it’s worth. Read it as the friendly, but
somewhat exasperated suggestions of a newcomer. I’d love to help (in
the form of submitting patches to the documentation or CLI responses),
but I’d like to test the waters first.

So, in no particular order, here are the highlights of my former
confusion, if only for your entertainment. Comments are welcome, in
particular where my suggestions are born out of ignorance.

Remote (tracking) branches
--------------------------

There are at least two uses of the word *tracking* in Git's
terminology.

The first, used in the form “git tracks a file” (in the sense that Git
knows about the file) is harmless enough, and is handled under `git
add` below.

But the real monster is the *tracking branch*, sometimes called the
remote branch, the remote-tracking branch, or the remote tracking
branch. Boy did that ever confuse me. And, reading the git mailing
list and the web, many others. There are so many things wrong with how
this simple concept is obfuscated by the documentation that I have a
hard time organising my thoughts about writing it down.

Please, *please* fix this. It was the single most confusing and
annoying part of learning Git.

First, the word, “tracking”. These branches don’t track or follow
anything. They are standing completely still. Please believe me that
when first you are led to believe that origin/master tracks a branch
on the remote (like a hound tracks its quarry, or a radar tracks a
flight) that it is very difficult to hunt this misunderstanding down:
I believed for a long time that the tracking branch stayed in sync,
automagically, with a synonymous branch at the remote. The CLI and
documentation worked very hard to keep me in that state of
ignorance. I *know* that my colleague just updated the remote
repository, yet the remote branch (or is it the remote tracking branch?
or the remote-tracking branch?) is as it always was...? (How could I
*ever* believe that? Well, *now* I get it, and have a difficult time
recollecting that misunderstanding. *Now* it’s easy.)

Second, the word “remote” as opposed to “local”, a dichotomy enforced
by both the documentation and by the output of `git branch -r` (list
all remote branches, says user-manual.txt). Things began to dawn on me
only when I understood that origin/master is certainly and absolutely
a “local” branch, in the sense that it points to a commit in my local
repository. (It differs from my other local branches mainly in how it
is updated. It’s not committed to, but fetched to. But both are local,
and the remote can be many commits ahead of me.)

So, remote tracking branches are neither remote (they are *local*
copies of how the remote once was) nor tracking (they stand completely
still until you tell them to “fetch”). So remote means local, and
tracking means still; “local still-standing” would be a less confusing
term than “remote tracking”. Lovely.

Tracking branches *track* in the sense that a stuffed Basset Hound
tracks. Namely, not. It’s a dream of what once was.
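
A minimal sketch of that realisation, for anyone who wants to watch
origin/master stand still. The repository names `upstream` and `mine`
and the throwaway identity are assumptions made up for this demo; the
Git commands themselves are ordinary ones.

```shell
#!/bin/sh
# Sketch: origin/master is a local ref that stands still until "git fetch".
# The names "upstream" and "mine" are hypothetical, chosen for this demo.
set -e
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

# Throwaway identity so the commits do not depend on any global config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# "upstream" plays the remote; force its branch name to be master.
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/master
git -C upstream commit -q --allow-empty -m "first"

# "mine" is my clone; it gets its own origin/master ref.
git clone -q upstream mine
before=$(git -C mine rev-parse origin/master)

# A colleague updates the remote repository...
git -C upstream commit -q --allow-empty -m "second"
remote_now=$(git -C upstream rev-parse master)

# ...and my origin/master has not moved at all.
still=$(git -C mine rev-parse origin/master)

# Only an explicit fetch brings it up to date.
git -C mine fetch -q origin
after=$(git -C mine rev-parse origin/master)

echo "before fetch: origin/master=$still, origin's master=$remote_now"
echo "after fetch:  origin/master=$after"
```

Before the fetch the two hashes differ; after it they agree.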

The hyphenated *remote-tracking* is a lot better terminology already
(and sometimes even used in the documentation), because at least it
doesn't pretend to be a remote branch (`git branch -r`, of course,
still does). So that single hyphen already does some good, and should
be edited for consistency. (It did take time for me to convince myself
during the learning process that “remote tracking” and
“remote-tracking” probably are the same thing, and “tracked remote”
something else, abandoning and resurrecting these hypotheses several
times.)

And *even if* the word was meaningful and consistently spelt, the
documentation uses it to *refer* to different things. Assume that we
have the branches master, origin/master, and origin’s master
(understanding that they exist, and are different, is another Aha!
moment largely prevented by the documentation). For 50 points, which
is the remote tracking branch? Or the remote-tracking branch? The
remote branch? Which branch tracks which other branch? Does master
track anything? Nobody seems to know, and documentation and CLI
include various inconsistent suggestions. (I know there have been
long, and inconclusive threads about this on the git mailing list, and
I learned a lot from seeing other people’s misconceptions mirror my
own.)  Granted, I think the term “tracked remote branch” is used with
laudable consistency to refer to a branch on the remote. And “remote
tracking branch” (with or without the hyphen) more often than not
refers to origin/master. It may be that terminology is slowly
converging. (To something confusing, but still...)
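
Whatever the three things end up being called, a fresh clone at least
lets you tell them apart mechanically. The sketch below assumes a
throwaway pair of repositories (`upstream` and `mine` are made-up
names) and only asks Git where each name lives; it does not settle
which of them deserves the word “tracking”.

```shell
#!/bin/sh
# Sketch: where master, origin/master and origin's master live in a clone.
# The names "upstream" and "mine" are hypothetical, chosen for this demo.
set -e
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/master
git -C upstream commit -q --allow-empty -m "first"
git clone -q upstream mine
cd mine

# 1. master: my own branch, which I commit to.
local_ref=$(git rev-parse --symbolic-full-name master)

# 2. origin/master: also stored in my repository, but updated by fetching.
snapshot_ref=$(git rev-parse --symbolic-full-name origin/master)

# 3. origin's master: a branch in the other repository, known to mine
#    only through this bit of configuration written by "git clone".
configured_remote=$(git config branch.master.remote)
configured_branch=$(git config branch.master.merge)

# What Git itself reports as the "upstream" of master.
upstream_of_master=$(git rev-parse --abbrev-ref 'master@{upstream}')

echo "master             -> $local_ref"
echo "origin/master      -> $snapshot_ref"
echo "origin's master    -> $configured_branch on remote '$configured_remote'"
echo "master's upstream  -> $upstream_of_master"
```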

But to appreciate how incredibly difficult this was to understand,
check this, from the Git Community book:

    A 'tracking branch' in Git is a local branch that is connected to
    a remote branch.

To a new user, who *almost* gets it, this is just a slap in the
face. Which one of these is origin/master again? None? (Or rather, it
is the confirmation one needs that nobody in the Git community cares
much, so the once-believed-to-be-carefully-worded documentation loses
some of its authority and therefore the learner can abandon some
misunderstandings.)

There probably is a radical case to be made for abandoning the word
“tracking” entirely. First, because tracking branches don’t track, and
second because “tracking” already means something else in Git (see
below). I realise that this terminology is now so ingrained in Git
users that conservatism will probably perpetuate it. But it would be
*very* helpful to think this through, and at least agree on who
“tracks” what. In the ideal world, origin/master would be something
like “the fetching branch” for the origin’s master, or the “snapshot
branch” or the “fetched branch”. (I am partial to use “fetching”
because it makes that operation a first-class conceptual citizen,
rather than pulling, which is another siren that lures newbies into a
maelstrom of confusion.)

More radically, I am sure some head scratching would be able to find
useful terminology for master, origin/master, and origin’s master. I’d
love to see suggestions. As I said, I admire how wonderfully simply
and cleanly this has been implemented, and the documentation, CLI, and
terminology should reflect that.

The staging area
----------------

The wonderful and central concept of the staging area exists under at
least three names in Git terminology. And that’s really, really
annoying. The index, the cache, and the staging area are all the same,
which is a huge revelation to a newcomer.

This problem could of course be easily fixed by making up your
mind. The decision which of the three terms to adopt is somewhat
arbitrary, but *staging area* gives the strongest and best
metaphor. It also verbs quite well, even though it is not the best,
shortest noun. *Index* would have been a good word for the files known
to Git (what is now called, sometimes, “tracked files”), and *cache*
is terrible in any case.

`git stage` is already part of the distribution. Great.

1. Search for index and cache in the documentation and rephrase any
and all their occurrences to use “staged” (or, if it can’t be avoided,
“the staging area”) instead. Say “staged to be committed” often; it’s
a strong metaphor.

2. Introduce the alias `git unstage` for `git reset HEAD` in the
standard distribution.

3. Duplicate various occurrences of `cached` flags as `staged` (and
change the documentation and man pages accordingly), so as to have,
e.g., `git diff --staged`.
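
Suggestion 2 can be tried out today as a per-repository alias. This is
only a sketch: the alias definition, the repository name `demo` and the
file `notes.txt` are assumptions for illustration, not anything in the
standard distribution. It uses `git diff --staged` next to
`git diff --cached` to show the two spellings side by side.

```shell
#!/bin/sh
# Sketch: "git unstage" as a local alias for "git reset HEAD".
# The repository "demo" and file "notes.txt" are hypothetical.
set -e
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

git init -q demo
cd demo
git config user.name demo
git config user.email demo@example.invalid

echo one > notes.txt
git add notes.txt
git commit -q -m "first"

# The alias itself; "--" keeps file names from being read as revisions.
git config alias.unstage 'reset -q HEAD --'

echo two >> notes.txt
git add notes.txt
staged_before=$(git diff --staged --name-only)   # staged: notes.txt

git unstage notes.txt
staged_after=$(git diff --cached --name-only)    # nothing staged any more
unstaged=$(git diff --name-only)                 # the edit is still there

echo "staged before unstage: $staged_before"
echo "staged after unstage:  $staged_after"
echo "changed, not staged:   $unstaged"
```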

git status
----------

One of the earliest-to-use commands is `git status`, whose messages are
*wordy*, but were initially completely unhelpful to me. In particular,

    working directory clean

Clean? What’s this now? Clean and dirty are Git slang, and not what I
want to meet as a new user. The message should inform me that the
tracked files in the working directory are identical to their last
committed versions. But there are other things wrong with the message. For
example, even though there’s nothing to commit: `nothing added to
commit but untracked files present (use "git add" to track)`? The last
parenthesis should set off warning bells already. And what did clean
mean with respect to untracked files? And “added to commit”? That
sounds like amending. We add to the index or the staging area, don’t
we, “ready to be included in the next commit,” so they aren’t added to
that commit quite yet?

    changed but not updated:

I’m still not sure what “update” was ever supposed to mean in this
sentence. I just edited the file, so it’s updated, for crying out
loud! The message might just say “Changed files, but not staged to be
committed.”  The meant-to-be helpful “use [...] to update what will be
committed” is another can of worms, and I can find at least two ways
to completely misunderstand this. Change to “use `git stage <file>` to
stage”. (With the new command name it’s almost superfluous.)

Here are some concrete suggestions:

1.

    nothing added to commit but untracked files present

should be

    nothing staged to commit, but untracked files present

(Comment: maybe “... but working directory contains untracked files.”
I realise that “directory” is not quite comprehensive here, because
files can reside in subdirectories. But I’d like to be more concrete
than “be present”.)

2.

    Untracked files:
    (use "git add <file>..." to include in what will be committed)

should be

    Untracked files:
    (use "git track <file>" to track)

3.

    Changes to be committed:
    (use "git reset HEAD <file>..." to unstage)

should be

    Staged to be committed:
    (use "git unstage <file>" to unstage)

Adding
------

The tutorial tells us that 

    Many revision control systems provide an add command that tells
    the system to start tracking changes to a new file. Git's add
    command does something simpler and more powerful: git add is used
    both for new and newly modified files, and in both cases it takes
    a snapshot of the given files and stages that content in the
    index, ready for inclusion in the next commit.

This is true, and once you grok how Git actually works it also makes
complete sense. “Making the file known to Git” (sometimes called
“tracking the file”) and “staging for the next commit” result in the
exact same operations, from Git’s perspective.

But this is a good example of what’s wrong with the way the
documentation thinks: Git’s implementation perspective should not
define how concepts are explained. In particular, *tracking* (in the
sense of making a file known to git) and *staging* are conceptually
different things. In fact, the two things remain conceptually
different later on: un-tracking (removing the file from Git’s
worldview) and un-staging are not the same thing at all, neither
conceptually nor implementationally. The opposite of staging is `git
reset HEAD <file>` and the opposite of tracking is -- well, I’m not
sure, actually. Maybe `git update-index --force-remove <filename>`?
But this only strengthens my point: tracking and staging are different
concepts, and therefore deserve different terms in the documentation
and (ideally) in the CLI.
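
To make the difference concrete, here is a sketch that un-stages a
change and then un-tracks the same file. The un-tracking step uses
`git rm --cached`, which is offered here as one candidate and an
assumption on top of the text above, not as the settled answer; the
repository `demo` and the file `a.txt` are made up.

```shell
#!/bin/sh
# Sketch: un-staging and un-tracking are different operations.
# The repository "demo" and file "a.txt" are hypothetical.
set -e
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

git init -q demo
cd demo
git config user.name demo
git config user.email demo@example.invalid

echo one > a.txt
git add a.txt
git commit -q -m "first"

# Stage an edit, then un-stage it again.
echo two >> a.txt
git add a.txt
git reset -q HEAD -- a.txt
tracked_after_unstage=$(git ls-files)    # a.txt is still known to Git

# Un-track: drop the file from the index but leave it on disk.
git rm -q --cached a.txt
tracked_after_untrack=$(git ls-files)    # Git no longer lists it
untracked=$(git ls-files --others --exclude-standard)

echo "tracked after un-staging:  $tracked_after_unstage"
echo "tracked after un-tracking: $tracked_after_untrack"
echo "untracked files now:       $untracked"
```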

The entire quoted paragraph in the tutorial can be removed: there’s
simply no reason to tell the reader that git behaves differently from
other version control systems (indeed, to take some perverse *pride*
in that fact). 

Fixing this requires no change to the implementation. `git stage` is
already a synonym for `git add`. It merely requires discipline in
using the terminology of staging. Note that it is completely valid to
tell the reader, maybe immediately and in a footnote, that `git add`
and `git stage` *are* indeed synonyms, because of Git’s elegant
model. In fact, given the amount of documentation cruft one can find
on the Internet, this would be a welcome footnote.

An even more radical suggestion (which would take all of 20 seconds to
implement) is to introduce `git track` as another alias for `git
add`. (See above under `git status`). This would be especially useful
if tracking *branches* no longer existed.
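
As a sketch of how little is needed, the alias can be defined per
repository with `git config`. The alias name `track`, the repository
`demo` and the file `new.txt` are assumptions for illustration; nothing
here exists in the standard distribution.

```shell
#!/bin/sh
# Sketch: "git track" as a local alias for "git add".
# The repository "demo" and file "new.txt" are hypothetical.
set -e
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

git init -q demo
cd demo

# The proposed alias, defined for this repository only.
git config alias.track add

echo hello > new.txt
untracked_before=$(git ls-files --others --exclude-standard)

git track new.txt
tracked=$(git ls-files)

# Because the alias is plain "git add", the new file is staged too.
staged=$(git diff --cached --name-only)

echo "untracked before: $untracked_before"
echo "tracked after:    $tracked"
echo "staged after:     $staged"
```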

There’s another issue with this, namely that “added files are
immediately staged”. In fact, I do understand why Git does that, but
conceptually it’s pure evil: one of the conceptual cornerstones of
Git -- that files can be tracked and changed yet not staged, i.e., the
staging area is conceptually a first-class citizen -- is violated
every time a new file is “born”. Newborn files are *special* until
their first commit, and that’s a shame, because the first thing the
new file (and, vicariously, the new user) experiences is an
aberration. I admit that I have not thought this through.