Re: Git terminology: remote, add, track, stage, etc.

Drew Northup <drew.northup@xxxxxxxxx> · Tue, 19 Oct 2010 17:53:34 -0400

On Mon, 2010-10-18 at 22:45 +0200, Thore Husfeldt wrote:
> I've just learned Git. What a wonderful system, thanks for building
> it. 
> 
> And what an annoying learning experience. 

I have to admit having dealt with quite a few annoyances, but mostly
because I'm attempting to implement new functionality into this
project--something that requires minimally grokking the sections of code
that I propose to change. (Anything less is sheer idiocy--and this is a
crowd that would not hesitate to say so.)

> I promised myself to try to remember what made it all so hard, and to
> write it down in a comprehensive and possibly even constructive
> fashion. Here it is, for what it's worth. Read it as the friendly, but
> somewhat exasperated suggestions of a newcomer. I'd love to help (in
> the form of submitting patches to the documentation or CLI responses),
> but I'd like to test the waters first.
> 
> So, in no particular order, here are the highlights of my former
> confusion, if only for your entertainment. Comments are welcome, in
> particular where my suggestions are born out of ignorance.

As a user and survivor of other source code management, collaboration,
and versioning systems (not to mention some OS/X Windows/uC/CPU innards)
I did not find git nearly as jolting as you appear to have. I suspect
this will be clear in following discussions.

> Remote (tracking) branches
> --------------------------
> 
> There are at least two uses of the word *tracking* in Git's
> terminology.
> 
> The first, used in the form "git tracks a file" (in the sense that Git
> knows about the file) is harmless enough, and is handled under `git
> add` below.
> 
> But the real monster is the *tracking branch*, sometimes called the
> remote branch, the remote-tracking branch, or the remote tracking
> branch. Boy did that ever confuse me. And, reading the git mailing
> list and the web, many others. There are so many things wrong with how
> this simple concept is obfuscated by the documentation that I have a
> hard time organising my thoughts about writing it down.
> 
> Please, *please* fix this. It was the single most confusing and
> annoying part of learning Git.
> 
> First, the word, "tracking". These branches don't track or follow
> anything. They are standing completely still. Please believe me that
> when first you are led to believe that origin/master tracks a branch
> on the remote (like a hound tracks its quarry, or a radar tracks a
> flight) that it is very difficult to hunt this misunderstanding down:
> I believed for a long time that the tracking branch stayed in sync,
> automagically, with a synonymous branch at the remote. The CLI and
> documentation worked very hard to keep me in that state of
> ignorance. I *know* that my colleague just updated the remote
> repository, yet the remote branch (or is the remote tracking branch?
> or the remote-tracking branch?) is as it always was...? (How could I
> *ever* believe that? Well, *now* I get it, and have a difficult time
> recollecting that misunderstanding. *Now* it's easy.)

Two things here: 
(1) The meaning of "tracking" in context differs from your abstract
notion of tracking. Perhaps the metaphor is better seen as the
"tracking" of wild game animals--you follow their footprints, grazing
marks, paths, and scat (viewed with gitk ? ;-) ) yet only once in a
while can you reach out and "pull" on them (at least one doesn't have to
shoot things to make git work--although it might help on some days).

(2) Hyphenation should, in theory, be consistent. I do not however
expect this to be completely automatic in a population of code authors
which contains a great many non-native speakers of English
(conceivably some people here may even read and write English without
feeling competent to speak it out loud). Due to this reality I usually
keep my inner "Grammar Nazi" in check with free/open software projects.
This does not make the documentation any easier to read, it just keeps
me from breaking things while I do. 
As an aside, whose "English" are we to deem nominally correct--ASE, BSE,
ISE, or perhaps so-called "Singapore English" (which is actually spoken
other places as well)? Just because American Standard English currently
has (eroding) hegemony in the software documentation sphere (from my
perspective) does not grant me some special platform to shove it down
the throats of others. Thankfully clarification of hyphenation by
introducing consistency is likely to be universally received in a
positive light.

> Second, the word "remote" as opposed to "local", a dichotomy enforced
> by both the documentation and by the output of `git branch -r` (list
> all remote branches, says user-manual.txt). Things began to dawn on me
> only when I understood that origin/master is certainly and absolutely
> a "local" branch, in the sense that it points to a commit in my local
> repository. (It differs from my other local branches mainly in how it
> is updated. It's not committed to, but fetched to. But both are local,
> and the remote can be many commits ahead of me.)
> 
> So, remote tracking branches are not remote (they are *local*
> copies of how the remote once was) and they stand completely still
> until you tell them to "fetch". So remote means local, and tracking
> means still; "local still-standing" would be a less confusing term
> than "remote tracking". Lovely.

This is one area in which excessive steeping in the handling, care,
feeding, and management of other SCM systems would help understand what
git is up to.

"origin/master" is, semantically, a locally stashed "copy" of the
origin's master, kept in your local object store and not intended to be
worked on directly (you create a local branch for that). This apparent
copy is perhaps better thought of as a pointer into the object store
that recalls the locally stored state of "origin/master" (origin's
master as your local object store remembers it). It can be used to
follow changes made to the remote's (origin's) master, but one is not
forced to do so. Oftentimes it is mirrored in (more correctly, merged
into) the local branch named "master", which you most likely created by
cloning. It can therefore be confusing, at first, to realize that
"origin/master" and "master" are not the same thing even though they
may well appear to be. Once you take the pointer for what it is, this
makes a bit more sense. (This is why you can "git checkout
origin/master" if you want...)
Taking this metaphor one step further, we DE-REFERENCE and merge
"origin/master" into "master" (or whatever you named the local branch
that follows it) as the last bulk operation when executing a "pull" on
master. This is explained fairly clearly (in my mind) in the
documentation as a convenience melding of two steps: a "git fetch", in
which origin's master (remote) is copied into the local object store as
"origin/master", and a "git merge origin/master" executed while the
local branch that follows it (you probably named it master) is checked
out.
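
A rough sketch of that fetch-then-merge sequence, for anyone who wants to
watch origin/master stand still until asked. The directory names
"origin" and "work", the commit() helper, and the demo identity are made
up for illustration; nothing here comes from the thread itself.

```shell
set -e
cd "$(mktemp -d)"

# Helper for this demo only: one empty commit in repository $1 with message $2.
commit() {
    git -C "$1" -c user.name=demo -c user.email=demo@example.invalid \
        commit -q --allow-empty -m "$2"
}

git init -q origin                                  # stands in for the remote
git -C origin symbolic-ref HEAD refs/heads/master   # pin the branch name
commit origin "first"
git clone -q origin work

commit origin "second"                              # the remote moves on

before=$(git -C work rev-parse origin/master)       # still at "first"
git -C work fetch -q origin
after=$(git -C work rev-parse origin/master)        # now at "second"

git -C work merge -q origin/master                  # fast-forwards local master
remote_tip=$(git -C origin rev-parse master)
local_tip=$(git -C work rev-parse master)
echo "origin/master before fetch: $before"
echo "origin/master after fetch:  $after"
echo "local master after merge:   $local_tip"
```

Running "git pull" in "work" instead of the separate fetch and merge
should leave the clone in the same state here, since the merge is a
plain fast-forward.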

> Tracking branches *track* in the sense that a stuffed Basset Hound
> tracks. Namely, not. It's a dream of what once was.

YES!!! Ok, well, except for the stuffed part. Never have your dog
stuffed (ask Alan Alda, or read his book of the same name).

> The hyphenated *remote-tracking* is a lot better terminology already
> (and sometimes even used in the documentation), because at least it
> doesn't pretend to be a remote branch (`git branch -r`, of course,
> still does). So that single hyphen already does some good, and should
> be edited for consistency. (It did take time for me to convince myself
> during the learning process that "remote tracking" and
> "remote-tracking" probably are the same thing, and "tracked remote"
> something else, abandoning and resurrecting these hypotheses several
> times.)

This, I'm sure, can be rectified with minimum pain. Recall my note above
about non-native speakers. I don't expect them (or for that matter most
native speakers) to be in the habit of making use of English punctuation
tools that most American Language-Arts teachers don't seem to have
mastered either. Heck, I'll readily admit that I only got a 3 (of 5) on
the English Language AP Exam, so there are obviously some aspects of the
"official" language known to the cognoscenti that I still don't grok
either.

> And *even if* the word was meaningful and consistently spelt, the
> documentation uses it to *refer* to different things. Assume that we
> have the branches master, origin/master, and origin's master
> (understanding that they exist, and are different, is another Aha!
> moment largely prevented by the documentation). For 50 points, which
> is the remote tracking branch? Or the remote-tracking branch? 

Let's not make a big deal about the hyphen in that case for now.

> The remote branch? 

That would be origin's branch (whatever it is named and whoever "origin"
is in this case).

> Which branch tracks which other branch? 

Our local origin/<branch> is our local object store's memory of what it
last knew origin's branch to look like.

> Does master track anything? 

Per se, no. Can it be merged with what we last fetched from origin into
the object store (origin/...)? Yes. Can that be done with one nice
convenient wrapper command in the current git? Yes!

> Nobody seems to know, and documentation and CLI
> include various inconsistent suggestions. (I know there have been
> long, and inconclusive threads about this on the git mailing list, and
> I learned a lot from seeing other people's misconceptions mirror my
> own.)  Granted, I think the term "tracked remote branch" is used with
> laudable consistency to refer to a branch on the remote. And "remote
> tracking branch" (with or without the hyphen) more often than not
> refers to origin/master. It may be that terminology is slowly
> converging. (To something confusing, but still...)

Remember, remote branches other than "master" may be tracked. For
instance, if you are tracking git with git then you are also tracking
branches named html, todo, maint, man, pu, and next--yet you may not
have created local branches of your own to follow them.
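
To make that concrete, here is a minimal sketch using a throwaway
"remote" that has a second branch called next. The directory names and
the demo identity are assumptions for illustration; "remote-tracking
branch" is used below in the sense of origin/next, as the current git
documentation uses it.

```shell
set -e
cd "$(mktemp -d)"

git init -q origin
git -C origin symbolic-ref HEAD refs/heads/master
git -C origin -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m "first"
git -C origin branch next                  # a second branch on the "remote"
git clone -q origin work

git -C work branch -r                      # origin/master and origin/next (plus origin/HEAD)
local_before=$(git -C work for-each-ref --format='%(refname)' refs/heads)
echo "local branches right after cloning: $local_before"

# Create a local branch that follows the remote-tracking branch origin/next.
git -C work branch -q --track next origin/next
upstream=$(git -C work config branch.next.merge)
echo "local next is configured to merge from: $upstream"
```

So origin/next exists in the clone from the start, while a local "next"
only appears once you ask for one.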

> But to appreciate how incredibly difficult this was to understand,
> check this, from the Git Community book:
> 
>     A 'tracking branch' in Git is a local branch that is connected to
>     a remote branch.
> 
> To a new user, who *almost* gets it, this is just a slap in the
> face. Which one of these is origin/master again? None? (Or rather, it
> is the confirmation one needs that nobody in the Git community cares
> much, so the once-believed-to-be-carefully-worded documentation loses
> some of its authority and therefore the learner can abandon some
> misunderstandings.)

Again, the author's sense of "connected" and your internal sense of
"connected" did not match--much like "tracking" earlier (and below). He
is not stating that they are bound at the hip, he is merely noting the
presence of a conceptual relationship between the two. It could have
been stated differently perhaps, but it is not the end of the world.

> There probably is a radical case to be made for abandoning the word
> "tracking" entirely. First, because tracking branches don't track, and
> second because "tracking" already means something else in Git (see
> below). I realise that this terminology is now so ingrained in Git
> users that conservatism will probably perpetuate it. But it would be
> *very* helpful to think this through, and at least agree on who
> "tracks" what. In the ideal world, origin/master would be something
> like "the fetching branch" for the origin's master, or the "snapshot
> branch" or the "fetched branch". (I am partial to using "fetching"
> because it makes that operation a first-class conceptual citizen,
> rather than pulling, which is another siren that lures newbies into a
> maelstrom of confusion.)

Umm, NO. Tracking is a term that is used consistently in most places. It
means "following by collecting information about" as I noted earlier
with my wild game animals example. This is true throughout whether that
being tracked is a file's contents, an entire branch's contents, or the
contents of a whole repository. That you have mis-associated the concept
of tracking with that of fetching the information used to perform that
tracking (and remembering from where and how to do so) is perhaps
something that can be dealt with but it does not require abandoning what
is frankly a useful metaphor. (Besides "fetching" in isolation begs
confusion with other concepts such as a "comely lass"--the great wonder
of the English language indeed.)

> More radically, I am sure some head scratching would be able to find
> useful terminology for master, origin/master, and origin's master. I'd
> love to see suggestions. As I said, I admire how wonderfully simple
> and clean this has been implemented, and the documentation, CLI, and
> terminology should reflect that.

I did not find the terminology particularly jarring, but I have used
(and survived doing so) other SCM software. Perhaps you did not have any
previous SCM background? More information as to the source of confusion
and your perspective when starting out can only help improve the
documentation.

> The staging area
> ----------------
> 
> The wonderful and central concept of staging area exists under at
> least three names in Git terminology. And that's really, really
> annoying. The index, the cache, and the staging area are all the same,
> which is a huge revelation to a newcomer.

Not true. When merging, for instance, the index may contain multiple
staged instances of any given content, as needed to resolve conflicts.
Also, the cache is stored inside of the index. Therefore while they may
at any given time have exactly the same contents they are neither the
same things nor the same concepts.
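
A small sketch of the merge case, assuming a throwaway repository with a
made-up file f.txt and demo identity. After a conflicting merge the index
holds three entries for the one path: stage 1 (common ancestor), stage 2
(our side) and stage 3 (their side).

```shell
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/master
git config user.name demo
git config user.email demo@example.invalid

echo base > f.txt
git add f.txt
git commit -q -m base

git checkout -q -b topic
echo topic > f.txt
git commit -q -a -m topic

git checkout -q master
echo master > f.txt
git commit -q -a -m master

git merge topic >/dev/null 2>&1 || true    # expected to stop with a conflict

git ls-files --unmerged                    # one line per stage for f.txt
stages=$(git ls-files --unmerged | awk '{print $3}' | tr -d '\n')
echo "stages present for f.txt: $stages"
```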

> *Index* would have been a good word for the files known
> to Git (what is now called, sometimes, "tracked files")

Index is used to refer to the mechanism by which the currently operative
CONTENT is known to git. Git does not track files per se, it tracks
CONTENT. This is an important distinction to master. It literally
indexes the contents to be operated upon in the object store. That those
contents happen to exist in files is something it keeps track of but it
really couldn't care less. So far as it is concerned they could be show
tunes.
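
One way to see the content-versus-file distinction for yourself, as a
sketch with made-up file names: two files with identical content end up
as two index entries that point at one and the same blob in the object
store.

```shell
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo

printf 'hello\n' > greeting.txt
cp greeting.txt copy.txt                   # same content, different name
git add greeting.txt copy.txt

git ls-files --stage                       # mode, blob id, stage number, path
blob=$(git ls-files --stage greeting.txt | awk '{print $2}')
expected=$(git hash-object greeting.txt)   # id computed from the content alone
content=$(git cat-file -p "$blob")         # read the content back from the store
distinct=$(git ls-files --stage | awk '{print $2}' | sort -u | wc -l)
echo "blob=$blob content=$content distinct_blobs=$distinct"
```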

> `git stage` is already part of the distribution. Great.
> 
> 1. Search for index and cache in the documentation and rephrase any
> and all their occurrences to use "staged" (or, if it can't be avoided
> "the staging area") instead. Say "staged to be committed" often, it's
> a strong metaphor.

No. The documentation should not be made incorrect just to make it sound
more consistent.

> 2. Introduce the alias `git unstage` for `git reset HEAD` in the
> standard distribution.

This is evidence to me that you have not used other SCM software. The
idiom "reset" is widely used in various SCM implementations.

> 3. Duplicate various occurrences of `cached` flags as `staged` (and
> change the documentation and man pages accordingly), so as to have,
> e.g., `git diff --staged`.

Again, cached is not staged and flags should not be made incorrect just
to cover for the fact that you have not found a use for them separately.

> git status
> ----------
> 
> One of the earliest-to-use commands is `git status`, whose messages are
> *wordy*, but were initially completely unhelpful to me. In particular,
> 
>    working directory clean
> 
> Clean? What's this now? Clean and dirty are Git slang, and not what I
> want to meet as a new user. 

This is not git-specific jargon. In fact, it is widely used terminology
throughout the computing world in memory management, databases,
filesystems, and even other SCM platforms. I have even heard it used by
non-computer-oriented people to refer to original and changed states.

> The message should inform me that the
> untracked files in the working directory are equal to their previous
> commit. But there are other things wrong with the message. 

Actually, what it is stating is correct. It knows NOTHING about any
untracked files OTHER THAN that they have not changed since the last
commit. In other words, the POTENTIAL INDEX is clean of unwritten
changes. 

> For
> example, even though there's nothing to commit: `nothing added to
> commit but untracked files present (use "git add" to track)`? The last
> parenthesis should set off warning bells already. And what did clean
> mean with respect to untracked files? And "added to commit"? That
> sounds like amending. We add to the index or the staging area, don't
> we, "ready to be included in the next commit," so they aren't added to
> that commit quite yet?

This is analogous to files having been changed inside of an application
yet the application has not yet requested that such changes be scheduled
to be committed to the filesystem. You have to request that such changes
be added to the filesystem layer's idea of what needs to be committed
and then it will be written out in due time. The same thing applies to
git's index and the object store.

> 
>     changed but not updated:
> 
> I'm still not sure what "update" was ever supposed to mean in this
> sentence. I just edited the file, so it's updated, for crying out
> loud! The message might just say "Changed files, but not staged to be
> committed."

In this case you have already scheduled some changes of a file to be
committed in the index and then have gone and made additional changes
without updating the index. So no, it isn't updated yet--but there are
changes staged to be committed for the very same files.

> The meant-to-be helpful "use [...] to update what will be
> committed" is another can of worms, and I can find at least two ways
> to completely misunderstand this. Change to "use `git stage <file>` to
> stage". (With the new command name it's almost superfluous.)

Here is where a proper discussion of why it is called "git add <file>"
comes into play. When you use the add operation you are literally adding
the current status of that file to the index. If you make another change
to that file before committing you will need to add that new status of
the file. In other words, you have staged changes to the content of
that file into the index twice, at two different times and with two
different change-sets.
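
A sketch of that add, edit, add-again sequence, with a made-up file
notes.txt and demo identity. In the short status format the first column
compares the index with HEAD and the second compares the working tree
with the index, so "MM" means both differ.

```shell
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.name demo
git config user.email demo@example.invalid

echo one > notes.txt
git add notes.txt
git commit -q -m "add notes"

echo two >> notes.txt
git add notes.txt                    # first snapshot: "one", "two"
echo three >> notes.txt              # edited again, not added again

status=$(git status --porcelain)     # staged change plus a further unstaged change
staged_lines=$(git show :notes.txt | wc -l)

git add notes.txt                    # second snapshot replaces the first in the index
restaged_lines=$(git show :notes.txt | wc -l)
echo "status=$status staged=$staged_lines restaged=$restaged_lines"
```

Note that outside of a merge the index keeps only the most recent
snapshot for a path; the second "git add" overwrites the entry made by
the first.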

> Here are some concrete suggestions:
> 
> 1.
> 
>     nothing added to commit but untracked files present
> 
> should be
> 
>     nothing staged to commit, but untracked files present
> 
> (Comment: maybe "... but working directory contains untracked files."
> I realise that "directory" is not quite comprehensive here, because
> files can reside in subdirectories. But I'd like to be more concrete
> than "be present".)

The concept of "present" is fine in this case, just like being present
at a meeting. The comma is useful perhaps to some but may not be
grammatically correct here. As for "added" versus "staged" that depends
on the circumstances. I am having trouble coming up with an example off
the top of my head to explain when "added" and "staged" differ in
meaning, though I am sure they do (perhaps upon the deletion of a
tracked file?).

> 2.
>     Untracked files:
>     (use "git add <file>..." to include in what will be committed)
> 
> should be
> 
>     Untracked files:
>     (use "git track <file>" to track)

This is not helpful. Git does not work the way that Subversion does. In
other words, each content change must be manually added to the index for
staging to build a commit for each and every commit. Files and their
contents ARE NOT tracked perpetually (touching on earlier noted
confusion about the context-specific meaning of track). We should not
encourage confusion on this matter.

> 3.
> 
>     Changes to be committed:
>     (use "git reset HEAD <file>..." to unstage)
> 
> should be
> 
>     Staged to be committed:
>     (use "git unstage <file>" to unstage)

Again, "reset" is a widely used metaphor in SCM software and there is no
good reason to abandon it. What you are proposing is not even a 1:1
replacement. The original command quite literally does the following
"reset index information about <file> to HEAD" while what you propose
merely says "remove staged change about <file> from the index." What if
<file> has been changed and staged to commit more than once? The
proposed syntax is ambiguous and therefore is a poor replacement.
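
For reference, a minimal sketch of what "git reset HEAD <file>" does
when a file has been staged more than once. The file name and demo
identity are assumptions for illustration. The index entry goes back to
the HEAD version in one step, and the edits stay in the working tree.

```shell
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.name demo
git config user.email demo@example.invalid

echo one > notes.txt
git add notes.txt
git commit -q -m "add notes"

echo two >> notes.txt && git add notes.txt      # staged once
echo three >> notes.txt && git add notes.txt    # staged again

git reset -q HEAD -- notes.txt                  # index entry goes back to HEAD

index_content=$(git show :notes.txt)            # what the index holds now
staged=$(git diff --cached --name-only)         # nothing left staged
unstaged=$(git diff --name-only)                # the edits are still unstaged changes
worktree_lines=$(wc -l < notes.txt)
echo "index=$index_content staged=[$staged] unstaged=[$unstaged]"
```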

> Adding
> ------
> 
> The tutorial tells us that 
> 
>     Many revision control systems provide an add command that tells
>     the system to start tracking changes to a new file. Git's add
>     command does something simpler and more powerful: git add is used
>     both for new and newly modified files, and in both cases it takes
>     a snapshot of the given files and stages that content in the
>     index, ready for inclusion in the next commit.
> 
> This is true, and once you grok how Git actually works it also makes
> complete sense. "Making the file known to Git" (sometimes called
> "tracking the file") and "staging for the next commit" result in the
> exact same operations, from Git's perspective.

Not exactly. You forget that git is not stateless. A file may be "known
to git" yet not changed (much less staged) since the last commit. Also,
git does not track changes perpetually--it only updates its idea
(index) of changes to be committed (stages) when asked. It could be said
that it tracks changes incrementally upon request.
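
The "known to git yet not changed, much less staged" state can be
sketched like this (file name and demo identity are made up). Editing
the file afterwards changes the working tree only; the index stays as it
was until "git add" is run.

```shell
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.name demo
git config user.email demo@example.invalid

echo one > notes.txt
git add notes.txt
git commit -q -m "add notes"

tracked=$(git ls-files)                 # notes.txt is known to git
clean=$(git status --porcelain)         # empty: nothing changed, nothing staged

echo two >> notes.txt                   # edit without running "git add"
changed=$(git status --porcelain)       # changed in the working tree, not staged
git diff --cached --quiet               # exits 0: the index still matches HEAD
echo "tracked=$tracked changed=[$changed]"
```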

> But this is a good example of what's wrong with the way the
> documentation thinks: Git's implementation perspective should not
> define how concepts are explained. In particular, *tracking* (in the
> sense of making a file known to git) and *staging* are conceptually
> different things. In fact, the two things remain conceptually
> different later on: un-tracking (removing the file from Git's
> worldview) and un-staging are not the same thing at all, neither
> conceptually nor implementationally. The opposite of staging is `git
> reset HEAD <file>` and the opposite of tracking is -- well, I'm not
> sure, actually. Maybe `git update-index --force-remove <filename>`?
> But this only strengthens my point: tracking and staging are different
> concepts, and therefore deserve different terms in the documentation
> and (ideally) in the CLI.

As tracking of changes is in part the art of remembering those changes
and making such changes available for inspection upon request (see my
"wild game" metaphor above), the way a detective might "track" evidence,
there is no need to conflate the incremental tracking of some changes
before committing them with the list of changes we have tracked into the
object store in the past. This is part of the power of git. It tells you
what content changed when and where that change came from. It "tracked"
all of them. It does not "track" mere files. It tracks CONTENT. When git
"stages" changes it tracks them in its short-term memory (the index) and
when it commits those staged changes to the object store it allows
itself to track them for all time (conceptually).

> The entire quoted paragraph in the tutorial can be removed: there's
> simply no reason to tell the reader that git behaves differently from
> other version control systems (indeed, to take some perverse *pride*
> in that fact). 

In fact it is very worth noting that git works differently from other
SCM software. Not knowing what is different leads to a series of
important and potentially disastrous misconceptions. It should take
pride in being different, as it implements the holy grail of SCM: being
able to know who changed what, when, and perhaps even why.

> An even more radical suggestion (which would take all of 20 seconds to
> implement) is to introduce `git track` as another alias for `git
> add`. (See above under `git status`). This would be especially useful
> if tracking *branches* no longer existed.

This is not appropriate. Git is not Subversion. It does not track files
for all time and it is not stateless. Full Stop.

> There's another issue with this, namely that "added files are
> immediately staged". In fact, I do understand why Git does that, but
> conceptually it's pure evil: one of the conceptual cornerstones of
> Git -- that files can be tracked and changed yet not staged, i.e., the
> staging area is conceptually a first-class citizen -- is violated
> every time a new file is "born". Newborn files are *special* until
> their first commit, and that's a shame, because the first thing the
> new file (and, vicariously, the new user) experiences is an
> aberration. I admit that I have not thought this through.

????? This strikes me as a vast misunderstanding of the mechanism at
work. If you could describe the roots of this idea then perhaps it could
be addressed. Git is not your filesystem.

Hopefully the couple of hours I spent on this helps further this
discussion in a useful manner.

-- 
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59
