git status --porcelain is a mess that needs fixing

Eric Raymond <esr@xxxxxxxxxxxxxxxxx> · Fri, 9 Apr 2010 14:46:08 -0400 (EDT)

I'm going to gripe a lot in this mail, possibly verging on flaming.
Therefore I want to start by making clear that I am not here to
complain without pitching in to help fix the problems.  If I can get
responsive answers to my questions, I will take responsibility for
editing them into the relevant git documentation.

Short version: "git status --porcelain" is horribly badly documented
and appears to be seriously maldesigned.  Both these problems need to
be fixed before git causes a lot of unnecessary grief for people
trying to use it.

Here is the entire documentation on this feature in HEAD:

=============================================================================

In short-format, the status of each path is shown as

	XY PATH1 -> PATH2

where `PATH1` is the path in the `HEAD`, and ` -> PATH2` part is
shown only when `PATH1` corresponds to a different path in the
index/worktree (i.e. renamed).

For unmerged entries, `X` shows the status of stage #2 (i.e. ours) and `Y`
shows the status of stage #3 (i.e. theirs).

For entries that do not have conflicts, `X` shows the status of the index,
and `Y` shows the status of the work tree.  For untracked paths, `XY` are
`??`.

    X          Y     Meaning
    -------------------------------------------------
              [MD]   not updated
    M        [ MD]   updated in index
    A        [ MD]   added to index
    D        [ MD]   deleted from index
    R        [ MD]   renamed in index
    C        [ MD]   copied in index
    [MARC]           index and work tree matches
    [ MARC]     M    work tree changed since index
    [ MARC]     D    deleted in work tree
    -------------------------------------------------
    D           D    unmerged, both deleted
    A           U    unmerged, added by us
    U           D    unmerged, deleted by them
    U           A    unmerged, added by them
    D           U    unmerged, deleted by us
    A           A    unmerged, both added
    U           U    unmerged, both modified
    -------------------------------------------------
    ?           ?    untracked
    -------------------------------------------------

=============================================================================

This was clearly written as an aide-memoire by someone intimately
familiar with the system, but I have to tell you it is so confusing
to me as to be nearly worse than useless.  

In addition, some of the design choices it appears to imply are quite
bad - so I hope I am wrong about those implications.  If I am not, you
have specified a misdesigned format that will frustrate and annoy your
customers (script and front-end writers). And that would be a problem.

As I criticize, bear in mind that (a) none of my issues are VC
specific, and (b) I am the author of several version-control front
ends - *I have done this before.* My objections are *not*
theoretical!

First, the documentation issues, in roughly increasing order of severity:

1. What separates the XY column from the first path?

I'd assume a tab, but it's not documented. It needs to be documented.

2. What separates the '->' on either side from the path columns?

Not documented.  Needs to be documented.  

3. What do the status codes M A D R C mean?

I can guess, but I should not have to guess.  They should be documented.

4. Some columns in the table have sets of codes enclosed by [].  Is
this indicating alternation?

My guess is yes, but I should not have to guess.  This should be documented.

5. What is 'us' versus 'them'? What are "stage #2" and "stage #3"?

It makes my brain hurt just trying to list all the things "us"
and "them" could mean.

Remember that because you're advertising a format for script use, your
audience for this page is not git hackers.  It's not git power
users. It's not even ordinary git users. It's people whose main
expertise is *other tools*.  They want to get in, write their
script and get out, having learned as little about git as they can get
away with.

If 'us'/'them'/'stage #2'/'stage #3' are git terms of art that are well
defined elsewhere, you must reference that elsewhere.  If they are
not, you need to define them here.  And because of the special
audience for this page, it needs to be more self-contained and make
fewer assumptions about the reader's knowledge than usual.

Note: I, personally, read very fast and don't mind the mental effort
of skimming 50-100 pages of other documentation.  But you must *not*
assume I am anything but an exception.  This *particular* section on
this *particular* page needs (more than others) to be written so it
would be comprehensible to a lazy idiot who vaguely knows about
other version-control systems and can't be bothered to read
about this one, either.  

Now to the functional problems, again in roughly increasing order of
severity:

A. The '->' separator considered harmful

The '->' was superfluous and thus a poor design choice; the
distinction between two columns and three columns is easy enough to
make in any scripting language.  As it is, it's meaningless and
scripts will actually have to go to some extra effort to throw it
away.
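
A minimal Python sketch of the point, assuming fields are separated by
whitespace and paths contain no embedded whitespace (both assumptions,
since the separators are undocumented).  parse_line is a hypothetical
helper, not anything git provides:

```python
def parse_line(line):
    """Split one status line into (code, path1, path2_or_None).

    Assumes whitespace-separated fields and whitespace-free paths;
    neither is documented behavior.
    """
    fields = line.split()
    # The extra effort: the '->' token carries no information that the
    # field count does not already carry, so it is simply discarded.
    if "->" in fields:
        fields.remove("->")
    code, path1 = fields[0], fields[1]
    # Two fields means no rename; three means PATH1 was renamed to PATH2.
    path2 = fields[2] if len(fields) == 3 else None
    return code, path1, path2


# With or without the arrow, the parsed result is identical.
assert parse_line("R  old.c -> new.c") == parse_line("R  old.c new.c")
```

The field count alone distinguishes the renamed case, which is why the
arrow adds nothing for a script.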

I think the underlying problem here is that whoever designed this
never got past the idea that it needed to have cues for human
eyeballs in it.  That was a mistake.  If you're serious about it
being easily parseable, design it that way.

B. Does "untracked" include "ignored"?  

If so, that is a problem -- front ends care about the difference, for
example when C-x v v is trying to compute the logical next action.
For an unregistered file, it's to register it.  For an ignored file,
it's to throw a user-visible error.

C. If "untracked" does not include "ignored", how is an ignored file tagged?

If ignored files are not listed, that's another problem. Even more
serious, actually.

D. How do I tell the conflict/no-conflict cases apart?

You have three divisions in the table.  The first two are supposed
to pertain to "entries that do not have conflicts" and "unmerged
entries".  

They share code letters.  *How do I tell them apart?* 

Illustrative case: I see the status code "DD". How do I distinguish
between case 4 ("deleted from index") and case 10 ("unmerged, both
deleted")?

If the distinction is meaningless, then why are they listed
separately?
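
The overlap can be checked mechanically.  The sketch below transcribes
the table above, reading [..] as alternation (a guess; see item 4), and
intersects the codes the first section can produce with the codes in
the "unmerged" section.  The names are illustrative only:

```python
from itertools import product

# First section of the table as (X choices, Y choices) per row,
# reading [..] as "any one of these characters" (an assumption).
NO_CONFLICT_ROWS = [
    (" ", "MD"),      # not updated
    ("M", " MD"),     # updated in index
    ("A", " MD"),     # added to index
    ("D", " MD"),     # deleted from index
    ("R", " MD"),     # renamed in index
    ("C", " MD"),     # copied in index
    ("MARC", " "),    # index and work tree matches
    (" MARC", "M"),   # work tree changed since index
    (" MARC", "D"),   # deleted in work tree
]

# Second section: the unmerged codes, listed literally.
UNMERGED = {"DD", "AU", "UD", "UA", "DU", "AA", "UU"}


def expand(rows):
    """Return every two-character XY code the given rows can produce."""
    return {x + y for xs, ys in rows for x, y in product(xs, ys)}


ambiguous = expand(NO_CONFLICT_ROWS) & UNMERGED
print(sorted(ambiguous))  # prints ['DD']
```

Under that reading, "DD" appears in both sections, so the status code
by itself cannot say which meaning applies.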

E. Are you *really* using a space as a status character?

It certainly appears so from the first and seventh rows of the table.
If so, this was a major blunder. It complicates parsing code
unnecessarily, because the easiest way to separate columns is with the
equivalent of a Python or Perl split() operation that will eat that
space.  Then we have to special-case depending on the field width.

The correct way to design a format like this for script parseability
is to (a) never make the difference between space and tab significant,
and (b) never use whitespace as anything but a field separator.  If
you want the equivalent of "blank" you use '-', as in Unix ls -l
output.
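
A minimal Python sketch of the problem, assuming each line is laid out
as two status characters, one space, then the path (the separator is an
assumption, since it is undocumented).  The '-' lines show a
hypothetical variant of the format, not anything git emits:

```python
def naive_parse(line):
    """Split a status line on runs of whitespace, as most scripts would."""
    return line.split()


# Assumed layout: two status characters, one space, then the path.
staged = "M  foo.c"     # X='M', Y=' '  (updated in index)
unstaged = " M foo.c"   # X=' ', Y='M'  (work tree changed since index)

# Both collapse to the same fields; the column position of 'M' is lost.
assert naive_parse(staged) == naive_parse(unstaged) == ["M", "foo.c"]

# Hypothetical variant using '-' for "blank": split() now keeps both columns.
staged_dash = "M- foo.c"
unstaged_dash = "-M foo.c"
assert naive_parse(staged_dash) == ["M-", "foo.c"]
assert naive_parse(unstaged_dash) == ["-M", "foo.c"]
```

With a significant space, a script has to fall back on slicing fixed
columns (for example line[:2]) rather than splitting on whitespace.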

This may sound like a nitpick, but it's actually a crash landing, or
close to it.  Front-end writers look at things like this and think
"Idiots.  Can't trust them an inch...".  And git already has a bad
reputation for interface spikiness to live down.
-- 
		Eric S. Raymond <http://www.catb.org/~esr/>

A right is not what someone gives you; it's what no one can take from you. 
	-- Ramsey Clark