Re: [doc] User Manual Suggestion

David Abrahams <dave@xxxxxxxxxxxx> · Sun, 26 Apr 2009 16:17:43 -0400

On Apr 26, 2009, at 1:56 PM, Björn Steinbrink wrote:

On 2009.04.26 09:55:34 -0400, David Abrahams wrote:

On Apr 26, 2009, at 7:28 AM, Björn Steinbrink wrote:

On 2009.04.25 15:36:24 -0400, David Abrahams wrote:
Where it's relevant is when the user notices that two distinct files
have the same id (because they happen to have the same contents) and
wonders what's up.

Why would the user have to care about the object files in the repo?

What a strange question. I have no idea how to answer. It seems
self-evident to me that users of a VCS care that their files are
stored in it.

_Their_ files. The files that come from/end up in the working tree. I
cared about those when I used SVN, too. But I never went to the SVN repo
to find out if there are two equal files in it. We're talking about
object names, and those belong to objects, not files in the working
tree.

I'm telling you, many new users who aren't already versed in Git will
naturally associate the SHA1 codes exposed by the interface with the
files they've checked in without understanding that they actually
identify object files (another poorly chosen Git name, if I've managed
to deduce what it means) rather than directly corresponding to states
of their files. And anyway, if you want to get into implementation
details, SHA1s don't always identify object files because blobs get
delta-compressed.

And why would your implementation save the same object twice, in two
distinct files?

One could easily have the expectation that contents can be duplicated
because there are numerous precedents in everyone's experience of
computing, for example in filesystems and in any programming language
that is not pure-functional.

That's not answering my question. I asked why you come up with an
implementation that is "broken" enough to save the same object twice
with different file names.

I don't know what you mean by "come up with an implementation."  I'm  
not inventing an implementation.  I'm saying, new users inevitably and  
inexorably develop a mental model of the system they're learning  
about, and they don't always develop the right mental model, and I'm  
saying that it's easy to see how they can fall into incorrect  
assumptions.  The word "hash" helps a bit with avoiding one of those  
assumptions.

If the implementation does not do that, your
"when the user notices that two distinct files have the same id" is
immediately invalid. The user cannot come into that situation then.

I think this is why Git remains more opaque than it should be.  You
can't assume that people will naturally develop the smartest possible
mental model of a VCS, even when faced with some hints in the form of
a partial understanding of Git.

And
anyway, when the user notices something, that's a discovery, not an
expectation.

It's better to give people something to connect their discoveries to  
(e.g. "oh, I see, they call those things hashes, so it makes sense  
that these two identical things are stored once")

The SHA-1 hash is created from the object, that means
its type, size and data. It's not an id of a file in the working
tree, but of an object

All true.  All somewhat subtle distinctions that are not nearly as
apparent unless you actually use the word "hash" as I have been
advocating.
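
To make the "type, size and data" point concrete, here is a minimal Python sketch of how such an object name can be computed. The helper names `git_object_id` and `blob_id` are made up for illustration. The header layout (type, a space, the decimal size, a NUL byte, then the data) is the one Git uses when it hashes objects with SHA-1.

```python
import hashlib


def git_object_id(obj_type: str, data: bytes) -> str:
    """Hash the object's type, size and data, in that order."""
    # Header: "<type> <size in bytes>\0", followed by the raw data.
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def blob_id(data: bytes) -> str:
    """Name of the blob object that would hold these file contents."""
    return git_object_id("blob", data)


# Two working-tree files with identical contents (hypothetical example data).
readme = b"hello world\n"
copy_of_readme = b"hello world\n"

# Same contents give the same blob name, whatever the files are called.
print(blob_id(readme) == blob_id(copy_of_readme))

# The type is part of the hashed object, so the same bytes under another
# type would get a different name.
print(blob_id(readme) == git_object_id("commit", readme))
```

Note that no file name goes into the hash: the name identifies the object, not a path in the working tree. For a real file, `blob_id` should agree with what `git hash-object <file>` prints.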

Hu? How does saying "object hash" instead of "object id" make it any
more apparent that a file in the working tree is something else than a
git object?

It makes it apparent that two identical things can only have one ID,  
and thus must correspond to one object.

You can't have two objects with the same contents to begin with, same
content => same object.

In the Git world, I agree.  In general, I disagree.
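
The two viewpoints can be put side by side in a small sketch. `ContentStore` is a hypothetical toy, not Git's storage code, and it hashes the raw bytes only (a simplification; Git also hashes a type and size header). The plain dictionary keyed by file name stands in for the "in general" experience from filesystems, where equal contents are happily stored twice.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is derived from the value."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Storing identical data again just overwrites the same slot.
        self.objects[key] = data
        return key


# Content-addressed: two "files" with equal contents end up as one object.
store = ContentStore()
id_a = store.put(b"same bytes\n")
id_b = store.put(b"same bytes\n")
print(id_a == id_b, len(store.objects))

# Name-addressed (filesystem-like): equal contents are stored twice.
files_by_name = {"a.txt": b"same bytes\n", "b.txt": b"same bytes\n"}
print(len(files_by_name))
```

In the first store, duplicates cannot exist by construction. In the second, nothing prevents them, which is the expectation a new user may bring along.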

I don't think we're discussing a term to describe something that
identifies an object in general. So, "in general" you can disagree as
much as you want, but for git that doesn't matter at all.

You don't think the general rules of the computing world and existing  
meanings of terms have an impact on a new user's ability to grok Git?   
If not, we don't have much to discuss.

The fact that this is so in the Git world is reinforced by the notion that
the id of an object is a hash of its contents.

You can just have that one object stored multiple times in different
places (for sane implementations this likely means that you have
more than one repo to look at, and each has its own copy of that
object, but that's nothing you as a user should have to care about).

It's an identity relation: same name/id => same object. Unlike e.g. a
hash-table where you are expected to deal with collisions, and having
the same hash doesn't mean that you have identical data.  But that's
not true of git, it expects an identity relation, which is IMHO
better expressed through "object name" or "object id".

Yes, that's true in the Git world (though not necessarily elsewhere), or
at least you hope it is.  In fact, there's no guarantee that SHA1
collisions won't occur; it's just extremely unlikely.  In fact, if you
google it you can find some interesting papers about SHA1 collision.
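
For a rough sense of "extremely unlikely", here is a back-of-the-envelope sketch. It assumes SHA-1 behaves like a uniformly random 160-bit function and applies the union bound over all pairs of objects. It covers accidental collisions only and says nothing about deliberately constructed ones. The function name `collision_bound` and the object count are hypothetical.

```python
from fractions import Fraction


def collision_bound(n_objects: int, hash_bits: int = 160) -> Fraction:
    """Upper bound on the chance that any two of n random hashes collide."""
    # Each of the n*(n-1)/2 pairs collides with probability 2**-hash_bits;
    # the union bound adds those probabilities up.
    pairs = n_objects * (n_objects - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


# A hypothetical repository with a billion objects.
print(float(collision_bound(10 ** 9)))
```

The bound grows with the square of the number of objects, so it stays tiny for any realistic repository under the stated assumption.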

Sure, it's an assumption that has been made and is required to hold true
for git to work.

Another way to express what you wrote above:

  same id => same hash ?=> same contents => same object

where ?=> means "almost certainly implies."

No, that chain shows how git could be "unreliable" when you get hash
collisions. You could put that into a chapter that explains the
implications of the way git generates its object ids. But it's not very
interesting when you use git and (implicitly) trust the assumption that
no collisions happen.

My point in mentioning that it's not certain was to point out that you  
left out the implication that actually /is/ certain, even across repos.

Only when you want to explain how git manages to avoid duplicated
storage of fully identical contents, then you need to mention that the
object names are the hashes of the full object contents. But that's not
what you actually use the object names for.

same content ==> same content hash ==> same object name/id ==> same object

(Actually, you need an additional detail: "same
file/symlink/directory/... contents ==> same object contents", which
can't be made explicit by just saying that you use a hash).

Your chain was in the wrong order

If you think there's a right order, you haven't understood that all  
the arrows are bidirectional.

and explains neither the "a tree that
has the same object name/id for two entries" case (because of the
uncertainty of the "same hash ?=> same content" part), nor, when read
in the other direction, where all implications are true, why same
content leads to the same object (as it already starts at the object
level).

I think the implication is important in both directions.  Neither one is
self-evident to a new user.  Maybe the right answer is 'hash id'.

git could work differently. Just moving the storage of the filenames from
the tree objects to the blobs would mean that you'd get different
objects for files that have the same content but different names. You'd
still have a hash of the object contents as the object name, but
suddenly you get more objects. Just saying "hash" or "hash id" doesn't
magically explain all the other things.

But that's a strawman.  I'm not claiming that it magically explains  
all the other things.  I'm just claiming that it helps in avoiding  
some possible misunderstandings.
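
The hypothetical redesign mentioned above (file names stored in the blobs instead of the trees) can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names, the example file names, and the layout of the "name inside the blob" variant, which is not something Git does.

```python
import hashlib


def object_id(payload: bytes) -> str:
    """SHA-1 over a Git-style blob header plus the payload."""
    header = f"blob {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def id_contents_only(name: str, data: bytes) -> str:
    # As in Git: the name lives in the tree, so it is not hashed here.
    return object_id(data)


def id_with_name(name: str, data: bytes) -> str:
    # Hypothetical variant: the name is part of the blob's payload.
    return object_id(name.encode("utf-8") + b"\x00" + data)


data = b"identical contents\n"

# Contents-only blobs: one object for both files.
print(id_contents_only("a.txt", data) == id_contents_only("b.txt", data))

# Name-in-blob variant: still a hash of the object, but two objects now.
print(id_with_name("a.txt", data) == id_with_name("b.txt", data))
```

In both designs the object name is a hash of the object's contents. What changes is what counts as the object's contents, which is why the word "hash" alone does not settle how many objects you end up with.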

--
David Abrahams
BoostPro Computing
http://boostpro.com
