Re: [doc] User Manual Suggestion

David Abrahams <dave@xxxxxxxxxxxx> · Mon, 27 Apr 2009 12:30:58 -0400

On Apr 26, 2009, at 6:25 PM, Björn Steinbrink wrote:

On 2009.04.26 16:17:43 -0400, David Abrahams wrote:

I'm telling you, many new users who aren't already versed in Git will
naturally associate the SHA1 codes exposed by the interface with the
files they've checked in without understanding that they actually
identify object files (another poorly chosen Git name, if I've managed
to deduce what it means)

Hm, not sure if that name is really important. The way objects are
stored is an implementation detail. Usually, we're just talking about
"objects", not the files the loose objects are stored in (loose object =
an object stored in its own file, not in a pack file). But as you
complained about it, what would you call a file in which an object is
stored?

"Object" is OK. "Object file" is overloaded and confusing.  I'd just  
say there are "Git data files" or "files in Git's object store", some  
of which store single objects whose id is the same as the filename,  
and some of which store multiple objects.
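
As a rough illustration of the single-object case, here is a minimal
Python sketch of how a blob's id follows from its contents and where a
loose object with that id would live. The function names are made up
for this example; the header format and the two-character directory
split reflect how Git lays out loose objects, so the "filename" is
really the id split across a directory and a file name.

```python
import hashlib


def git_object_id(content: bytes, obj_type: str = "blob") -> str:
    """Return the SHA-1 object name Git assigns to `content` (hypothetical helper)."""
    # Git hashes a header "<type> <size>\0" followed by the raw content.
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Path of a loose object, relative to the repository's .git directory."""
    # The first two hex digits name a directory, the remaining 38 the file.
    return f"objects/{object_id[:2]}/{object_id[2:]}"


empty_id = git_object_id(b"")
print(empty_id)                     # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(loose_object_path(empty_id))  # objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Objects stored in a pack file have no such per-object path, which is
why the id identifies the object rather than any particular file.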

rather than directly corresponding to states
of their files. And anyway, if you want to get into implementation
details, SHA1s don't always identify object files because blobs get
delta-compressed.

True, they identify the object; it's not even necessary to mention
delta compression, just having the object in a pack file causes the
object name to no longer identify the file in which the object can be
found.

Right.

Heck, the object might be in a different repo when you use
alternates ;-). And I think I never explicitly said that they
identify a file storing an object, but implied that by "accepting"
your example and assuming that you meant two object files having the
same id.

Yes, that assumption was wrong, and then when you responded using the  
term "object file" I didn't know what it meant.

I should have said that your "two distinct files have the same id" makes
no sense and should have asked what you mean.

And why would your implementation save the same object twice, in
two distinct files?

One could easily have the expectation that contents can be
duplicated because there are numerous precedents in everyone's
experience of computing, for example in filesystems and in any
programming language that is not pure-functional.

That's not answering my question. I asked why you come up with an
implementation that is "broken" enough to save the same object twice
with different file names.

I don't know what you mean by "come up with an implementation."  I'm
not inventing an implementation.

Sorry, "come up with" is clearly wrong. "Assume" or "expect" or so might
have been more correct.

I think I explained why one might make that assumption.

But I think we could agree that you misused the
"id" term by using it for files, and what ensued confused both of us?
If you didn't mean the stored objects by "files", then that part of the
discussion was just based on a misunderstanding and can be ignored.

I meant what the user thinks of as files stored in the repository.

I'm saying, new users inevitably and inexorably develop a mental model
of the system they're learning about, and they don't always develop
the right mental model, and I'm saying that it's easy to see how they
can fall into incorrect assumptions.  The word "hash" helps a bit with
avoiding one of those assumptions.

I've not met a lot of people that were actually confused about the fact
that the same object might be "reused" for tree entries with different
names. But most (all?) of those that were confused knew that the objects
are identified by hashes, but expected the filenames to be part of the
object and didn't know about tree objects.
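
To make the "filenames are not part of the object" point concrete, here
is a small Python sketch. The dictionary is only a simplified stand-in
for a tree object (real trees also record file modes and can point at
other trees), and the names `blob_id`, `tree` and `names_by_blob` are
invented for the example.

```python
import hashlib
from collections import defaultdict


def blob_id(content: bytes) -> str:
    """Blob id computed from the content only; no filename is involved."""
    header = f"blob {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


license_text = b"Permission is hereby granted...\n"

# Simplified tree: it is the tree, not the blob, that supplies the names.
tree = {
    "LICENSE": blob_id(license_text),
    "COPYING": blob_id(license_text),
    "README": blob_id(b"Read me first.\n"),
}

# Group names by the blob they refer to: two names share one blob id.
names_by_blob = defaultdict(list)
for name, oid in tree.items():
    names_by_blob[oid].append(name)

for oid, names in names_by_blob.items():
    print(oid[:7], names)
```

Seeing the same id next to LICENSE and COPYING is exactly the situation
a new user might run into in gitweb.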

Well, there's certainly precedent for the idea that the filenames are  
distinct from file contents.

And anyway, when the user notices something, that's a discovery, not
an expectation.

It's better to give people something to connect their discoveries to
(e.g. "oh, I see, they call those things hashes, so it makes sense
that these two identical things are stored once")

We're talking about seeing, for example, the same object name more than
once, for different "files", in e.g. gitweb, right? Then the "Huh? Isn't
the filename part of the object?" thing might still apply. The user can
still very easily make a wrong guess.

As Michael said in another mail, the important point is probably rather
to teach people to make a distinction between files and directories in
the working tree and the contents stored in the git objects. And that's
not accomplished by saying that the id is a hash, when the user doesn't
know what the hash is based upon.

Somewhat related: I'm trying to remember if I ever had problems
explaining the concept of hardlinks to someone, but I don't remember any
such situation anymore. There are no hashes involved there, and I feel
like that was quite easy to grasp for most people I talked to. It's
pretty similar, separating content from names.

The difference is that hardlinks are only generated explicitly.  You'd  
need something like a hash to generate them automatically and  
implicitly.
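
A sketch of what "generating hardlinks automatically" could look like,
assuming a filesystem that supports hard links. This is not something
Git does to the working tree; `dedupe_with_hardlinks` is a hypothetical
helper that only shows how a content hash is what lets the sharing
happen without anyone asking for it.

```python
import hashlib
import os
import tempfile


def dedupe_with_hardlinks(directory: str) -> int:
    """Replace files with identical content by hard links to one copy.

    Returns the number of files that were relinked.
    """
    first_seen = {}  # content hash -> path of the first file with that content
    relinked = 0
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        original = first_seen.setdefault(digest, path)
        if original != path and not os.path.samefile(original, path):
            os.remove(path)
            os.link(original, path)  # the hash decides; no user request needed
            relinked += 1
    return relinked


with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.txt", "same\n"), ("b.txt", "same\n"), ("c.txt", "other\n")]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write(text)
    count = dedupe_with_hardlinks(tmp)
    a, b, c = (os.path.join(tmp, n) for n in ("a.txt", "b.txt", "c.txt"))
    linked = os.path.samefile(a, b)        # a.txt and b.txt now share storage
    separate = not os.path.samefile(a, c)  # c.txt has different content
```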

You can't have two objects with the same contents to begin with,
same content => same object.

In the Git world, I agree.  In general, I disagree.
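
The two viewpoints can be put side by side in a few lines of Python.
The `ObjectStore` class below is a toy content-addressed store written
for this example, not Git's implementation; it keys everything by the
same kind of hash Git uses for blobs.

```python
import hashlib

# In general: two distinct objects can hold equal contents.
a = bytearray(b"hello\n")
b = bytearray(b"hello\n")
equal_but_distinct = (a == b) and (a is not b)


class ObjectStore:
    """Toy content-addressed store: the key is derived from the content."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        header = f"blob {len(content)}\0".encode("ascii")
        oid = hashlib.sha1(header + content).hexdigest()
        # Storing the same content again finds the existing entry.
        self._objects.setdefault(oid, content)
        return oid

    def __len__(self) -> int:
        return len(self._objects)


# In the Git-like world: same content => same id => one stored object.
store = ObjectStore()
first = store.put(bytes(a))
second = store.put(bytes(b))
print(equal_but_distinct, first == second, len(store))
```

A newcomer bringing the first expectation to a system that behaves like
the second is the kind of mismatch being discussed here.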

I don't think we're discussing a term to describe something that
identifies an object in general. So, "in general" you can disagree as
much as you want, but for git that doesn't matter at all.

You don't think the general rules of the computing world and existing
meanings of terms have an impact on a new user's ability to grok Git?
If not, we don't have much to discuss.

This was probably also based on the files+id misunderstanding combined
with the fact that you used the term "object" where I thought that you
meant a "git object" (you probably didn't, right?).

I didn't.  I meant the general notion of "object" in computing.  I'm  
trying to talk about how the language used by Git's docs can bias  
people toward correct or incorrect understandings of Git as they're  
learning.

Because when talking
about "git objects" you actually can't have two different ones with the
same "value" (I guess you mean type, size and content when you say
"value", right?)

Yes.  Size is a function of content, so that adds nothing, and whether  
it even makes sense to say that two things of different type have  
identical content is debatable.

And admittedly, for this one, the "hash" term _would_ help to get the
user to understand that in git you cannot have two different objects
with the same contents and that this makes git different and efficient.
But I still don't buy that this is important for understanding the basic
data model. It's a nice hint why git can always quickly tell that two
things are equal and why the repository size doesn't explode. But the
important part is the separation of names and content, that trees give
names to the contents stored in blobs.

But there's nothing unique about that; it's not distinct from what  
filesystems do.

The "hash" name would only help
to understand its efficiency once you already understood the data model.

It would help to reinforce that an object's id is a function of its  
contents.  It would help to make clear why the same object can be  
identified in the same way across all repos.

Another way to express what you wrote above:

 same id => same hash ?=> same contents => same object

where ?=> means "almost certainly implies."

No, that chain shows how git could be "unreliable" when you get hash
collisions. You could put that into a chapter that explains the
implications of the way git generates its object ids. But it's not
very interesting when you use git and (implicitly) trust the
assumption that no collisions happen.

My point in mentioning that it's not certain was to point out that you
left out the implication that actually /is/ certain, even across
repos.

And my point is that this is not important for understanding the basic
data model, but only how git efficiently implements it, and which
assumptions it has to make.

Look, you're talking to someone who has just had to go through the  
process of learning all this stuff.  What I'm telling you is based on  
my experiences.  Just one datapoint, to be sure, but knowing that it's  
a hash was crucial for me.

If you think there's a right order, you haven't understood that all
the arrows are bidirectional.

There's one that is not truly bidirectional.

id <=> hash <?=> contents <=> object

I can't go from id/hash to contents/object without hitting the
"hash => content" assumption.

Quite right.  You can't derive contents from the hash.
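
A quick way to see why the arrow only runs one way: the id has a fixed
size no matter how much content goes in, so it cannot carry the content
back out. A minimal sketch, with `blob_id` as an illustrative helper:

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Compute a blob id the way Git does: SHA-1 over header plus content."""
    header = f"blob {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


small = blob_id(b"x")
large = blob_id(b"x" * 1_000_000)

# Both ids are 40 hex digits (160 bits), whatever the input size, so
# getting from an id back to contents requires looking the object up.
print(len(small), len(large))  # 40 40
```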

But that's a strawman.  I'm not claiming that it magically explains
all the other things.  I'm just claiming that it helps in avoiding
some possible misunderstandings.

And I think that it doesn't help much at all and might confuse users,
because they expect the hash to be based on the wrong stuff. It's just
important that the "thing" is used to identify an object.

OK, I give up.  *I* now understand the system, and it's starting to  
look like too much of a struggle to improve things for others, so they  
can fend for themselves I guess.

Thanks for the lively discussion, anyway.

--
David Abrahams
BoostPro Computing
http://boostpro.com
