On 2009.04.26 09:55:34 -0400, David Abrahams wrote:
>
> On Apr 26, 2009, at 7:28 AM, Björn Steinbrink wrote:
>
>> On 2009.04.25 15:36:24 -0400, David Abrahams wrote:
>>> Where it's relevant when the user notices that two distinct files
>>> have the same id (because they happen to have the same contents) and
>>> wonders what's up.
>>
>> Why would the user have to care about the object files in the repo?
>
> What a strange question.  I have no idea how to answer.  It seems
> self-evident to me that users of a VCS care that their files are
> stored in it.

_Their_ files. The files that come from/end up in the working tree. I
cared about those when I used SVN, too. But I never went to the SVN repo
to find out whether there are two equal files in it. We're talking about
object names, and those belong to objects, not to files in the working
tree.

>> And why would your implementation save the same object twice, in two
>> distinct files?
>
> One could easily have the expectation that contents can be duplicated
> because there are numerous precedents in everyone's experience of
> computing, for example in filesystems and in any programming language
> that is not pure-functional.

That's not answering my question. I asked why you would come up with an
implementation that is "broken" enough to save the same object twice
under different file names. If the implementation does not do that, your
"when the user notices that two distinct files have the same id" is
immediately invalid: the user cannot get into that situation. And
anyway, when the user notices something, that's a discovery, not an
expectation.

>> The SHA-1 hash is created from the object, that means
>> its type, size and data. It's not an id of a file in the working
>> tree, but of an object
>
> All true.  All somewhat subtle distinctions that are not nearly as
> apparent unless you actually use the word "hash" as I have been
> advocating.

Huh?
How does saying "object hash" instead of "object id" make it any more
apparent that a file in the working tree is something other than a git
object?

>>> It's not a foregone conclusion that objects with the same value have
>>> identical ids, but it's immediately apparent if the id is known to
>>> be a hash.
>>
>> You can't have two objects with the same contents to begin with, same
>> content => same object.
>
> In the Git world, I agree.  In general, I disagree.

I don't think we're discussing a term to describe something that
identifies an object in general. So, "in general" you can disagree as
much as you want, but for git that doesn't matter at all.

> The fact that is so in the Git world is reinforced by the notion that
> the id of an object is a hash of its contents.
>
>> You can just have that one object stored multiple times in different
>> places (for sane implementations this likely means that you have
>> more than one repo to look at, and each has its own copy of that
>> object, but that's nothing you as a user should have to care about).
>
>> It's an identity relation: same name/id => same object. Unlike e.g. a
>> hash-table where you are expected to deal with collisions, and having
>> the same hash doesn't mean that you have identical data. But that's
>> not true of git, it expects an identity relation, which is IMHO
>> better expressed through "object name" or "object id".
>
> Yes, that's true in the Git world (though not necessarily elsewhere), or
> at least you hope it is.  In fact, there's no guarantee that SHA1
> collisions won't occur; it's just extremely unlikely.  In fact, if you
> google it you can find some interesting papers about SHA1 collision.

Sure, it's an assumption that has been made and is required to hold true
for git to work.

> Another way to express what you wrote above:
>
>   same same id => same hash ?=> same contents => same object
>
> where ?=> means "almost certainly implies."

No, that chain shows how git could be "unreliable" when you get hash
collisions. You could put that into a chapter that explains the
implications of the way git generates its object ids. But it's not very
interesting when you use git and (implicitly) trust the assumption that
no collisions happen. For that case, you need a different chain:

	same name/id ==> same object ==> same content

That's interesting when you e.g. want to "access" some object or when
you look at a tree that references the same object twice. For example,
when both references are for file entries, you know that those files
have the same content. That it is a hash doesn't matter; the id could be
anything that uniquely identifies an object.

The "same object ==> same content" part should be pretty obvious, so you
only need to know that the "same name/id ==> same object" part is true,
i.e. that the object name/id uniquely identifies the object. And that
_is_ true, simply because you cannot have two objects in the same repo
that have the same hash and thus the same id. Even if you get a
collision, you'll still have just one object. And that's not something
that a term containing the word "hash" tells me; it would instead tell
me that it is not something that really uniquely identifies an object,
although git uses it as such.

Only when you want to explain how git manages to avoid duplicated
storage of fully identical contents do you need to mention that the
object names are the hashes of the full object contents. But that's not
what you actually use the object names for.

	same content ==> same content hash ==> same object name/id ==> same object

(Actually, you need an additional detail: "same file/symlink/directory/...
contents ==> same object contents", which can't be made explicit by just
saying that you use a hash).

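To make the two chains above concrete, here is a minimal Python sketch.
It assumes the usual loose-object layout, i.e. SHA-1 over a
"<type> <size>\0" header followed by the data. The ObjectStore class,
its put() method and the dict standing in for a tree are hypothetical
names made up for this illustration, not git's implementation.

```python
import hashlib


def object_name(obj_type, data):
    """Name an object by hashing its type, size and data together."""
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


class ObjectStore:
    """Hypothetical in-memory store keyed by object name (illustration only)."""

    def __init__(self):
        self.objects = {}

    def put(self, obj_type, data):
        name = object_name(obj_type, data)
        # Same name means same object: a second put() of identical
        # contents finds the existing entry and stores nothing new.
        self.objects.setdefault(name, (obj_type, data))
        return name


store = ObjectStore()
readme = store.put("blob", b"hello\n")
copy_of_readme = store.put("blob", b"hello\n")  # a second file, same contents
other = store.put("blob", b"hello world\n")

# A "tree" here is just a hypothetical mapping of file names to object names.
tree = {"README": readme, "docs/README": copy_of_readme, "greeting.txt": other}
```

Here tree["README"] and tree["docs/README"] carry the same name, so they
refer to one stored object, and the store ends up holding two objects
rather than three. Nothing on the reading side needs to know that the
name happens to be a hash.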
Your chain was in the wrong order and explains neither the "a tree that
has the same object name/id for two entries" case (because of the
uncertainty of the "same hash ?=> same content" part), nor, when read in
the other direction, where all implications are true, why the same
content leads to the same object (as it already starts at the object
level).

>> You can still say that the name/id is generated by using a hash
>> function, but the important part is that the name/id is used to
>> _uniquely_ identify an object, which isn't apparent when you call it
>> a hash.
>
> I think the implication is important in both directions.  Neither one is
> self-evident to a new user.  Maybe the right answer is 'hash id'.

git could work differently. Just moving the storage of the filenames
from the tree objects to the blobs would mean that you'd get different
objects for files that have the same content but different names. You'd
still have a hash of the object contents as the object name, but
suddenly you get more objects. Just saying "hash" or "hash id" doesn't
magically explain all the other things.

Björn
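The hypothetical redesign described above, with filenames moved into the
blobs, can be sketched in a few lines of Python. This is an illustration
only: name_with_filename() is an invented function, and the way it embeds
the file name in the payload is an assumption, not anything git does.

```python
import hashlib


def git_style_name(data):
    """Blob name as a hash over type, size and data only."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def name_with_filename(filename, data):
    """Hypothetical variant: the file name is stored inside the object."""
    payload = filename.encode("utf-8") + b"\0" + data
    return hashlib.sha1(b"blob %d\0" % len(payload) + payload).hexdigest()


contents = b"same contents\n"

# With names kept in the tree, both files map to one object.
same = git_style_name(contents) == git_style_name(contents)

# With names moved into the blob, identical contents yield two objects,
# although each name is still "a hash of the object contents".
differ = name_with_filename("a.txt", contents) != name_with_filename("b.txt", contents)
```

Both schemes name objects by a hash of the object contents. Only the
decision about what goes into the object determines whether two files
with equal contents share one object.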