On Wed, Aug 01, 2012 at 11:36:00AM +0700, Nguyen Thai Ngoc Duy wrote: > > That is orthogonal to the issue of what is being stored. I chose my > > mmap'd disk implementation because it is very fast, which makes it nice > > for a performance cache. But you could store the same thing in git-notes > > (indexed by dst sha1, I guess, and then pointing to a blob of (src, > > score) pairs. > > > > If you want to include path-based hints in a commit, I'd say that using > > some micro-format in the commit message would be the simplest thing. > > Rename correction is after the commit is created. I don't think we can > recreate commits. Yes, if you go with a commit-based approach, you can do either notes or in-commit messages. In other words, I would break the solutions down as: 1. Store sha1+sha1 -> score mapping (i.e., what I suggested). This is fundamentally a global store, not a per-commit store. For storage, you can do one (or a combination) of: a. Store the mapping in some local file. Fast, but can't be shared. b. Store the mapping in a note (probably indexed by the destination blob sha1). Less efficient, but easy to share. 2. Store path -> path mapping. This is fundamentally a per-commit store, since it is valid only in the diff between two particular trees. For storage, you can do one (or a combination) of: a. Store the mapping inside the commit object. Simple to share, but hard to change later. b. Store it in a note indexed by the commit object. Slightly harder to share, but can change later. I implemented (1a). Implementing (1b) would be easy, but for a full-on cache (especially for "-C"), I think the resulting size might be prohibitive. All solutions under (2) suffer from the same problem: they are accurate only for a single diff. For other diffs, you would either have to not use the feature, or you would be stuck traversing the history and assigning a temporary file identity (e.g., given commits A->B->C, and in A->B we rename "foo" to "bar", the diff between A and C could discover that A's "foo" corresponds to C's "bar"). That is slower than what we do now, but that is only one problem with it. Another is that the history relationship between the commits might be complex. In the example above, we had a direct-descendent relationship. But what about one with branching, and one side does the rename and the other does not? Or even cherry-picking from an unrelated part of history entirely? And I think all forms of (2) will suffer from this, no matter how you store the data. The reason I mentioned specifically putting it in the commit message (i.e., 2a) because that is what people suggested in the very early days of git (e.g., saying that rename detection is OK, but could we supplement it with hints at commit-time?). And all of the arguments against it back then apply equally now. And apply equally to storing it in notes, because it is just another form of the same thing. > > But that has been discussed before; ultimately the problem is that > > it only covers _one_ diff that we do with that commit (it is > > probably the most common, of course, but it doesn't cover them all). > > How about we generate sha1 mapping from commit hints? We try to take > advantage of path hints when we can. Else we fall back to sha-1 > mapping. This way we can transfer commit hints as git-notes to another > repo, then regenerate sha-1 mapping there. No need to transfer sha1 > maps. Yes. You could definitely pre-seed the sha1 mapping from the per-commit hints. However, if you are interested in overriding git's rename detection, then per-commit hints will only let you do part of what you want. For example, consider a linear history like this: A--B--C And imagine that commit B renamed a file from "foo" to "bar", and that commit C further modified "bar". Let's say that git did not detect the rename in B, but you manually annotated it. As a result, we seed the sha1 mapping to show that the sha1 of A:foo and the sha1 of B:bar has a rename score of 100%. Now I want to do a diff between A and C (e.g., for rename detection on one side of a merge). But their sha1s are not in my mapping (because C:bar is not the same as B:bar), and I don't find the rename. To find it, I cannot seed a per-commit path mapping. I must seed the sha1 mapping itself. So to adapt your per-commit mapping into a sha1 mapping, the seeding process would have to actually walk the history, transitively creating sha1 mapping entries (i.e., seeing that "bar" changed between B and C, and therefore mapping A:foo to C:bar, as well). For this reason, I'm not sure that stored overrides like this are generally useful in the long run. I think storage is useful for _caching_ the results, because it doesn't have to be perfect; it just helps with some repetitive queries. Whereas for overriding, I think it is much more interesting to override _particular_ diff. E.g., to say "I am merging X and Y, and please pretend that Y renamed "foo" to "bar" when you do rename detection. And in that sense, your "git log" example can be considered a special-case of this: you are saying that the diff from $commit to $commit^ is done frequently, so rather than saying "please pretend..." each time, you would like to store the information forever. And storing it in the commit message or a note is one way of doing that. I don't think there's anything fundamentally _wrong_ with that, but I kind of question its usefulness. In other words, what is the point in doing so? If it is inform the user that semantically the commit did a rename, even though the content changed enough that rename detection does not find it, then I would argue that you should simply state it in the commit message (or in a human-readable git-note, if it was only realized after the fact). But there is not much point in making it machine-readable, since the interesting machine-readable things we do with renames are: 1. Show the diff against the rename src, which can often be easier to read. Except that if rename detection did not find it, it is probably _not_ going to be easier to read. 2. Applying content to the destination of a merge. But you're almost never doing the diff between a commit and its parent, so the information would be useless. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html