On Wed, Aug 01, 2012 at 03:10:55PM -0700, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
>
> > On Tue, Jul 31, 2012 at 11:01:27PM -0700, Junio C Hamano wrote:
> > ...
> >> As we still have the pathname in this codepath, I am wondering if we
> >> would benefit from custom "content hash" that knows the nature of
> >> payload than the built-in similarity estimator, driven by the
> >> attribute mechanism (if the latter is the case, that is).
> >
> > Hmm. Interesting. But I don't think that attributes are a good fit here.
> > They are pathname based, so how do I apply anything related to
> > similarity of a particular version by pathname? IOW, how does it apply
> > in one tree but not another?
>
> When you move porn/0001.jpg in the preimage to naughty/00001.jpg in
> the postimage, they both can hit "*.jpg contentid=jpeg" line in the
> top-level .gitattribute file, and the contentid driver for jpeg type
> may strip exif and hash the remainder bits in the image to come up
> with a token you can use in a similar way as object ID is used in
> the exact rename detection phase.
>
> Just thinking aloud.

Ah, I see. That still feels like way too specific a use case to me. A
much more general use case to me would be a contentid driver which
splits the file into multiple chunks (which can be concatenated to
arrive at the original content), and marks chunks as "OK to delta" or
"not able to delta". In other words, a content-specific version of the
bup-style splitting that people have proposed.

Assuming we split a jpeg into its EXIF bits (+delta) and its image bits
(-delta), then you could do a fast rename or pack-objects comparison
between two such files (in fact, with chunked object storage,
pack-objects can avoid looking at the image parts at all).

However, it may be the case that such "smart" splitting is not
necessary, as stupid and generic bup-style splitting may be enough.
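
To make the "stupid and generic" option concrete, here is a rough sketch
of content-defined splitting. It is not bup's actual algorithm and not
anything git implements: the plain rolling byte sum, the WINDOW and MASK
values, and the names split_chunks and chunk_ids are all assumptions
chosen for illustration. It also does not model the "OK to delta" flag;
it only shows that chunk boundaries follow the content, so a small edit
near the start of a file leaves most later chunks byte-identical.

```python
import hashlib
import random

# Both values are illustrative assumptions, not bup's real parameters.
WINDOW = 48          # number of trailing bytes covered by the rolling sum
MASK = (1 << 6) - 1  # a cut is allowed when the low 6 bits are all set


def split_chunks(data: bytes) -> list:
    """Split data at content-defined boundaries.

    A boundary is placed after byte i when the sum of the last WINDOW
    bytes has all MASK bits set and the current chunk is at least
    WINDOW bytes long. Concatenating the chunks gives back data.
    """
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # drop the byte leaving the window
        if i + 1 - start >= WINDOW and (rolling & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def chunk_ids(chunks) -> set:
    """Hash each chunk so two files can be compared by chunk identity."""
    return {hashlib.sha1(c).hexdigest() for c in chunks}


if __name__ == "__main__":
    rng = random.Random(0)
    original = bytes(rng.randrange(256) for _ in range(20000))
    # Insert three bytes near the start, shifting everything after them.
    edited = original[:100] + b"XYZ" + original[100:]

    a = chunk_ids(split_chunks(original))
    b = chunk_ids(split_chunks(edited))
    print(len(a), "chunks before,", len(b), "after,", len(a & b), "shared")
```

With fixed-size blocks the three inserted bytes would shift every later
block; here the boundaries resynchronize shortly after the edit, which is
what would let a rename or pack-objects comparison work on chunk IDs
rather than full contents.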
I really need to start playing with the patches you wrote last year that
started in that direction.

-Peff