On Thu, Aug 02, 2012 at 03:51:17PM -0700, Junio C Hamano wrote:
> > On Wed, Aug 01, 2012 at 03:10:55PM -0700, Junio C Hamano wrote:
> > ...
> >> When you move porn/0001.jpg in the preimage to naughty/00001.jpg in
> >> the postimage, they both can hit the "*.jpg contentid=jpeg" line in
> >> the top-level .gitattributes file, and the contentid driver for the
> >> jpeg type may strip the exif and hash the remaining bits of the
> >> image to come up with a token you can use in a similar way to how
> >> the object ID is used in the exact rename detection phase.
> >>
> >> Just thinking aloud.
> >
> > Ah, I see. That still feels like way too specific a use case to me.
> > A much more general use case would be a contentid driver which
> > splits the file into multiple chunks (which can be concatenated to
> > arrive at the original content) and marks each chunk as "OK to
> > delta" or "not able to delta". In other words, a content-specific
> > version of the bup-style splitting that people have proposed.
> >
> > Assuming we split a jpeg into its EXIF bits (+delta) and its image
> > bits (-delta), then you could do a fast rename or pack-objects
> > comparison between two such files (in fact, with chunked object
> > storage, pack-objects can avoid looking at the image parts at all).
> >
> > However, it may be the case that such "smart" splitting is not
> > necessary, as stupid and generic bup-style splitting may be enough.
> > I really need to start playing with the patches you wrote last year
> > that started in that direction.
>
> I wasn't interested in "packing split object representation",
> actually. The idea was still within the context of "rename".

But it would work for rename, too. If you want to compare two files,
the driver would give you back { sha1_exif (+delta), sha1_image
(-delta) } for each file, and you would know the size of each chunk
as well as the size of the whole file.

Then you would just compare sha1_image for each entry. If they match,
you have a lower bound on similarity of image_chunk_size / total_size.
If they don't, you have an upper bound on similarity of
1 - (image_chunk_size / total_size). In the former case, you can get
the exact similarity by doing a real delta on the sha1_exif content.
In the latter case, you can either exit early (if you are already
below the similarity threshold, which is likely) or do the delta on
the sha1_exif content to get an exact value.

But either way, you never have to do a direct comparison of the big
image data; you only need to know the sha1s. And as a bonus, if you
did want to cache results, you could keep an O(# of blobs) cache of
the chunked sha1s (because that information is immutable for a given
sha1 and content driver), whereas by caching the result of
estimate_similarity, our worst-case cache is the square of that
(because we are storing sha1 pairs).

-Peff
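
P.S. To make the comparison step concrete, here is a rough sketch in
C. None of this is real git code: the chunked_id struct, delta_score(),
and score_pair() are all made up for illustration, the score is scaled
to 0..100, and it glosses over how the two files' chunk sizes should be
combined into a single weighting.

  /*
   * Illustration only: nothing here is git's actual API.  Once the
   * contentid driver has reduced each blob to { sha1_exif, sha1_image }
   * plus sizes, the rename score can be bounded without ever opening
   * the image chunk.
   */
  #include <string.h>

  struct chunked_id {
          unsigned char sha1_exif[20];   /* +delta chunk (small EXIF data) */
          unsigned char sha1_image[20];  /* -delta chunk (big image data) */
          unsigned long exif_size;
          unsigned long image_size;
  };

  /*
   * Stand-in for the usual byte-wise similarity estimate, run on the
   * contents named by the two ids and scaled to 0..100.  It is only
   * ever fed the small +delta chunks.
   */
  extern int delta_score(const unsigned char *a, const unsigned char *b);

  /* Returns a similarity score 0..100, or -1 for "reject without delta". */
  static int score_pair(const struct chunked_id *a,
                        const struct chunked_id *b,
                        int minimum_score)
  {
          unsigned long total = a->exif_size + a->image_size;
          int image_pct = (int)(a->image_size * 100 / total);
          int exif_pct = 100 - image_pct;

          if (!memcmp(a->sha1_image, b->sha1_image, 20)) {
                  /*
                   * Image chunks are identical: similarity is at least
                   * image_pct; refine it with a delta over the cheap
                   * EXIF chunks only.
                   */
                  int exif_sim = delta_score(a->sha1_exif, b->sha1_exif);
                  return image_pct + exif_pct * exif_sim / 100;
          }

          /*
           * Image chunks differ: similarity can be at most exif_pct,
           * i.e. 1 - (image_chunk_size / total_size), so most pairs
           * can be rejected here without touching any file contents.
           */
          if (exif_pct < minimum_score)
                  return -1;
          return exif_pct * delta_score(a->sha1_exif, b->sha1_exif) / 100;
  }

The point being that score_pair() never needs the image bytes at all,
and the chunked_id struct is exactly the per-blob, immutable piece of
data you could cache keyed by (blob sha1, content driver).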