Jeff King came, saw, and said on 28.03.2010 16:53:
> The normal textconv procedure is to dump the binary file to a tempfile
> (optionally using a working tree file if available), then run the
> textconv helper to produce a textual version on stdout. This is a very
> convenient interface, as helpers don't need to be aware of git at all
> and many standard commands can be used without wrappers.
>
> Unfortunately, it can be slow for large binary files. We spool the file
> to disk before invoking the textconv helper, so the helper has no way to
> do any optimizations. For example, the helper may need only part of the
> file (e.g., when showing metadata at the beginning of a media file), or
> it may implement a caching scheme to avoid repeating expensive
> conversions.
>
> This series introduces a "fast textconv", which does not automatically
> spool a tempfile, but instead gives the helper program the sha1 of the
> blob to be converted.
>
> Here are some timings from my photo repository, on a commit with 37
> JPEGs and 8 AVIs. Each file had two lines added to its exif metadata.
> My textconv helper is a perl script that dumps the exif tags, and
> implements its own caching scheme.
>
>   $ time git show >/dev/null ;# before patch
>   real    0m13.818s
>   user    0m12.137s
>   sys     0m1.552s
>
>   $ time git show >/dev/null ;# after patch, first run
>   real    0m15.076s
>   user    0m13.321s
>   sys     0m1.772s
>
>   $ time git show >/dev/null ;# after patch, subsequent runs
>   real    0m2.502s
>   user    0m1.820s
>   sys     0m0.592s
>
> So you can see a 5.5x speedup. The first run is a little bit slower,
> presumably due to the extra git-cat-file calls by the helper.
>
> The speedup is purely from caching; I am not using the "we only need to
> read the first part of the file" optimization. My files are only a few
> megabytes. Probably that would be more useful for people storing files
> in the hundreds of megabytes, where a full cat-file will cause a lot of
> unwanted I/O.
>
> There are two things I'm still not 100% happy with:
>
>   1. 2.5 seconds is still a little slower than I would like. The slowness
>      comes from the fact that my helper is written in perl, and therefore
>      perl gets invoked for each diff. I could try collecting all of the
>      to-be-textconv'd files at the beginning of the diff process and just
>      invoking the helper once. But that means we need to store the
>      results in core, and they could potentially be long (in my case,
>      they are only a few hundred bytes, but somebody could potentially be
>      textconv'ing large documents).
>
>   2. It is up to the helper to implement a caching layer. This offers a
>      lot of flexibility, but it means each helper must implement its own.
>      It also means we have to run the helper even for a cache hit, which
>      causes slowness.
>
>      An alternative would be for git to support textconv caching
>      natively, probably by using the notes mechanism to map blob sha1's
>      to their textconv'd contents. But that opens a whole can of worms
>      with how the cache is managed. If I change my textconv helper to
>      produce different results, how do I invalidate the cache? Should it
>      happen automatically if I change the contents of
>      diff.$method.textconv? Or do I need to do it manually (you will
>      still need to do it manually if, e.g., you upgrade your textconv
>      helper. Git can't know about that). How do I evict entries if the
>      cache gets too large when notes are stored as a history?

Really, "Notes!" was my first thought even before reading 2. Happy to
have found a like mind :)

This would still need a mechanism where the conv helper gets the blob's
SHA1 - hey, it's there in your patch...

How about:

Set fasttextconv=notestextconv

notestextconv does the following:
- If $sha1 has a note in refs/notes/bikeshed, display it.
- If not, create one and then display it.

In fact, the creation could be done using the textconv setting!
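
To make the two steps concrete, here is a rough sketch of such a notestextconv as a shell function. It is only an illustration under assumptions: it takes the blob's SHA1 as its first argument (as the fast textconv patch would provide), and the remaining arguments stand in for the ordinary textconv command, which gets a file name and writes text to stdout. The NOTES_REF variable and this argument convention are made up for the example and are not part of the patch.

```shell
# Hypothetical notes-backed textconv cache; names and calling
# convention are assumptions for illustration only.
NOTES_REF=${NOTES_REF:-refs/notes/bikeshed}

# usage: notestextconv <blob-sha1> <ordinary textconv command...>
notestextconv() {
    sha1=$1
    shift

    # Cache hit: the blob already has a note holding its converted text.
    if git notes --ref="$NOTES_REF" show "$sha1" 2>/dev/null; then
        return 0
    fi

    # Cache miss: spool the blob to a tempfile, run the ordinary helper
    # on it, store the result as a note, then display the note.
    tmp=$(mktemp -d) || return 1
    git cat-file blob "$sha1" >"$tmp/blob" &&
        "$@" "$tmp/blob" >"$tmp/text" &&
        git notes --ref="$NOTES_REF" add -f -F "$tmp/text" "$sha1" >/dev/null &&
        git notes --ref="$NOTES_REF" show "$sha1"
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

On a miss the sketch prints the stored note rather than the helper's raw output, so hits and misses look the same. As far as I know, git notes add -F cleans up whitespace in the note text by default, so the cached text can differ slightly from what the helper wrote.
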

Pruning the cache is done by deleting the refs/notes/bikeshed ref, or
truncating it by truncating its DAG (filter-branch...).

Cheers,
Michael
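
For the simple "delete the ref" variant of pruning, a minimal sketch could look like the following. The function name is hypothetical, and the ref defaults to the refs/notes/bikeshed placeholder used above. It only drops the ref; the note objects themselves stay in the object database until git gc prunes them.

```shell
# Hypothetical helper: throw away the whole textconv cache by
# deleting the notes ref that holds it.
prune_textconv_cache() {
    ref=${1:-refs/notes/bikeshed}

    # Nothing to do if the cache ref was never created.
    git rev-parse --verify --quiet "$ref" >/dev/null || return 0

    git update-ref -d "$ref"
}
```
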