Re: [PATCH 0/3] fast textconv

Michael J Gruber <git@xxxxxxxxxxxxxxxxxxxx> · Sun, 28 Mar 2010 18:09:35 +0200

Jeff King venit, vidit, dixit 28.03.2010 16:53:
> The normal textconv procedure is to dump the binary file to a tempfile
> (optionally using a working tree file if available), then run the
> textconv helper to produce a textual version on stdout. This is a very
> convenient interface, as helpers don't need to be aware of git at all
> and many standard commands can be used without wrappers.
> 
> Unfortunately, it can be slow for large binary files. We spool the file
> to disk before invoking the textconv helper, so the helper has no way to
> do any optimizations. For example, the helper may need only part of the
> file (e.g., when showing metadata at the beginning of a media file), or
> it may implement a caching scheme to avoid repeating expensive
> conversions.
> 
> This series introduces a "fast textconv", which does not automatically
> spool a tempfile, but instead gives the helper program the sha1 of the
> blob to be converted.
> 
> Here are some timings from my photo repository, on a commit with 37
> JPEGs and 8 AVIs. Each file had two lines added to its exif metadata.
> My textconv helper is a perl script that dumps the exif tags, and
> implements its own caching scheme.
> 
>   $ time git show >/dev/null ;# before patch
>   real    0m13.818s
>   user    0m12.137s
>   sys     0m1.552s
> 
>   $ time git show >/dev/null ;# after patch, first run
>   real    0m15.076s
>   user    0m13.321s
>   sys     0m1.772s
> 
>   $ time git show >/dev/null ;# after patch, subsequent runs
>   real    0m2.502s
>   user    0m1.820s
>   sys     0m0.592s
> 
> So you can see a 5.5x speedup. The first run is a little bit slower,
> presumably due to the extra git-cat-file calls by the helper.
> 
> The speedup is purely from caching; I am not using the "we only need to
> read the first part of the file" optimization. My files are only a few
> megabytes. Probably that would be more useful for people storing files
> in the hundreds of megabytes, where a full cat-file will cause a lot of
> unwanted I/O.
> 
> There are two things I'm still not 100% happy with:
> 
>  1. 2.5 seconds is still a little slower than I would like. The slowness
>     comes from the fact that my helper is written in perl, and therefore
>     perl gets invoked for each diff. I could try collecting all of the
>     to-be-textconv'd files at the beginning of the diff process and just
>     invoking the helper once. But that means we need to store the
>     results in core, and they could potentially be long (in my case,
>     they are only a few hundred bytes, but somebody could potentially be
>     textconv'ing a large documents).
> 
>  2. It is up to the helper to implement a caching layer. This offers a
>     lot of flexibility, but it means each helper must implement its own.
>     It also means we have to run the helper even for a cache hit, which
>     causes slowness.
> 
>     An alternative would be for git to support textconv caching
>     natively, probably by using the notes mechanism to map blob sha1's
>     to their textconv'd contents. But that opens a whole can of worms
>     with how the cache is managed. If I change my textconv helper to
>     produce different results, how do I invalidate the cache? Should it
>     happen automatically if I change the contents of
>     diff.$method.textconv? Or do I need to do it manually (you will
>     still need to do it manually if, e.g., you upgrade your textconv
>     helper. Git can't know about that). How do I evict entries if the
>     cache gets too large when notes are stored as a history?

Really, "Notes!" was my first thought even before reading 2. Happy to
have found a like mind :)

This would still need a mechanism where the conv helper gets the blob's
SHA1 - hey, it's there in your patch...

How about:

Set fasttextconv=notestextconv

notestextconv does the following:

- If $sha1 has a note in refs/notes/bikeshed display it.
- If not create one and then display it.

In fact, the creation could be done using the textconv setting!

Pruning the cache is done be deleting the refs/notes/bikeshed ref,
truncating it by truncating it's DAG (filter-branch...).

Cheers,
Michael
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html