On Tue, Apr 22, 2008 at 12:51:14PM -0400, Avery Pennarun wrote:

> Do you think git would benefit from having a generalized version of
> this script?  Basically, the user provides a "munge" script on the
> command line, and there's a git-filter-branch mode for auto-munging
> (with a cache) every file in every checkin.  Even if it's *only* ever
> used for CRLF, I can imagine this being useful to a lot of people.

It was easy enough to work up the patch below, which allows

  git filter-branch --blob-filter 'tr a-z A-Z'

However, it's _still_ horribly slow. Shell script is nice and flexible,
but running a tight loop like this is just painful. I suspect
filter-branch in something like perl would be a lot faster and just as
flexible (you could even do it in C, but you'd probably have to invent a
little domain-specific scripting language).

It is still much better performance than a tree filter, though:

  $ cd git && time git filter-branch --tree-filter '
      find . -type f | while read f; do
        tr a-z A-Z <"$f" >tmp
        mv tmp "$f"
      done
    ' HEAD~10..HEAD
  real    4m38.626s
  user    1m32.726s
  sys     2m51.163s

  $ cd git && git filter-branch --blob-filter 'tr a-z A-Z' HEAD~10..HEAD
  real    1m40.809s
  user    0m36.822s
  sys     1m14.273s

Lots of system time in both. I'm sure we spend a fair bit of time
hitting our very large map and blob-cache directories, which would be
much more nicely implemented as associative arrays in memory (if we were
using a more featureful language).

Anyway, here is the patch. I don't know if it is even worth applying,
since it is still painfully slow.

---
 git-filter-branch.sh |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index 333f6a8..0602b25 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -54,6 +54,23 @@ EOF
 
 eval "$functions"
 
+munge_blobs() {
+	while read mode sha1 stage path
+	do
+		if ! test -r "$workdir/../blob-cache/$sha1"
+		then
+			new=`git cat-file blob $sha1 |
+			     eval "$filter_blob" |
+			     git hash-object -w --stdin`
+			printf $new >$workdir/../blob-cache/$sha1
+		fi
+		printf "%s %s\t%s\n" \
+			"$mode" \
+			$(cat "$workdir/../blob-cache/$sha1") \
+			"$path"
+	done
+}
+
 # When piped a commit, output a script to set the ident of either
 # "author" or "committer
 
@@ -105,6 +122,7 @@ tempdir=.git-rewrite
 filter_env=
 filter_tree=
 filter_index=
+filter_blob=
 filter_parent=
 filter_msg=cat
 filter_commit='git commit-tree "$@"'
@@ -150,6 +168,9 @@ do
 	--index-filter)
 		filter_index="$OPTARG"
 		;;
+	--blob-filter)
+		filter_blob="$OPTARG"
+		;;
 	--parent-filter)
 		filter_parent="$OPTARG"
 		;;
@@ -227,6 +248,9 @@ ret=0
 # map old->new commit ids for rewriting parents
 mkdir ../map || die "Could not create map/ directory"
 
+# cache rewritten blobs for blob filter
+mkdir ../blob-cache || die "Could not create blob-cache/ directory"
+
 case "$filter_subdir" in
 "")
 	git rev-list --reverse --topo-order --default HEAD \
@@ -295,6 +319,12 @@ while read commit parents; do
 	eval "$filter_index" < /dev/null ||
 		die "index filter failed: $filter_index"
 
+	if test -n "$filter_blob"; then
+		git ls-files --stage |
+		munge_blobs |
+		git update-index --index-info
+	fi
+
 	parentstr=
 	for parent in $parents; do
 		for reparent in $(map "$parent"); do
-- 
1.5.5.1.144.g4c416.dirty
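
As a rough illustration of the in-memory cache idea mentioned before the
patch: plain sh has no associative arrays, but eval'd variable names can
approximate one. The sketch below is not part of the patch and has not
been timed against it. To stay runnable outside a repository it uses
cksum as a stand-in for blob ids and for git hash-object, and the names
munge_cached, content_id, filter_blob_demo and filter_runs are all made
up for the example.

```shell
#!/bin/sh
# Sketch only: cache old-id -> new-id in shell variables instead of in
# one file per blob. cksum stands in for real blob ids (an assumption,
# so that no repository is needed).

filter_runs=0	# how many times the filter actually ran

# Hypothetical stand-in for the user's --blob-filter command.
filter_blob_demo() {
	tr a-z A-Z
}

# Print an id for stdin as "<crc>_<length>", which is safe to use as
# part of a shell variable name.
content_id() {
	set -- $(cksum)
	echo "${1}_${2}"
}

# munge_cached CONTENT: set $new_id, running the filter only on a miss.
# The result comes back in a variable rather than on stdout, because
# calling this inside $(...) would fork a subshell and lose the cache.
munge_cached() {
	old_id=$(printf '%s' "$1" | content_id)
	eval "new_id=\${cache_$old_id-}"
	if test -z "$new_id"
	then
		new_id=$(printf '%s' "$1" | filter_blob_demo | content_id)
		eval "cache_$old_id=\$new_id"
		filter_runs=$((filter_runs + 1))
	fi
}

for blob in hello world hello
do
	munge_cached "$blob"
	printf '%s -> %s\n' "$blob" "$new_id"
done
echo "filter ran $filter_runs times for 3 blobs"
```

The last line reports 2 filter runs, since the second "hello" is served
from the cache. In the real script the old id would already be known
from git ls-files --stage, so a cache hit would need no extra process
at all; here the id still has to be computed with cksum.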