I recently released The BFG Repo-Cleaner, a new tool for cleansing bad data out of Git repository histories. The BFG is typically at least 10-50x faster than git-filter-branch at these tasks:

 * Removing Crazy Big Files from repo history
 * Removing Passwords, Credentials & other Private data

http://rtyley.github.com/bfg-repo-cleaner/

As an example, these are timings for deleting an arbitrary file from the large GCC repository (148495 commits):

The BFG : 3m29s
$ bfg -D README-fixinc

git filter-branch : 472m31s
$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch gcc/README-fixinc' --prune-empty --tag-name-filter cat -- --all

(roughly a 135x speed increase, reducing the task of processing a large codebase from an overnight job to the work of a few minutes. All timings were done in a 4GB tmpfs ramdisk.)

The BFG has some simple but very powerful command-line options, which perform at similar speed:

remove all blobs bigger than 1 megabyte :
$ bfg --strip-blobs-bigger-than 1M my-repo.git

replace all passwords (listed in a file 'passwords.txt') with ***REMOVED*** :
$ bfg --replace-banned-strings passwords.txt my-repo.git

The BFG's performance advantage comes mainly from avoiding repeated examination of the same tree objects. git-filter-branch filters each commit in turn, against the complete file hierarchy of that commit, even though the trees of successive commits are largely identical. For the use-cases of The BFG that's unnecessary - we don't care where, or in which commit, a 'bad' file exists - we just want it dealt with. Consequently the BFG processes the Git object db on a memoised tree-by-tree basis, processing each and every file & folder exactly once - the final processing of the commit hierarchy is then very quick. This _does_ mean that it's not possible to delete files based on their absolute path within the repo, but they can be deleted based on their filename, blob-id, or contents.
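
To make the memoisation idea concrete, here is a minimal sketch in Python. It is not the BFG's actual code: the in-memory object store, the `Cleaner` class and the `build_demo` helper are all hypothetical names invented for illustration, and trees are modelled as plain tuples rather than real Git objects. It only shows the general technique of cleaning each distinct tree once and reusing the result wherever that tree appears again.

```python
import hashlib


def tree_id(entries):
    """Content-address a tree, loosely mimicking how Git ids depend on content."""
    payload = "\n".join(f"{kind} {oid} {name}" for name, kind, oid in entries)
    return hashlib.sha1(payload.encode()).hexdigest()


class Cleaner:
    """Removes blobs by filename, cleaning each distinct tree only once."""

    def __init__(self, trees, banned_names):
        self.trees = dict(trees)        # tree id -> tuple of (name, kind, id)
        self.banned = set(banned_names)
        self.memo = {}                  # original tree id -> cleaned tree id
        self.trees_examined = 0

    def clean_tree(self, tid):
        if tid in self.memo:            # already cleaned: reuse the result
            return self.memo[tid]
        self.trees_examined += 1
        cleaned = []
        for name, kind, oid in self.trees[tid]:
            if kind == "blob":
                if name in self.banned:
                    continue            # drop the unwanted file
                cleaned.append((name, kind, oid))
            else:
                # Recurse into the subtree; the memo makes repeats free
                cleaned.append((name, kind, self.clean_tree(oid)))
        entries = tuple(cleaned)
        new_id = tree_id(entries)
        self.trees[new_id] = entries
        self.memo[tid] = new_id
        return new_id

    def clean_commits(self, commit_roots):
        """Map each commit's root tree to its cleaned replacement."""
        return [self.clean_tree(root) for root in commit_roots]


def build_demo():
    """Three commits whose root trees share the same 'docs' and 'gcc' subtrees."""
    trees = {}

    def put(*entries):
        tid = tree_id(entries)
        trees[tid] = tuple(entries)
        return tid

    gcc = put(("README-fixinc", "blob", "b1"), ("main.c", "blob", "b2"))
    docs = put(("intro.txt", "blob", "b3"))
    root1 = put(("docs", "tree", docs), ("gcc", "tree", gcc))
    root2 = put(("NEWS", "blob", "b4"), ("docs", "tree", docs), ("gcc", "tree", gcc))
    root3 = put(("NEWS", "blob", "b5"), ("docs", "tree", docs), ("gcc", "tree", gcc))
    return trees, [root1, root2, root3]


if __name__ == "__main__":
    trees, roots = build_demo()
    cleaner = Cleaner(trees, {"README-fixinc"})
    cleaner.clean_commits(roots)
    print("trees examined:", cleaner.trees_examined)
```

In this toy history the three commits contain five distinct trees between them, so the memoised cleaner examines five trees, where a commit-by-commit walk would visit nine (three per commit). The sketch also shows why matching is by filename rather than by absolute path: a shared subtree is cleaned without knowing which paths or commits it is reachable from.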
This memoisation, together with multi-core processing by default, gives the dramatic speed-up while still providing the same results. There's more performance data here:

https://docs.google.com/spreadsheet/ccc?key=0AsR1d5Zpes8HdER3VGU1a3dOcmVHMmtzT2dsS2xNenc

I'd welcome feedback, and if anyone has cause to filter a repository's history in future, I'd appreciate you giving the BFG a try and letting me know how you found it.

thanks,
Roberto Tyley
software dev @ The Guardian
http://rtyley.github.com/bfg-repo-cleaner/