A fast alternative to git-filter-branch - The BFG Repo-Cleaner

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I recently released The BFG Repo-Cleaner, a new tool for cleansing bad
data out of Git repository histories. The BFG is typically at least
10-50x faster than git-filter-branch at these tasks:

* Removing Crazy Big Files from repo history
* Removing Passwords, Credentials & other Private data

http://rtyley.github.com/bfg-repo-cleaner/

As an example, these are timings for deleting an arbitrary file from
the large GCC repository (148495 commits):

The BFG : 3m29s
$ bfg -D README-fixinc

git filter-branch : 472m31s
$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch
gcc/README-fixinc' --prune-empty --tag-name-filter cat -- --all

(roughly a 135x speed increase, reducing the task of processing a
large codebase from an overnight job to the work of a few minutes....
all timings done in a 4GB tmpfs ramdisk)


The BFG has some simple but very powerful command-line options, which
perform at similar speed:

remove all blobs bigger than 1 megabyte :
$ bfg --strip-blobs-bigger-than 1M  my-repo.git

replace all passwords (listed in a file 'passwords.txt') with ***REMOVED*** :
$ bfg --replace-banned-strings passwords.txt  my-repo.git


The main source of the BFG's performance advantage comes from
preventing repeated examination of the same tree objects. The approach
of git-filter-branch performs filtering for each commit, against the
complete file-hierarchy of each commit, one after the other, even
though commit trees are largely very similar. For the use-cases of The
BFG that's unnecessary- we don't care where, and in which commit, a
'bad' file exists - we just want it dealt with. Consequently the BFG
processes the Git object db on a memoised tree-by-tree basis,
processing each and every file & folder exactly once - the final
processing of the commit hierarchy is very quick. This _does_ mean
that it's not possible to delete files based on their absolute path
within the repo, but they can deleted based on their filename,
blob-id, or contents. This, and multi-core processing by default,
gives the dramatic speed-up while still providing the same results.
There's more performance data here:
https://docs.google.com/spreadsheet/ccc?key=0AsR1d5Zpes8HdER3VGU1a3dOcmVHMmtzT2dsS2xNenc

I'd welcome feedback, and if anyone has cause to filter a repository's
history in future, I'd appreciate you giving the BFG a try and letting
me know how you found it.

thanks,
Roberto Tyley
software dev @ The Guardian

http://rtyley.github.com/bfg-repo-cleaner/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]