Hi,

On Thu, Jan 31, 2019 at 12:57 AM Elijah Newren <newren@xxxxxxxxx> wrote:
> git-filter-repo[1], a filter-branch-like tool for rewriting repository
> history, is ready for more widespread testing and feedback. The rough

Someone at the Contributor Summit (Michael Haggerty perhaps?) asked me
about performance numbers on known repositories for filter-repo and
how it compared to other tools; I gave extremely rough estimates, but
here I belatedly provide some more detailed figures. In each case, I
report both filtering time, and cleanup (gc or clone) time[0]:

Testcase 1: Remove a single file (configure.ac) from each commit in git.git:
  * filter-branch[1a]:  2413.978s + 34.812s
  * BFG (8-core)[1b]:     38.743s + 30.333s
  * BFG (40-core)[1b]:    24.680s + 35.165s
  * filter-repo[1c]:      35.582s + 15.690s
  Caveats: filter-repo failed and needed workarounds; see [1d]

Testcase 2: Keep two directories (guides/ and tools/) from rails.git:
  * filter-branch[2a]: 14586.655s + 22.726s
  * BFG (8-core)[2b]:     27.675s + 15.786s
  * BFG (40-core)[2b]:    24.883s + 20.463s
  * filter-repo[2c]:      10.951s + 12.500s
  Caveats: filter-branch failed at the end of this operation; see [2d].
           AFAICT, BFG can't do this operation; used approximations
           instead[2e].

Testcase 3: Replacing one string with another throughout all files in linux.git:
  * filter-branch[3a]: Estimated at about 3.5 months (~8.9e6 seconds)
  * BFG (8-core)[3b]:   2144.904s + 693.79s
  * BFG (40-core)[3b]:  1178.577s + 636.887s
  * filter-repo[3c]:    1203.147s + 159.620s
  Caveats: filter-branch failed at ~12 hours; see [3d].

Other details about measurements at [4].
Take-aways and biased opinions at [5].

Hope this was interesting,
Elijah

*************** Footnotes (Minutiae for the curious) ***************

[0] git-filter-branch's manpage suggests re-cloning to get rid of old
objects, BFG as its last step provides the user commands to execute in
order to clean out old objects, and filter-repo automatically runs
such commands.
As such, time of post-run gc seems like a relevant thing to report.
Commands used and timed:
  * filter-branch:
      time git clone file://$(pwd) ../nuke-me-clone
  * BFG:
      git reflog expire --expire=now --all && time git gc --prune=now
  * filter-repo:
      N/A (internally runs same commands as I manually ran for BFG)

[1a]
    time git filter-branch --index-filter 'git rm --quiet --cached --ignore-unmatch configure.ac' --tag-name-filter cat --prune-empty -- --all

[1b]
    time java -jar ~/Downloads/bfg-1.13.0.jar --delete-files configure.ac

[1c]
    git tag | grep v1.0rc | xargs git tag -d
    git tag -d junio-gpg-pub
    time git filter-repo --path configure.ac --invert-paths

[1d] git fast-export when run with certain flags will abort in repos
with tags of blobs or tags of tags. I had to first delete 7 tags to
get this testcase to run, as shown in the commands above in [1c].
I'll probably patch fast-export to fix this.

[2a]
    time git filter-branch --index-filter 'git ls-files -z | tr "\0" "\n" | grep -v -e ^guides/ -e ^tools/ | tr "\n" "\0" | xargs -0 git rm --quiet --cached --ignore-unmatch' --tag-name-filter cat --prune-empty -- --all

[2b]
    git log --format=%n --name-only | sort | uniq | grep -v ^$ > all-files.txt
    time java -jar ~/Downloads/bfg-1.13.0.jar --delete-folders "{$(grep / all-files.txt | sed -e 's/"//' -e s%/.*%% | uniq | grep -v -e guides -e tools | tr '\n' ,)}" --delete-files "{$(comm -23 <(grep -v / all-files.txt) <(grep -e guides/ -e tools/ all-files.txt | sed -e s%.*/%% | sort) | tr '\n' ,)}"

[2c]
    time git filter-repo --path guides --path tools

[2d] filter-branch fails at the very end when noting which refs were
deleted/rewritten with:
    error: cannot lock ref 'refs/tags/v0.10.0': is at b68b47672e613e94a7859c9549e9cd4b401f7b79 but expected e2724aa1856253f4fc48ddc251583042c5f06029
    Could not delete refs/tags/v0.10.0
Turns out b68b47672e613e94a7859c9549e9cd4b401f7b79 is an annotated tag
in the original repo pointing to the commit
e2724aa1856253f4fc48ddc251583042c5f06029.
I do not know the cause of this bug, but since it was almost at the
very end, I just reported the time used before it hit this error.

[2e] Unless I am misunderstanding, BFG is not capable of this
filtering operation because it uses basenames for --delete-files and
--delete-folders, and some names appear in several directories
(e.g. .gitignore, Rakefile, tasks). As such, with the BFG you either
have to delete files/directories that shouldn't be deleted, or leave
files and folders around that you wanted to have deleted. The command
in [2b] has some of both, but should still give a good estimate of how
long it would take BFG to do this kind of operation if file and
directory basenames in the rails repository happened to be named
uniquely.

[3a]
    time git filter-branch -d /dev/shm/tmp --tree-filter 'git ls-files | xargs sed -i s/secretly/covertly/' --tag-name-filter cat -- --all

[3b]
    time java -jar ~/Downloads/bfg-1.13.0.jar --replace-text <(echo 'secretly==>covertly')

[3c]
    time git filter-repo --replace-text <(echo 'secretly==>covertly')

[3d] filter-branch failed after 45704 seconds, predicting another
8836429 seconds (~102 days) remaining at the time. As commits earlier
in history tend to be smaller, filter-branch nearly always
underestimates the time required, sometimes considerably.
filter-branch failed on commit
af25e94d4dcfb9608846242fabdd4e6014e5c9f0 due to an empty ident. I
possibly could have worked around it with --env-filter, but it's not
like I'm going to wait for it to finish anyway.

[4] Other notes about timings:
  * All tests were run on an 8-CPU system, except for the "BFG
    40-core" tests, which were run on a 40-core system. (filter-branch
    and filter-repo are not multi-threaded and gain nothing from more
    cores.)
  * More precisely, I ran on AWS with an m4.2xlarge with two 50-GB GP2
    volumes (150 IOPS) for tests. The 40-core system was an
    m4.10xlarge.
  * Before each command, to try to avoid warm disk caches helping or
    hurting depending on the order I ran commands in, I first ran:
      * rsync -az --delete ../$REPO-orig/ ./
      * git status
      * $TOOL -h
  * Testing was imperfect; I just ran once and recorded the time. It
    took long enough to gather the data as it was.
  * When additional commands were needed for the filtering
    (e.g. getting the all-files.txt list to generate the BFG command,
    or deleting tags that fast-export couldn't handle for
    filter-repo), I did not include the times of those commands in the
    overall execution time. It would have added a few hundredths of a
    second to filter-repo's git.git time, and about 5-6 seconds to
    BFG's rails.git time.
  * filter-repo self-reports time until filtering finishes and time
    until entirely done. I took the difference between its self-report
    of overall time and the "time" command's report of overall time
    (which was typically on the order of 0.1s), and added that to
    filter-repo's filtering time, assuming that most of the
    discrepancy would be due to python startup.

[5] Performance is only one measurement. Features, capabilities,
usability, etc. matter too. filter-branch is a general purpose
filtering tool, but in my opinion, not a good one -- and not just
because of performance. BFG Repo Cleaner is a good tool, but it is
special purpose; it is designed for a few particular use cases
(limiting the kinds of things I could try in my comparison above). My
hope is that filter-repo serves as a good general purpose filtering
tool so that people can stop suffering from filter-branch.
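
For readers who want to compare the tools on total wall time (filtering plus cleanup), the reported figures can be combined with a few lines of shell. This is only a rough sketch: the helper names total and ratio are made up for illustration and are not part of any of the tools, and the numbers are simply copied from Testcase 1 and footnote [3d]. The last line adds the elapsed and predicted-remaining seconds from [3d], which is where the ~8.9e6 second estimate for Testcase 3 comes from.

```shell
#!/bin/sh
# Hypothetical helper: sum filtering time and cleanup time (seconds).
total() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", a + b }'
}

# Hypothetical helper: how many times larger $1 is than $2.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Testcase 1 (git.git, remove configure.ac); figures copied from the table.
fb=$(total 2413.978 34.812)     # filter-branch
bfg8=$(total 38.743 30.333)     # BFG, 8-core
fr=$(total 35.582 15.690)       # filter-repo

echo "filter-branch: ${fb}s  BFG (8-core): ${bfg8}s  filter-repo: ${fr}s"
echo "filter-branch / filter-repo: $(ratio "$fb" "$fr")x"
echo "BFG (8-core)  / filter-repo: $(ratio "$bfg8" "$fr")x"

# Footnote [3d]: elapsed seconds plus predicted remaining seconds, in days.
awk 'BEGIN { s = 45704 + 8836429; printf "%d s = %.1f days\n", s, s / 86400 }'
```

Since each measurement was a single run, ratios like these are indicative at best; small differences between BFG and filter-repo should not be read as significant.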