On Thu, Jan 31, 2019 at 8:09 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Elijah Newren <newren@xxxxxxxxx> writes:
>
> > git-filter-repo[1], a filter-branch-like tool for rewriting repository
> > history, is ready for more widespread testing and feedback. The rough
> > edges I previously mentioned have been fixed, and it has several useful
> > features already, though more development work is ongoing (docs are a
> > bit sparse right now, though -h provides some help).
> >
> > Why filter-repo vs. filter-branch?
>
> How does it compare with bfg-repo-cleaner?  Somehow I was led to
> believe that all serious users of filter-branch like functionality
> are using bfg-repo-cleaner instead.

No, bfg-repo-cleaner only covers an important subset of the use cases.

bfg-repo-cleaner does a really good job if your goal is to remove a few
big files and/or to remove some sensitive text (matched via regexes)
from all blobs.  It was designed for that specific role and has more
options in this area than filter-repo currently has.  But even within
this design space it was optimized for, it is missing two things that I
really want:

  * pruning of commits which become empty due to filtering

  * providing a way for the user to know what needs to be cleaned up.
    It has options like --strip-blobs-bigger-than <size> or
    --strip-biggest-blobs <NUM>, but no way for the user to figure out
    what <size> or <NUM> should be.  Also, since it just focuses on
    really big blobs, it misses cases like someone checking in
    directories with a huge number of small-to-moderately sized files
    (e.g. bower_components/ or node_modules/, though these could also
    contain a few big blobs too), or someone checking in a lot of
    moderately sized files of a uniform extension (e.g. .webm, .tar.gz,
    .zip, .mp4, .avi).  I've seen cases in the wild where the correct
    cleaning of history was more about filtering out directories or
    extensions than a couple big files.
    filter-repo's --analyze option creates some reports that help with
    this tremendously.  Also, the options to delete files by
    glob/basename overlook the fact that renames may have occurred.
    Having a report that mentions renames that have occurred in history
    (also part of filter-repo's --analyze option) can be very helpful.

Outside of this specific use case, bfg-repo-cleaner is not very useful.
It simply lacks more general filtering capabilities:

  * While bfg-repo-cleaner has facilities to remove certain paths, it
    has none to say you only want to keep certain paths.  Unlike
    filter-branch, where you can use a pipeline to list all files, grep
    to remove the ones you want to keep from the list, then pipe the
    remainder of paths to xargs git rm, bfg-repo-cleaner doesn't have a
    facility for shell commands.  Instead, in bfg-repo-cleaner you would
    need to emulate this by exhaustively listing directories and
    paths/globs of file basenames to delete, but that assumes the user
    knows all paths that have ever existed, making this solution not
    only onerous but error-prone.  More of the filterings I see these
    days are about just keeping a directory (or perhaps a handful of
    them) rather than just removing or cleaning a few files.  Also, this
    makes pruning of commits which become empty much more important,
    but as noted above, bfg-repo-cleaner lacks that ability.

  * It has no facilities for renaming paths.  You'd have to use a
    different tool to do that, but then why not use the other tool to
    do the whole job?  Even if you do decide to use both tools, some
    capabilities of one tool can be neutered by such an approach (e.g.
    bfg-repo-cleaner's carefully rewritten commit messages that tried
    to ensure abbreviated commit shas referred to the new commit ids).

  * It has no facilities for affecting other parts of history, such as
    changing author/committer/tagger names or emails, changing commit
    timestamp or timezone, reparenting commits, splicing repository
    histories together, filtering files differently based on commit
    timestamp, etc. -- all of which can be done with filter-repo
    (though some of those things require writing a small python script;
    see basic examples in t/lib-usage/*).

Personally, I also find it kind of annoying that bfg-repo-cleaner
doesn't automatically repack and shrink the repo when it is done and
instead prints multiple commands the user can run to achieve that, even
though it's the core use case for the tool.  Granted, they may have had
last-ditch recovery-of-the-original-repo in mind in case the user ran
the tool in a repository they shouldn't have, but I much prefer to have
the tool just check if the repo looks like a fresh clone and bail if
not, so that users have a far easier recovery mechanism -- just throw
away the clone you were filtering and re-clone.  Once you do that, auto
repacking and shrinking is pretty natural.  (And you can always provide
a --force option to allow filtering & rewriting in a repo that isn't a
fresh clone.)

Elijah