On Wed, Oct 31, 2018 at 12:16 PM Lars Schneider <larsxschneider@xxxxxxxxx> wrote:
>
> On Sep 24, 2018, at 7:24 PM, Elijah Newren <newren@xxxxxxxxx> wrote:
> > On Sun, Sep 23, 2018 at 6:08 AM Lars Schneider <larsxschneider@xxxxxxxxx> wrote:
> >>
> >> Hi,
> >>
> >> I recently had to purge files from large Git repos (many files, many commits).
> >> The usual recommendation is to use `git filter-branch --index-filter` to purge
> >> files. However, this is *very* slow for large repos (e.g. it takes 45min to
> >> remove the `builtin` directory from git core). I realized that I can remove
> >> files *way* faster by exporting the repo, removing the file references,
> >> and then importing the repo (see Perl script below, it takes ~30sec to remove
> >> the `builtin` directory from git core). Do you see any problem with this
> >> approach?
> >
> > It looks like others have pointed you at other tools, and you're
> > already shifting to that route. But I think it's a useful question to
> > answer more generally, so for those that are really curious...
> >
> >
> > The basic approach is fine, though if you try to extend it much you
> > can run into a few possible edge/corner cases (more on that below).
> > I've been using this basic approach for years and even created a
> > mini-python library[1] designed specifically to allow people to create
> > "fast-filters", used as
> >    git fast-export <options> | your-fast-filter | git fast-import <options>
> >
> > But that library didn't really take off; even I have rarely used it,
> > often opting for filter-branch despite its horrible performance or a
> > simple fast-export | long-sed-command | fast-import (with some extra
> > pre-checking to make sure the sed wouldn't unintentionally munge other
> > data). BFG is great, as long as you're only interested in removing a
> > few big items, but otherwise doesn't seem very useful (to be fair,
> > it's very upfront about only wanting to solve that problem).
> >
> > Recently, due to continuing questions on filter-branch and folks still
> > getting confused with it, I looked at existing tools, decided I didn't
> > think any quite fit, and started looking into converting
> > git_fast_filter into a filter-branch-like tool instead of just a
> > library. Found some bugs and missing features in fast-export along the
> > way (and have some patches I still need to send in). But I kind of
> > got stuck -- if the tool is in python, will that limit adoption too
> > much? It'd be kind of nice to have this tool in core git. But I kind
> > of like leaving open the possibility of using it as a tool _or_ as a
> > library, the latter for the special cases where case-specific
> > programmatic filtering is needed. But a developer-convenience library
> > makes almost no sense unless in a higher level language, such as
> > python. I'm still trying to make up my mind about what I want (and
> > what others might want), and have been kind of blocking on that. (If
> > others have opinions, I'm all ears.)
>
> That library sounds like a very interesting idea. Unfortunately, the
> referenced repo seems not to be available anymore:
>     git://gitorious.org/git_fast_filter/mainline.git

Yeah, gitorious went down at a time when I was busy with enough other
things that I never bothered moving my repos to a new hosting site.
Sorry about that.

I've got a copy locally, but I've been editing it heavily, without the
testing I should have in place, so I hesitate to point you at it right
now. (Also, the old version failed to handle things like --no-data
output, which is important.) I'll post an updated copy soon; feel
free to ping me in a week if you haven't heard anything yet.

> I very much like Python. However, more recently I started to
> write Git tools in Perl as they work out of the box on every
> machine with Git installed ... and I think Perl can be quite
> readable if no shortcuts are used :-).

Yeah, when portability matters, perl makes sense.
I thought about switching it over, but I'm not sure I want to rewrite
1-2k lines of code. Especially since repo-filtering tools are kind of
one-shot by nature, and only need to be done by one person of a team,
on one specific machine, and won't affect daily development
thereafter. (Also, since I don't depend on any libraries and use only
stuff from the default python library, it ought to be relatively
portable anyway.)
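
For anyone curious what the middle stage of the "fast-export | filter | fast-import" pipeline quoted above might look like, here is a minimal sketch in Python using only the standard library. It is not git_fast_filter and not Lars's Perl script; the names filter_stream, main and PREFIX are made up for illustration. It assumes the stream comes straight from git fast-export (counted "data <n>" blocks, no inline blobs) and that the paths to drop contain no characters that fast-export would quote. It also does not prune commits that end up empty, and it ignores rename/copy lines, which fast-export only emits when asked to detect them.

```python
import sys

# Hypothetical prefix to purge; 'builtin/' mirrors the example in this thread.
PREFIX = b"builtin/"


def filter_stream(src, dst, prefix):
    """Copy a fast-export stream from src to dst, dropping file changes under prefix.

    src and dst are binary file objects. Only unquoted paths are matched.
    """
    while True:
        line = src.readline()
        if not line:
            break  # end of stream
        if line.startswith(b"data "):
            # 'data <count>' is followed by exactly <count> raw bytes (blob
            # contents or a commit/tag message); copy them without parsing.
            size = int(line[5:].decode("ascii"))
            dst.write(line)
            dst.write(src.read(size))
            continue
        if line.startswith(b"M "):
            # filemodify: 'M <mode> <dataref> <path>'
            parts = line.rstrip(b"\n").split(b" ", 3)
            if len(parts) == 4 and parts[3].startswith(prefix):
                continue  # drop the change
        elif line.startswith(b"D "):
            # filedelete: 'D <path>'
            if line[2:].startswith(prefix):
                continue
        dst.write(line)


def main():
    # Wire the filter between the two git processes via stdin/stdout.
    filter_stream(sys.stdin.buffer, sys.stdout.buffer, PREFIX)
```

Saved as a script with a call to main() at the bottom, this would sit in the pipeline in place of your-fast-filter. Because message and blob bytes are copied by count, a commit message line that happens to start with "M " is never mistaken for a file change, which is the main thing a plain sed over the stream has to be careful about.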