Re: Import/Export as a fast way to purge files from Git?

On Thu, Nov 01 2018, Elijah Newren wrote:

> On Wed, Oct 31, 2018 at 12:16 PM Lars Schneider
> <larsxschneider@xxxxxxxxx> wrote:
>> > On Sep 24, 2018, at 7:24 PM, Elijah Newren <newren@xxxxxxxxx> wrote:
>> > On Sun, Sep 23, 2018 at 6:08 AM Lars Schneider <larsxschneider@xxxxxxxxx> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I recently had to purge files from large Git repos (many files, many commits).
>> >> The usual recommendation is to use `git filter-branch --index-filter` to purge
>> >> files. However, this is *very* slow for large repos (e.g. it takes 45min to
>> >> remove the `builtin` directory from git core). I realized that I can remove
>> >> files *way* faster by exporting the repo, removing the file references,
>> >> and then importing the repo (see Perl script below, it takes ~30sec to remove
>> >> the `builtin` directory from git core). Do you see any problem with this
>> >> approach?
>> >
>> > It looks like others have pointed you at other tools, and you're
>> > already shifting to that route.  But I think it's a useful question to
>> > answer more generally, so for those that are really curious...
>> >
>> >
>> > The basic approach is fine, though if you try to extend it much you
>> > can run into a few possible edge/corner cases (more on that below).
>> > I've been using this basic approach for years and even created a
>> > mini-python library[1] designed specifically to allow people to create
>> > "fast-filters", used as
>> >   git fast-export <options> | your-fast-filter | git fast-import <options>
>> >
>> > But that library didn't really take off; even I have rarely used it,
>> > often opting for filter-branch despite its horrible performance or a
>> > simple fast-export | long-sed-command | fast-import (with some extra
>> > pre-checking to make sure the sed wouldn't unintentionally munge other
>> > data).  BFG is great, as long as you're only interested in removing a
>> > few big items, but otherwise doesn't seem very useful (to be fair,
>> > it's very upfront about only wanting to solve that problem).
>> > Recently, due to continuing questions on filter-branch and folks still
>> > getting confused with it, I looked at existing tools, decided I didn't
>> > think any quite fit, and started looking into converting
>> > git_fast_filter into a filter-branch-like tool instead of just a
>> > library.  Found some bugs and missing features in fast-export along the
>> > way (and have some patches I still need to send in).  But I kind of
>> > got stuck -- if the tool is in python, will that limit adoption too
>> > much?  It'd be kind of nice to have this tool in core git.  But I kind
>> > of like leaving open the possibility of using it as a tool _or_ as a
>> > library, the latter for the special cases where case-specific
>> > programmatic filtering is needed.  But a developer-convenience library
>> > makes almost no sense unless in a higher level language, such as
>> > python.  I'm still trying to make up my mind about what I want (and
>> > what others might want), and have been kind of blocking on that.  (If
>> > others have opinions, I'm all ears.)
>>
>> That library sounds like a very interesting idea. Unfortunately, the
>> referenced repo seems not to be available anymore:
>>     git://gitorious.org/git_fast_filter/mainline.git
>
> Yeah, gitorious went down at a time when I was busy with enough other
> things that I never bothered moving my repos to a new hosting site.
> Sorry about that.
>
> I've got a copy locally, but I've been editing it heavily, without the
> testing I should have in place, so I hesitate to point you at it right
> now.  (Also, the old version failed to handle things like --no-data
> output, which is important.)  I'll post an updated copy soon; feel
> free to ping me in a week if you haven't heard anything yet.
>
>> I very much like Python. However, more recently I started to
>> write Git tools in Perl as they work out of the box on every
>> machine with Git installed ... and I think Perl can be quite
>> readable if no shortcuts are used :-).
>
> Yeah, when portability matters, perl makes sense.  I thought about
> switching it over, but I'm not sure I want to rewrite 1-2k lines of
> code.  Especially since repo-filtering tools are kind of one-shot by
> nature, and only need to be done by one person of a team, on one
> specific machine, and won't affect daily development thereafter.
> (Also, since I don't depend on any libraries and use only stuff from
> the default python library, it ought to be relatively portable
> anyway.)

FWIW I'd be very happy to have this tool itself included in git.git
if/when it's stable / useful enough, and as you point out the language
doesn't really matter as much as what features it exposes.
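
For anyone curious what a "fast-filter" in the pipeline quoted above
might look like, here is a minimal sketch in Python. It is not Lars's
Perl script and not git_fast_filter; the function name drop_path and
the script name below are made up for illustration. It assumes the
stream comes from plain `git fast-export` (exact-byte-count `data`
commands, no `inline` blobs), and it only looks at `M` and `D` lines
with unquoted paths. Renames, copies, and C-style quoted paths pass
through untouched, and commits that end up empty are not pruned.

```python
import sys


def drop_path(stream_in, stream_out, prefix):
    """Copy a fast-export stream, omitting M/D file changes under `prefix`.

    Both streams are binary. `prefix` is a bytes path such as b"builtin".
    """
    prefix = prefix.rstrip(b"/")
    while True:
        line = stream_in.readline()
        if not line:
            break
        if line.startswith(b"data "):
            # "data <count>" is followed by exactly <count> raw bytes
            # (a blob or a commit/tag message). Copy them verbatim so a
            # message line that merely looks like "D path" is never touched.
            size = int(line[5:])
            stream_out.write(line)
            stream_out.write(stream_in.read(size))
            continue
        path = None
        if line.startswith(b"M "):
            # filemodify: M <mode> <dataref> <path>
            parts = line.rstrip(b"\n").split(b" ", 3)
            if len(parts) == 4:
                path = parts[3]
        elif line.startswith(b"D "):
            # filedelete: D <path>
            path = line[2:].rstrip(b"\n")
        if path is not None and (path == prefix or path.startswith(prefix + b"/")):
            continue  # drop this file change from the commit
        stream_out.write(line)


if __name__ == "__main__" and len(sys.argv) > 1:
    drop_path(sys.stdin.buffer, sys.stdout.buffer, sys.argv[1].encode())
```

Saved as, say, drop_path.py (hypothetical name), it would slot into the
same shape of pipeline:

  git fast-export <options> | python3 drop_path.py builtin | git fast-import <options>

Matching on a path prefix plus "/" rather than a bare string prefix
keeps unrelated files such as builtins.txt intact.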


