Re: fast-import slowness when importing large files with small differences

On Fri, Jun 29 2018, Mike Hommey wrote:

> On Sat, Jun 30, 2018 at 12:10:24AM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Fri, Jun 29 2018, Mike Hommey wrote:
>>
>> > I noticed some slowness when fast-importing data from the Firefox mercurial
>> > repository, where fast-import spends more than 5 minutes importing ~2000
>> > revisions of one particular file. I reduced a testcase while still
>> > using real data. One could synthesize data with kind of the same
>> > properties, but I figured real data could be useful.
>> >
>> > To reproduce:
>> > $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
>> > $ git init bar
>> > $ cd bar
>> > $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
>> >
>> > [...]
>> > So maybe it would make sense to consolidate the diff code (after all,
>> > diff-delta.c is an old specialized fork of xdiff). With manual trimming
>> > of common head and tail, this gets down to 3:33.
>> >
>> > I'll also note that Facebook has imported xdiff from the git code base
>> > into mercurial and improved performance on it, so it might also be worth
>> > looking at what's worth taking from there.
>>
>> It would be interesting to see how this compares with a more naïve
>> approach of committing every version of this file one-at-a-time into a
>> new repository (with & without gc.auto=0). Perhaps deltaing as we go is
>> suboptimal compared to just writing out a lot of redundant data and
>> repacking it all at once later.
>
> "Just" writing 26GB? And that's only one file. If I were to do that for
> the whole repository, it would yield a > 100GB pack. Instead of < 2GB
> currently.

To clarify my terse response: I meant trying this on an isolated test
case, to see to what extent the problem you're describing is unique to
fast-import, and to what extent it's also encountered during "normal" git
use when you commit all the revisions of that file in succession.

Perhaps the difference between the two would give some hint as to how to
proceed, or not.
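
A minimal sketch of what such an isolated test could look like, under
stated assumptions: the helper names (git, make_revisions,
commit_revisions) are hypothetical, the synthetic revisions are only a
stand-in for the real file history, and the single "git repack -a -d -f"
at the end is one possible reading of "repacking it all at once later".
It is not a measurement and not a tuned benchmark.

```python
import os
import subprocess
import tempfile
import time


def git(repo, *args):
    """Run one git command inside repo, raising if it fails."""
    subprocess.run(["git", "-C", repo, *args], check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


def make_revisions(base, count):
    """Hypothetical stand-in data: each revision appends one short line."""
    revisions = []
    data = base
    for i in range(count):
        data += b"change %d\n" % i
        revisions.append(data)
    return revisions


def commit_revisions(revisions, disable_auto_gc, depth=2000):
    """Commit every revision in succession, then repack once.

    Returns (repo_path, seconds_committing, seconds_repacking).
    """
    repo = tempfile.mkdtemp(prefix="naive-import-")
    subprocess.run(["git", "init", "-q", repo], check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if disable_auto_gc:
        # The "with gc.auto=0" variant: no automatic gc between commits.
        git(repo, "config", "gc.auto", "0")

    start = time.monotonic()
    for i, data in enumerate(revisions):
        with open(os.path.join(repo, "file"), "wb") as f:
            f.write(data)
        git(repo, "add", "file")
        # Identity is passed inline so the sketch does not depend on the
        # user's git configuration.
        git(repo, "-c", "user.name=test", "-c", "user.email=test@example.com",
            "commit", "-q", "--allow-empty", "-m", "revision %d" % i)
    commit_seconds = time.monotonic() - start

    start = time.monotonic()
    # Delta everything in one go, with the same depth as the fast-import run.
    git(repo, "repack", "-a", "-d", "-f", "--depth=%d" % depth)
    repack_seconds = time.monotonic() - start
    return repo, commit_seconds, repack_seconds
```

Calling commit_revisions() twice on the same list of revisions, once with
disable_auto_gc=True and once with False, and comparing both timings with
the fast-import run would give the three data points in question. For the
real case the revisions would come from the file's actual history rather
than from make_revisions().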


