Re: Make `git fetch --all` parallel?

On Tue, Oct 11, 2016 at 04:18:15PM -0700, Stefan Beller wrote:

> >> At the very least we would need something similar to what Jeff recently sent
> >> for the push case, with objects quarantined and then made available in one go?
> >
> > I don't think so. The object database is perfectly happy with multiple
> > simultaneous writers, and nothing impacts the have/wants until actual
> > refs are written. Quarantining objects before the refs are written is an
> > orthogonal concept.
> 
> If a remote advertises its tips, we'd need to look these up (clientside) to
> decide if we have them, and I do not think we'd do that via a reachability
> check, but via a direct lookup in the object database? So I do not quite
> understand what we gain from the atomic ref writes in e.g. remote/origin/.

It's been a while since I've dug into the fetch protocol. But I think we
cover the "do we have the objects already" check via quickfetch(), which
does do a reachability check. And then we advertise our "have" commits
by walking backwards from our ref tips, so everything there is
reachable.
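
Roughly speaking, that quickfetch() test boils down to a rev-list
connectivity check over the advertised tips; something along these lines
(a sketch, not the literal internal invocation, with <tip> standing in
for a sha1 the remote advertised):

  # exits 0 iff <tip> and everything reachable from it (that we do not
  # already reach from our own refs) is present in the object database
  $ git rev-list --objects --quiet <tip> --not --all
  $ echo $?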

Anything else would be questionable, especially under older versions of
git, as we promise only to have a complete graph for objects reachable
from the refs. Older versions of git would happily truncate unreachable
history based on the 2-week prune expiration period.
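
For what it's worth, that two-week window is just the gc.pruneExpire
default; something like this shows it in action:

  # unset means the default of "2.weeks.ago"
  $ git config gc.pruneExpire
  # list what a prune at that expiry would delete, without deleting it
  $ git prune --expire=2.weeks.ago --dry-run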

> > I'm not altogether convinced that parallel fetch would be that much
> > faster, though.
> 
> Ok, time to present data... Let's assume a degenerate case first:
> "up-to-date with all remotes" because that is easy to reproduce.
> 
> I have 14 remotes currently:
> 
> $ time git fetch --all
> real 0m18.016s
> user 0m2.027s
> sys 0m1.235s
> 
> $ time git config --get-regexp remote.*.url | awk '{print $2}' | xargs -P 14 -I % git fetch %
> real 0m5.168s
> user 0m2.312s
> sys 0m1.167s

So first, thank you (and Ævar) for providing real numbers. It's clear
that I was talking nonsense.

Second, I wonder where all that time is going. Clearly there's an
end-to-end latency issue, but I'm not sure where it is. Is it startup
time for git-fetch? Is it in getting and processing the ref
advertisement from the other side? What I'm wondering is if there are
opportunities to speed up the serial case (but nobody really cared
before because it doesn't matter unless you're doing 14 of them back to
back).
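
If anybody wants to dig in, git's own tracing gives a cheap first answer
("origin" here is just an example remote):

  # wall-clock timings for each git command involved (output is on stderr)
  $ GIT_TRACE_PERFORMANCE=1 git fetch origin 2>&1 | grep performance
  # and the ref advertisement / negotiation as it goes over the wire
  $ GIT_TRACE_PACKET=1 git fetch origin 2>&1 | grep packet | head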

> > I usually just do a one-off fetch of their URL in such a case, exactly
> > because I _don't_ want to end up with a bunch of remotes. You can also
> > mark them with skipDefaultUpdate if you only care about them
> > occasionally (so you can "git fetch sbeller" when you care about it, but
> > it doesn't slow down your daily "git fetch").
> 
> And I assume you don't want the remotes because it takes time to fetch and not
> because your disk space is expensive. ;)

That, and it clogs the ref namespace. You can mostly ignore the extra
refs, but they show up in the "git checkout ..." DWIM, for example.
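
For reference, the skipDefaultUpdate setup I mentioned above is just
something like this (with "sbeller" standing in for whatever the remote
is called):

  # leave it out of a plain "git fetch --all" / "git remote update"
  $ git config remote.sbeller.skipDefaultUpdate true
  # and fetch it only on demand
  $ git fetch sbeller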

-Peff


