Re: multiple working directories for long-running builds (was: "git merge" merges too much!)

At Wed, 2 Dec 2009 00:18:30 +0300, Dmitry Potapov <dpotapov@xxxxxxxxx> wrote:
> 
> AFAIK, "git archive" is cheaper than git clone.

It depends on what you mean by "cheaper".  It's clearly going to require
less disk space.  However it's also clearly going to require more disk
bandwidth, potentially a _LOT_ more disk bandwidth.

> I do not say it is fast
> for huge project, but if you want to run a process such as clean build
> and test that takes a long time anyway, it does not add much to the
> total time.

I think you need to try throwing around an archive of, say, 50,000 small
files a few times simultaneously on your system to appreciate the issue.

(i.e. consider the load on a storage subsystem, say a SAN or NAS, where
with your suggestion there might be a dozen or more developers running
"git archive" frequently enough that even three or four might be doing
it at the same time, and this on top of all the i/o bandwidth required
for the builds all of the other developers are also running at the same
time.)


> > Disk bandwidth is almost always more expensive than disk space.
> 
> Disk bandwidth is certainly more expensive than disk space, and the
> whole point was to avoid a lot of disk bandwidth by using hot cache.

Huh?  Throwing around the archive has nothing to do with the build
system in this case.

Please let me worry about optimizing the builds -- that's well under
control already and is not an issue for the VCS, at least not yet, and
maybe never in many cases.

I'm just not willing to even consider using what would really be the
most simplistic and most expensive form of updating a working directory
as could ever be imagined.  "Git archive" is truly unintelligent, as-is.

Perhaps if "git archive" could talk intelligently to an rsync process
and be smart about updating an existing working directory it would be
the ideal answer, but _NEVER_ with the current method of just unpacking
an archive over an existing directory!  (Now there's a good Google SoC,
or masters, project for someone eager to learn about rsync & git
internals!)
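
As a rough sketch of the kind of "smart" update I mean, here is a small
Python function that applies a tar archive (such as "git archive" emits)
to an existing directory but rewrites only the files whose contents
differ.  The function name sync_tar_into_dir is hypothetical and nothing
like it exists in git; it handles regular files only, ignores modes and
symlinks, and never deletes anything.  It also still reads the whole
archive and every existing file, so it saves writes and preserves
timestamps rather than saving read bandwidth.

```python
import io
import os
import tarfile


def sync_tar_into_dir(tar_bytes, dest):
    """Apply a tar archive to dest, rewriting only files that differ.

    Returns the list of relative paths that were actually written.
    Hypothetical helper: regular files only, no deletions, no modes.
    """
    written = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, and so on
            rel = os.path.normpath(member.name)
            if os.path.isabs(rel) or rel == ".." or rel.startswith(".." + os.sep):
                continue  # refuse paths that would escape dest
            data = tar.extractfile(member).read()
            target = os.path.join(dest, rel)
            if os.path.isfile(target):
                with open(target, "rb") as fh:
                    if fh.read() == data:
                        continue  # unchanged: no write, mtime preserved
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as fh:
                fh.write(data)
            written.append(rel)
    return written
```

Leaving unchanged files untouched is what matters for the build: their
timestamps stay put, so a make-style build sees only the files that
really changed.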

Local filesystem "git clone" is usable in many scenarios, but it just
won't work nearly so efficiently in a scenario where users have local
repos on their workstations and use an NFS NAS to feed the build
servers.  As I understand it this 'git-new-workdir' script will work
though since it uses symlinks that can be pointed across the mount back
to the local disk on the user's workstation.  They can just mount the
build directory and go into it and run a "git checkout" and start
another build on the build server(s).
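
To make the symlink idea concrete, here is a minimal Python sketch of
what I understand such a script to do.  The function name new_workdir
and the list of shared entries are my assumptions, modelled on the
contrib script from memory; this is not a copy of it.  The new
directory gets its own .git containing symlinks back to the original
repository's objects, refs and so on, plus a private copy of HEAD so it
can sit on a different branch.

```python
import os
import shutil

# Entries shared with the original repository via symlinks.
# Assumption: this list is modelled on the contrib script from memory.
SHARED = ["config", "refs", "logs/refs", "objects", "info", "hooks",
          "packed-refs", "remotes", "rr-cache", "svn"]


def new_workdir(orig_repo, new_dir):
    """Create new_dir/.git whose contents point back into orig_repo/.git.

    Returns the list of entries that were symlinked.  Fails if
    new_dir/.git already exists.
    """
    orig_git = os.path.abspath(os.path.join(orig_repo, ".git"))
    new_git = os.path.join(new_dir, ".git")
    os.makedirs(os.path.join(new_git, "logs"))  # raises if already present
    linked = []
    for entry in SHARED:
        src = os.path.join(orig_git, entry)
        if not os.path.exists(src):
            continue  # sketch choice: link only what the repo actually has
        os.symlink(src, os.path.join(new_git, entry))
        linked.append(entry)
    # HEAD is copied, not linked, so each workdir has its own checkout.
    shutil.copy(os.path.join(orig_git, "HEAD"), os.path.join(new_git, "HEAD"))
    return linked
```

After that a "git checkout -f" inside the new directory populates the
files.  Since the links are absolute paths, the original repository has
to be reachable under the same path on whichever machine mounts the
build directory.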

A major further advantage of multiple working directories is that this
eliminates one more point of failure -- i.e. you don't end up with
multiple copies of the repo that _should_ be effectively read-only for
everything but "push", and perhaps then only to one branch.  I don't
like giving developers too much rope, especially in all the wrong
places.  "git archive" does achieve the same even better I suppose, but
without something like a "--format=rsync" option it's completely out of
the question.


> Another thing to consider is that if you put a really huge project in one
> Git repo then Git may not be as fast as you may want, because Git tracks
> the whole project as a whole. So, you may want to split your project in
> a few relatively independent modules (See git submodule).

Indeed -- but sometimes I think this is not feasible either.

I know of at least three very real-world projects where there are tens
of thousands of small files that really must be managed as one unit, and
where running a build in that tree could take a whole day or two on even
the fastest currently available dedicated build server.  Eg. pkgsrc.

-- 
						Greg A. Woods
						Planix, Inc.

<woods@xxxxxxxxxx>       +1 416 218 0099        http://www.planix.com/
