Re: multiple working directories for long-running builds (was: "git merge" merges too much!)

At Wed, 2 Dec 2009 03:10:21 +0300, Dmitry Potapov <dpotapov@xxxxxxxxx> wrote:
Subject: Re: multiple working directories for long-running builds (was: "git merge" merges too much!)
> 
> My point was that I do not see why you believe "git archive" is more
> expensive than "git clone". According to Jeff Epler's numbers,
> "git archive" is 20% faster than "git clone"...

Really!?!?!?  You don't see it?  Why is this so hard to understand?
Sorry for my incredulity, but I thought this issue was obvious.

The slightly more expensive "git clone" happens only _ONCE_.  After that
you just run "git fetch" followed by "git reset --hard" (or a plain "git
pull" for a fast-forward update -- see the sketch below), and either way
it's a heck of a lot less I/O and CPU than "git archive".

And of course you skip even the one-time "git clone" operation if you
use the even faster and simpler git-new-workdir script.
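
For the record, the whole update cycle looks something like this
(repository URL, branch, and path names here are just examples):

    # one-time setup:
    git clone git://example.com/project.git build-dir

    # every update thereafter:
    cd build-dir && git fetch origin && git reset --hard origin/master

or, skipping even the one-time clone with the contrib script:

    git-new-workdir /path/to/project build-dir master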

"git archive" has to be run _EVERY_ time you need to update a working
directory and it currently has no choice but to toss every bit of the
whole working directory, up from the filesystem, across a pipe, and back
down to the filesystem.  It literally couldn't be more expensive!
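
In other words, with "git archive" every single update amounts to
something like this (branch and directory names are just examples):

    rm -rf build-dir && mkdir build-dir
    git archive master | tar -xf - -C build-dir

The entire tree gets streamed and rewritten every time, no matter how
little of it actually changed.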

Sure, no matter how you do it, updating the working directory might not
always be the biggest part of the operation, but it's insane to use the
most expensive mechanism available when there are far cheaper
alternatives.

BTW, there cannot, and MUST NOT, be any integrity advantage to using
"git archive" over using multiple working directories.  "git archive
branch" must, by definition, produce exactly the same result as if you
did "git checkout branch; rm -rf .git" (modulo any export-ignore or
export-subst attributes), or else it is buggy.

Note also that the build directories created with git-new-workdir can be
treated as read-only, and perhaps even forced to be read-only by mount
options, or maybe just by a corporate policy directive.  (In all the
projects I work on, the source tree can be read-only -- product files
are always generated elsewhere.)
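
On Linux, for instance, a read-only bind mount is one way to enforce
that (paths are just examples; updates still go through the writable
original path):

    mount --bind /srv/workdirs/project /build/project
    mount -o remount,ro,bind /build/project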


> Multiple copies of the same repo are never a problem (except taking
> some disk space).

Exactly -- gigabytes of disk space per copy in the cases I'm concerned
about (i.e. where hard links are impossible).  I've heard that at least
one very large project has an 8GB repository currently.  Three of the
large projects I work on now are about a gigabyte per copy, and that's
just what's under .git, not counting the working directory.  I can't
even manage a "git clone" of one of them over HTTP without raising my
default process limits, it is so big and uses so much memory.

I guess one could skip the initial more-expensive "git clone" operation
by copying the repo with low-level bit-moving commands, like "cp -r" or
whatever, and then tweaking the result to make it appear as if it had
been cloned, but even that means moving gigabytes of data unnecessarily
across what is likely to be a network connection of some sort.
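
If anyone wants to try it, the dumb-copy approach would be something
along these lines (paths and URL are just examples, and untested):

    mkdir -p /local/project
    cp -a /mnt/server/project/.git /local/project/.git  # -a keeps modes/links
    cd /local/project
    git config remote.origin.url git://example.com/project.git
    git checkout -f master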

Are you fighting against git-new-workdir, or the concept of multiple
working directories?


> > A major further advantage of multiple working directories is that this
> > eliminates one more point of failure -- i.e. you don't end up with
> > multiple copies of the repo that _should_ be effectively read-only for
> > everything but "push", and perhaps then only to one branch.
> 
> I really do not understand why you say that some copies
> should be effectively read-only... You can start to work on some feature
> at one place (using one repo) and then continue in another place using
> another repo. (Obviously, it will require fetching changes from the
> first repo before you will be able to continue, but that is just one
> command). In other words, I really do not understand what you are
> talking about here.

Developers, especially more junior ones, work on code, and they (are
supposed to) spend almost all of their intellectual energy on the
problems of creating and modifying code -- they are not expected to be
integration engineers, nor are they expected to be VCS and SCM experts.

The more steps you make them follow, and the more places you allow them
to store changes, the more mistakes they will make.

Besides, in some scenarios build directories will be checked out from
integration branches which shouldn't have any direct commits made to
them, especially not to fix a problem in a build.


BTW, pkgsrc has well over 50,000 files, and FreeBSD ports has over
100,000.  Neither can really be split up in any rational way.

-- 
						Greg A. Woods
						Planix, Inc.

<woods@xxxxxxxxxx>       +1 416 218 0099        http://www.planix.com/


