At Wed, 2 Dec 2009 00:18:30 +0300, Dmitry Potapov <dpotapov@xxxxxxxxx> wrote:
Subject: Re: multiple working directories for long-running builds (was: "git merge" merges too much!)
>
> AFAIK, "git archive" is cheaper than git clone.

It depends on what you mean by "cheaper".  It will clearly require less
disk space.  However, it will just as clearly require more disk
bandwidth -- potentially a _LOT_ more disk bandwidth.

> I do not say it is fast for huge project, but if you want to run a
> process such as clean build and test that takes a long time anyway, it
> does not add much to the total time.

I think you need to try throwing around an archive of, say, 50,000
small files a few times simultaneously on your system to appreciate the
issue.

(I.e. consider the load on a storage subsystem, say a SAN or NAS, where
with your suggestion there might be a dozen or more developers running
"git archive" frequently enough that even three or four might be doing
it at the same time, and all of this on top of the I/O bandwidth
required for the builds all the other developers are running at the
same time.)

> > Disk bandwidth is almost always more expensive than disk space.
>
> Disk bandwidth is certainly more expensive than disk space, and the
> whole point was to avoid a lot of disk bandwidth by using hot cache.

Huh?  Throwing around the archive has nothing to do with the build
system in this case.  Please let me worry about optimizing the builds
-- that's well under control already, and it's not really an issue for
the VCS yet, and maybe never will be in many cases.

I'm just not willing to even consider using what would really be the
most simplistic and most expensive form of updating a working directory
that could ever be imagined.  "git archive" is truly unintelligent
as-is.  Perhaps if "git archive" could talk intelligently to an rsync
process and be smart about updating an existing working directory, it
would be the ideal answer -- but _NEVER_ with the current method of
just unpacking an archive over an existing directory!  (Now there's a
good Google SoC, or masters, project for someone eager to learn about
rsync & git internals!)

Local filesystem "git clone" is usable in many scenarios, but it just
won't work nearly as efficiently in a scenario where users have local
repos on their workstations and use an NFS NAS to feed the build
servers.  As I understand it this 'git-new-workdir' script will work
there though, since it uses symlinks that can be pointed across the
mount back to the local disk on the user's workstation.  Users can just
mount the build directory, go into it, run a "git checkout", and start
another build on the build server(s).  (A rough sketch of that workflow
is below.)

A further major advantage of multiple working directories is that this
eliminates one more point of failure -- i.e. you don't end up with
multiple copies of the repo that _should_ be effectively read-only for
everything but "push", and perhaps then only to one branch.  I don't
like giving developers too much rope, especially in all the wrong
places.
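To make that a bit more concrete, here is roughly what I have in mind,
assuming the stock contrib/workdir/git-new-workdir script; the paths,
mount points, and branch name are purely hypothetical:

    # On the developer's workstation:
    #   ~/work/bigproject    -- the real repo, on local disk
    #   /export/build        -- an area the build servers also mount
    #
    # git-new-workdir populates the new .git with symlinks back into
    # ~/work/bigproject/.git, so that path has to resolve to the same
    # place on the build servers too (e.g. via an identical NFS mount).
    git-new-workdir ~/work/bigproject /export/build/bigproject-wd topic

    # Later, on a build server, in the mounted copy of that directory:
    cd /mnt/build/bigproject-wd
    git checkout -f topic               # refresh the tree in place
    nohup make -j8 > build.log 2>&1 &   # kick off the long build

The point is that the only git data living on the NAS is the checked
out tree plus a small per-workdir .git (mostly symlinks back to the
real repo, plus its own HEAD and index) -- the object store stays in
exactly one place.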
"git archive" does achieve the same thing, even better I suppose, but
without something like a "--format=rsync" option it's completely out of
the question.  (The closest approximation I can think of with today's
tools is sketched at the end of this message.)

> Another thing to consider is that if you put a really huge project in one
> Git repo than Git may not be as fast as you may want, because Git tracks
> the whole project as the whole. So, you may want to split your project in
> a few relatively independent modules (See git submodule).

Indeed -- but sometimes I think this is not feasible either.  I know of
at least three very real-world projects where there are tens of
thousands of small files that really must be managed as one unit, and
where running a build in that tree can take a whole day or two on even
the fastest currently available dedicated build server.  E.g. pkgsrc.
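For what it's worth, the poor-man's version of that hypothetical
"--format=rsync" that I can imagine today would be something along
these lines (paths and branch name are, again, only examples).  It
still pays for a full extraction into a scratch tree on every run, so
it only spares the destination tree from being rewritten wholesale, but
it shows the shape of the idea:

    # Extract the desired tree into a throw-away scratch directory...
    scratch=$(mktemp -d /tmp/archive.XXXXXX)
    git archive topic | tar -xf - -C "$scratch"

    # ...then let rsync update the real build tree in place, touching
    # only files whose contents actually changed and deleting the ones
    # that went away.  --checksum matters here because "git archive"
    # stamps every file with the commit's timestamp, so a plain
    # size+mtime comparison would think everything had changed.
    rsync -a --delete --checksum "$scratch"/ /export/build/bigproject-wd/
    rm -rf "$scratch"

A real "--format=rsync" would presumably avoid the scratch tree
entirely and talk to rsync straight out of the object store, which is
exactly the part that would make it an interesting project.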
-- 
Greg A. Woods
Planix, Inc.  <woods@xxxxxxxxxx>  +1 416 218 0099  http://www.planix.com/