Re: help moving boost.org to git

Avery Pennarun <apenwarr@xxxxxxxxx> · Mon, 5 Jul 2010 19:32:17 -0400

(note: on this mailing list, you shouldn't drop names from the cc:
line when replying to a thread)

On Mon, Jul 5, 2010 at 7:11 PM, Eric Niebler <eric@xxxxxxxxxxxx> wrote:
> On 7/5/2010 6:04 PM, Finn Arne Gangstad wrote:
>> This
>> should fit easily into a single repository. The Linux kernel is much
>> larger, and that is sort of the canonical single repo git project. I
>> _strongly_ recommend that you go for a single repo if you can make it
>> work.
>
> It does fit into one repo, but that doesn't meet our needs for the
> future. Users want to install and build library X and its dependencies,
> not all of boost. This is increasingly becoming a problem as boost
> grows. Imagine if a perl programmer had to download all of CPAN to use
> or hack on any one perl module. Or if contributing to CPAN meant getting
> the whole shebang, history and all. I'm sure even in the Linux kernel,
> not *every* third-party driver is maintained in the master git repo.

Actually, that's mostly not true; there are a few third-party drivers
that don't make it into the core Linux repo, but that's mostly because
they haven't been accepted by the kernel maintainers for whatever
reason (often quality or duplication, I guess).  The goal for the vast
majority of Linux drivers is indeed to get merged into the Linux core.

...and it works pretty well, all things considered.  It's certainly
not the only way to do it for every project, but it's actually a
pretty good way.  The kernel repo history runs to hundreds of megs
nowadays, but on a modern Internet connection that's not a big deal.
And then you never have to worry about downloading more modules later.
You also never have versioning problems, since everything in a
checkout is at the same revision.

> We are aiming to make boost a clearing-house for C++ libraries (like
> CPAN, or PyPI for Python), turning the official boost distribution into
> little more than a well-tested collection of the libraries that have
> passed our peer-review and regression test process.

Of course you will want to have some kind of really excellent
versioned dependency fetching system (exactly like CPAN or PyPI or
RubyGems) if you want this to be nice.  git's submodules stuff is
almost certainly not going to add any features you need/want.  On the
other hand, cloning a separate git repo is pretty easy to write your
CPAN-like script around.
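
As a rough sketch of how small such a script can start out: the
dependency file format, the module names, and the base URL below are
all invented for illustration, not anything boost or git provides.  It
walks the dependency list depth-first and clones whatever isn't
checked out yet.

```sh
#!/bin/sh
# Hypothetical deps file, one module per line:
#   spirit: fusion mpl
#   fusion: mpl
#   mpl:

# deps_of <module> <deps-file>: print the direct dependencies of a module.
deps_of() {
    awk -F': *' -v m="$1" '$1 == m { print $2 }' "$2"
}

# resolve <module> <deps-file>: print the module and everything it needs,
# dependencies first, each name once.  SEEN must be empty before the
# first call; it also stops the walk from looping on a cycle.
resolve() {
    case " $SEEN " in *" $1 "*) return 0 ;; esac
    SEEN="$SEEN $1"
    for dep in $(deps_of "$1" "$2"); do
        resolve "$dep" "$2"
    done
    echo "$1"
}

# fetch <module> <deps-file> <base-url>: clone each needed module that
# is not already present in the current directory.
fetch() {
    SEEN=""
    for m in $(resolve "$1" "$2"); do
        [ -d "$m" ] || git clone "$3/$m.git" "$m" || return 1
    done
}
```

With the example file above, "fetch spirit deps.txt <base-url>" would
clone mpl, then fusion, then spirit.  Version pinning is left out; a
real tool would record a tag or commit per dependency and check it out
after cloning.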

> In fact, the modularization has already been done, and work is well
> underway on the infrastructure to support dependency tracking. But the
> modularization is not history-preserving and needs to be redone.

If your code doesn't move too many files around, then splitting out
the history is pretty easy with git-subtree (a tool I wrote that's not
part of git):

   git subtree split --prefix=path/to/subdir

And you get a new history for just that subdir.  That might do exactly
what you want.  It also works iteratively, so you can export your
history from svn, then re-export the changes as they occur over time.
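
For several libraries at once, something like the sketch below might
be a starting point.  The libs/<name> layout and the split/<name>
branch names are assumptions made up for the example; only the
--prefix and --branch options come from git-subtree itself.

```sh
#!/bin/sh
GIT=${GIT:-git}   # set GIT=echo to print the commands instead of running them

# split_all <lib>...: give each library its own branch holding only the
# history of libs/<lib>.  Run from the top of the combined repository.
split_all() {
    for lib in "$@"; do
        $GIT subtree split --prefix="libs/$lib" --branch="split/$lib" || return 1
    done
}
```

Splitting the same history again yields the same commit ids, so
running this again after new commits arrive should just move each
split/<name> branch forward.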

>>> So, what are the options? Can I somehow delete from each repository the
>>> history that is irrelevant? Is there some feature of git I don't know
>>> about that can solve this problem for us?
>>
>> How do you define "irrelevant"? Do you only require enough history for
>> git annotate/blame to give correct results?  Or does this only refer
>> to multiple repositories sharing the same ancient history?
>
> If multiple repositories share the same ancient history, wouldn't that
> give git annotate/blame enough information? Sorry, git newbie here.

Yes, it would.  But how much of the ancient history do you want?  If
you want all of it, you don't save any space in your repo.

> The plan is to move to git. However, we don't expect this to happen
> overnight, so a way to continue to pull changes from a svn mirror while
> the new git repositories are being set up would be ideal.

This isn't too hard to do; you just need some scripts around git-svn
and git-subtree (or whatever tool you use to do the splitting).  We've
done this at work for a couple of years now and it's working fine.
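
One pass of such a mirror script could look roughly like this.  It is
a sketch, not the script we run: the libs/<name> layout, the branch
names, and the per-library repository URLs are all hypothetical, and
it assumes the current branch is a git-svn checkout that nobody
commits to directly.

```sh
#!/bin/sh
GIT=${GIT:-git}   # set GIT=echo for a dry run that only prints the commands

# sync_all <base-url> <lib>...: pull new svn revisions, re-split each
# library, and push the result to <base-url>/<lib>.git.
sync_all() {
    base=$1
    shift
    # Bring the mirror branch up to date with svn.
    $GIT svn rebase || return 1
    for lib in "$@"; do
        # Commits exported on an earlier pass come out with the same ids.
        $GIT subtree split --prefix="libs/$lib" --branch="split/$lib" || return 1
        # Publish the split history to the per-library repository.
        $GIT push "$base/$lib.git" "split/$lib:master" || return 1
    done
}
```

Run from cron or a post-commit hook on the svn side, this keeps the
per-library repos following svn in one direction only.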

The confusing part is taking *submissions* back through both channels.
If you value your sanity, you probably want to only allow submissions
back via svn while you're running the two in parallel; but that makes
git's added features a lot less useful, so you probably want to run in
parallel for only a short time.

Have fun,

Avery