Re: clarify git clone --local --shared --reference

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> · Wed, 6 Jun 2007 01:11:11 -0400

Brandon Casey <casey@xxxxxxxxxxxxxxx> wrote:
> Shawn O. Pearce wrote:
> >
> >  b) Don't repack the source repository without accounting for the
> >  refs and reflogs of all --shared repositories that came from it.
> >  Otherwise you may delete objects that the source repository no
> >  longer needs, but that one or more of the --shared repositories
> >  still needs.
> 
> How should this be accomplished? Does this mean never run 
> git-gc/git-repack on the source repository? Or is there a way to
> cause the sharing repositories to copy over objects no longer
> required by the source repository?

Well, you can repack, but only if you account for everything.
The easiest way to do this is push every branch from the --shared
repos to the source repository, repack the source repository, then
you can run `git prune-packed` in the --shared repos to remove
loose objects that the source repository now has.
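
A minimal sketch of that push, repack, prune-packed sequence follows.
The repository names (source, share1), the throwaway identity, and the
refs/heads/shared/share1/* namespace are assumptions made for the
illustration, not an established convention.

```shell
#!/bin/sh
# Sketch only: "source" and "share1" are hypothetical repositories built
# in a scratch directory, and refs/heads/shared/share1/* is an assumed
# namespace used to keep share1's branches apart from source's own.
set -e
work=$(mktemp -d)
cd "$work"
ident="-c user.name=Example -c user.email=example@example.invalid"

# Build a source repository and a --shared clone with its own commit.
git init -q source
(cd source && echo one >file && git add file &&
    git $ident commit -q -m "initial")
git clone -q --shared source share1
(cd share1 && echo two >>file &&
    git $ident commit -q -a -m "work done in share1")

# 1. Push every branch of the --shared repo into the source, so the
#    source's refs now cover everything share1 needs.
(cd share1 && git push -q origin 'refs/heads/*:refs/heads/shared/share1/*')

# 2. Repack the source; all refs, including the pushed ones, are packed.
(cd source && git repack -a -d -q)

# 3. Drop loose objects in share1 that are now available from a pack.
(cd share1 && git prune-packed)

# Check that share1 is still complete after the cleanup.
(cd share1 && git fsck --full >/dev/null) && share1_status=intact
pushed_refs=$(git -C source for-each-ref refs/heads/shared/share1 | wc -l)
echo "share1 is $share1_status"
```

Pushing into a separate namespace avoids overwriting the source's own
branches.  Note that this only covers refs; objects reachable solely
from a --shared repo's reflog are not protected by it.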

You can account for the refs by hand when you run pack-objects
by hand, but it's horribly difficult compared to the push and then
repack I just described.  I think that long-lived --shared isn't that
common of a workflow; most people use --shared for short-term things.
For example contrib/continuous uses --shared when it clones the
repository to create a temporary build area.

> >>4) is space savings obtained only at initial clone? or is it on going?
> >>   does a future git pull from the source repository create new hard
> >>   links where possible?
> >
> >Only on initial clone.  Later pulls will copy.  You can try using
> >git-relink to redo the hardlinks after the pull.
> 
> How about with --shared? Particularly with a fast-forward not much
> would need to be copied over. Do later pulls into a repository with
> configured objects/info/alternates take advantage of space savings
> when possible?

Yes.  Recent Git versions avoid copying objects into a --shared
repository if at all possible.  This makes fetches from the source
repository into the --shared repository very, very fast, and uses no
additional disk.

> If the answer above is "yes", then this brings up an interesting use 
> case. I assume that clone, fetch, etc follow the alternates of the 
> source repository? Otherwise a --shared repository would be unclone-able 
> right? And only pull-able from the source repository? So if that is the 
> case (that remote alternates are followed),

Alternates are followed as many as 5 deep.  So you can do something like
this:

	git clone --shared source share1
	git clone --shared share1 share2
	git clone --shared share2 share3
	git clone --shared share3 share4
	git clone --shared share4 share5
	git clone --shared share5 corrupt

I think corrupt is corrupt; it doesn't have access to the source anymore
and therefore is missing 90%+ of the object database.  To help make this
case work the objects/info/alternates should always contain absolute paths;
we store them absolute in git-clone by default but you could set them up
by hand.  The other repositories should however be intact and usable, but
you cannot clone from share5.
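
To see what git-clone recorded, you can read the alternates file
directly.  This is a small sketch against scratch repositories; the
names source and share1 and the throwaway identity are assumptions for
the example, and it assumes a non-bare clone on a UNIX-like system.

```shell
#!/bin/sh
# Sketch only: "source" and "share1" are hypothetical repositories
# created in a scratch directory.
set -e
work=$(mktemp -d)
cd "$work"

git init -q source
(cd source && echo one >file && git add file &&
    git -c user.name=Example -c user.email=example@example.invalid \
        commit -q -m "initial")

git clone -q --shared source share1

# The --shared clone records where it borrows objects from, one object
# directory per line.
alt=$(cat share1/.git/objects/info/alternates)
echo "share1 borrows objects from: $alt"
```

If the path printed here were relative, moving either repository would
silently break the link, which is why absolute paths are the safer
choice when editing the file by hand.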

Normal fetch/push/pull will work fine against any of those working
repos, as they are all using the normal Git object transport methods,
which means we copy objects unless they are available to us already
(see above).

> then a group of developers 
> could add all of the other developers to their alternates list (if 
> multiple alternates are supported)

Yes, they are.  I don't think we have a limit on the number of
alternates you are allowed to have.  However each additional
alternate adds some cost to starting up any given Git process.
The more alternates you have (or the more deeply nested they are)
the slower Git will initialize itself.  For 1 or 2 alternates it's
within the fork+exec noise of any good UNIX system; for 50 alternates
I think you would notice it.
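
As a hedged illustration of more than one alternate, the sketch below
lists two object directories in a single alternates file.  The
repository names (alice, bob, consumer) and the identity are made up
for the example.

```shell
#!/bin/sh
# Sketch only: "alice", "bob" and "consumer" are hypothetical
# repositories created in a scratch directory.
set -e
work=$(mktemp -d)
cd "$work"
ident="-c user.name=Example -c user.email=example@example.invalid"

# Two independent repositories, each with one commit of its own.
for name in alice bob; do
    git init -q "$name"
    (cd "$name" && echo "$name" >file && git add file &&
        git $ident commit -q -m "from $name")
done

# An empty repository that borrows from both: one absolute path to an
# object directory per line.
git init -q consumer
mkdir -p consumer/.git/objects/info
printf '%s\n' "$work/alice/.git/objects" "$work/bob/.git/objects" \
    >consumer/.git/objects/info/alternates

# consumer can now read objects that only exist in alice or bob.
alice_tip=$(git -C alice rev-parse HEAD)
bob_tip=$(git -C bob rev-parse HEAD)
alice_type=$(git -C consumer cat-file -t "$alice_tip")
bob_type=$(git -C consumer cat-file -t "$bob_tip")
echo "alice tip is a $alice_type, bob tip is a $bob_type"
```

Every Git process started in consumer has to open each listed
directory, which is where the startup cost mentioned above comes from.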

> and reference their objects when 
> possible. To the extent that it is possible, each developer would end up 
> only storing their commit objects. This would then create a distributed 
> repository.

Yes, but that has very high risk.  If developer Joe Smith quits and
then the administrator runs `rm -rf /home/jsmith`, everyone is hosed as
they can no longer access the objects that were originally created
by Joe.  Then the administrator is off looking for backup tapes,
assuming he has them and they are valid.  One nice property of Git
(really any DVCS) is that the data is automatically backed up by
every developer participating in the project.  It's unlikely you
will lose the project that way.

Also this scheme doesn't really work well for packing.  I don't
think we'll pack the loose objects that we borrow from the other
developers, and Git packfiles are a major performance improvement
for all Git operations.  Plus they are very small, so they save a
lot of disk.

You might find that it takes up less total disk to have everyone
keep a complete (non --shared) copy of the project, but repack
regularly, than to have everyone using alternates against each
other and nobody repacks.

> Of course, this new distributed repository may be somewhat fragile since 
> the entire thing could become unusable if any portion was corrupted. 
> Just because you can do a thing, doesn't mean you should.

Yes, exactly.  ;-)

In my day-job repositories I have about 150 MiB of blobs that
are very common across a number of Git repositories.  I've made a
single repo that has all of those packed, and then set that up as an
alternate for everything else.  It saves a huge chunk of disk for us.
But that common-blob.git thing that I created never gets changed,
and never gets repacked.  It's sort of a "historical archive" for us.
Works very nicely.  Alternates have their uses...

-- 
Shawn.