Re: [RFC] Add --create-cache to repack

Shawn Pearce <spearce@xxxxxxxxxxx> · Fri, 28 Jan 2011 06:37:22 -0800

On Fri, Jan 28, 2011 at 01:08, Johannes Sixt <j.sixt@xxxxxxxxxxxxx> wrote:
> Am 1/28/2011 9:06, schrieb Shawn O. Pearce:
>> A cache pack is all objects reachable from a single commit that is
>> part of the project's stable history and won't disappear, and is
>> accessible to all readers of the repository.  By containing only that
>> commit and its contents, if the commit is reached from a reference we
>> know immediately that the entire pack is also reachable.  To help
>> ensure this is true, the --create-cache flag looks for a commit along
>> refs/heads and refs/tags that is at least 1 month old, working under
>> the assumption that a commit this old won't be rebased or pruned.
>
> In one of my repositories, I have two stable branches and a good score of
> topic branches of various ages (a few hours up to two years 8). The topic
> branches will either be dropped eventually, or rebased.
>
> What are the odds that this choice of a tip commit picks one that is in a
> topic branch? Or is there no point in using --create-cache in a repository
> like this?

Argh, you are right.  Its quite likely this would pick a topic
branch... and that isn't really what is desired.

My original concept here was for distribution point repositories,
which are less likely to have these topic branches that will rebase
and disappear.  Though git.git has one called "pu".  *sigh*

A simple fix is to use --heads --tags by default like I do here, but
make the actual parameters we feed to rev-list configurable.  A
repository owner could select only the master branch as input to
rev-list, making it less likely the topic branches would be
considered.  Unfortunately that requires direct access to the
repository.  It fails for a site like GitHub, where you don't manage
the repository at all.

git.git also is problematic because of the man, html and todo
branches.  Branches that are disconnected from the main history but
are very small (e.g. todo) might be selected instead and create a
nearly useless cache file.  Fortunately disconnected branches could
each have their own cache file (with only the inode overhead of having
an additional 3 files per disconnected branch), and pack-objects could
concat all of those packs together when sending.  Its just a challenge
to identify these branches and keep them from being used for that main
project pack.

This started because I was looking for a way to speed up clones coming
from a JGit server.  Cloning the linux-2.6 repository is painful, it
takes a long time to enumerate the 1.8 million objects.  So I tried
adding a cached list of objects reachable from a given commit, which
speeds up the enumeration phase, but JGit still needs to allocate all
of the working set to track those objects, then go find them in packs
and slice out each compressed form and reformat the headers on the
wire.  Its a lot of redundant work when your kernel repository has
360MB of data that you know a client needs if they have asked for your
master branch with no "have" set.

Later I realized, we can get rid of that cached list of objects and
just use the pack itself.  Its far cleaner, as there is no redundant
cache.  But either way (object list or pack) its a bit of a challenge
to automatically identify the right starting points to use.  Linus
Torvalds' linux-2.6 repository is the perfect case for the RFC I
posted, its one branch with all of the history, and it never rewinds.
But maybe Linus is just very unique in this world.  :-)

-- 
Shawn.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html