Re: [PATCH] Prevent megablobs from gunking up git packs

On 5/22/07, Jakub Narebski <jnareb@xxxxxxxxx> wrote:
> Dana How wrote:
> > There's actually an even more extreme example from my day job.
> > The software team has a project whose files/revisions would be
> > similar to those in the linux kernel (larger commits, I'm sure).
> > But they have *ONE* 500MB file they check in because it takes
> > 2 or 3 days to generate and different people use different versions of it.
> > I'm sure it has 50+ revisions now. If they converted to git and included
> > these blobs in their packfile, that's a 25GB uncompressed increase!
> > *Every* git operation must wade through 10X -- 100X more packfile.
> > Or it could be kept in 50+ loose objects in objects/xx,
> > requiring a few extra syscalls by each user to get a new version.
>
> Or keeping those large objects in a separate, _kept_ packfile, containing
> only those objects (which can delta well, even if they are large).

Yes: just before settling on the maxblobsize approach, I experimented
with various changes to git-repack that had it create .keep files.
The problem with a 12GB+ repo is not just that each repack is slow,
but that the repack time keeps growing with the repo size.  So, with
split packs, I had repack create .keep files for every new pack except
the last (fragmentary) one.  The next repack would then only repack
new objects plus that single fragmentary pack, keeping repack time
from growing (until you deleted the .keep files [just the ones with
"repack" in them] to start over from scratch).  But this approach does
not distribute commits and trees across the packs all that well.
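
For reference, the manual version of that trick with stock git looks
roughly like the sketch below.  It is only an illustration: my actual
changes did this inside git-repack itself, and the "repack" marker
written into each .keep file is how they could be found again later.

  # After a repack has produced new split packs, mark every pack except
  # the newest (fragmentary) one as kept.  `git repack -a -d` leaves
  # kept packs alone, so the next run only touches loose objects plus
  # that last fragmentary pack.
  cd .git/objects/pack &&
  for p in $(ls -t pack-*.pack | tail -n +2); do
      keep=${p%.pack}.keep
      # Tag the .keep file so these can be distinguished (and deleted)
      # later when starting over from scratch.
      test -e "$keep" || echo repack >"$keep"
  done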

Last night, before signing off, Junio proposed some partitioning ideas.
He presented them as ordering things *within* one pack; what I had
tried was making repack operate in two passes: the first created
pack(s) containing commits + trees + tags, the second created pack(s)
containing only blobs.  Of course the first group ended up as one tiny
pack and the second as 6 or 7 enormous packs.  I also combined this
with the scheme from the previous paragraph, putting .keep files on
all but the last pack in each group.  Then the metadata always got
repacked, and the blob data only got its "tail" repacked.
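
For anyone who wants to see the shape of that split without a patched
git-repack, it can be approximated with plumbing alone.  The sketch
below is illustrative only: the temporary file names are made up, and
the real thing would need the same care around pack installation that
git-repack already takes.

  # Rough sketch: one pack of commits/trees/tags, one pack of blobs.
  rm -f /tmp/meta.list /tmp/blob.list

  # Split the full object list by type (forking cat-file per object is
  # slow, but this is only to show the idea).
  git rev-list --objects --all |
  while read sha path; do
      if test "$(git cat-file -t "$sha")" = blob
      then echo "$sha $path" >>/tmp/blob.list
      else echo "$sha $path" >>/tmp/meta.list
      fi
  done

  # Build each pack and install it under the usual pack-<sha1> name
  # so git will find it.
  pack_from_list () {
      name=$(git pack-objects .git/objects/pack/tmp <"$1") &&
      mv .git/objects/pack/tmp-$name.pack .git/objects/pack/pack-$name.pack &&
      mv .git/objects/pack/tmp-$name.idx  .git/objects/pack/pack-$name.idx
  }
  pack_from_list /tmp/meta.list   # pass 1: commits + trees + tags
  pack_from_list /tmp/blob.list   # pass 2: blobs only
  git prune-packed                # drop loose copies of packed objects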

Let's just stipulate that you've convinced me that putting everything
in packs, and not ejecting megablobs, is as good as or better than the
alternative on the "central" git repository that will replace (part of)
our Perforce repository.  What about the users' repositories?

Each person at my day job has his own workstation.  The workstations
are all on a grid and are constantly running jobs in the background.
Each person would have at least one personal repo.  What should the
packing strategy be there?

(1) If we must put everything in packs, then we could:
(1a) Repack everything in local repos, incurring large local runtimes.
     This extra work denies those CPU cycles to the grid, which WILL
     be noticed and cause much whining.  So the response will be to
     reduce window and/or turn on nodelta for some group of objects,
     worsening packing and failing to squash the whining.  This
     happens across 20 to 30 workstations.  Or we reduce the frequency
     of repacking and stagger it across the network.  Since the daily
     pull/fetch/checkout ("sync" in p4 parlance) grabs 400+ new
     revisions, making repacking weekly gives a policy that results in
     400*5/2 = 1000 extra loose blobs on average (none right after a
     repack, about 2000 just before the next one), and there will
     still be whining.  Why not just set maxblobsize to some size
     resulting in ~1000 loose blobs, leave window/depth at default,
     and enjoy <1hr repacking?
(1b) Repack everything ONLY in the central repo, and have the users'
     repos point to it as an alternate.  Now we have enormous network
     traffic.  However, this is better than (1a), and it is what I
     thought I'd be stuck with.  We still have the possible problem of
     excessive packing time on the central repo, but it's easier to
     solve/hide in just one place.
(2) We repack everything but leave megablobs loose.  Now the packfiles
    come to 13MB, repack time with default window/depth is <1hr, and
    we can repack each user's repository from his own cron job (a
    sketch of this setup follows the list).  This will be noticed, but
    it won't cause too much complaining.  Most git operations by users
    will be against their local repos, but the server's object
    database will still be an alternate, at least for fetching
    megablobs.  This is no problem compared to Perforce, which stores
    *NO* repository state locally at all.
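
Concretely, the per-workstation setup for (1b)/(2) would look
something like the lines below.  The paths and the cron schedule are
invented for illustration, and the megablob part relies on the
proposed maxblobsize behavior rather than on anything in stock git.

  # One-time setup: let the local repo borrow objects from the central
  # repository's object database (path is an example).
  echo /net/central/project.git/objects \
      >>~/work/project/.git/objects/info/alternates

  # Nightly cron entry: repack the local repo with default window/depth.
  # With the proposed maxblobsize limit, megablobs stay loose (or are
  # read through the alternate) instead of being repacked every night,
  # which is what keeps this step cheap.
  # 0 3 * * * cd ~/work/project && git repack -a -d -q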

I really think megablob ejection from packs makes a lot of sense for
local repos on a network of workstations.  It lets me keep almost all
repo state locally very cheaply.  It is just another instance of the
general tendency: an adequate solution that operates principally on
only 13MB of data doesn't have to work as hard or as carefully as
something operating on the full 12GB -- three orders of magnitude
more data.

If there's interest, I could submit my other alterations to
git-repack.  They still have bugs that would take a while to work out,
since each run operates on 12GB of data.  With its much shorter
runtimes, maxblobsize was far quicker to debug even though I made more
stupid mistakes at first ;-)

Thanks,
--
Dana L. How  danahow@xxxxxxxxx  +1 650 804 5991 cell
