Ideas to speed up repacking

I wanted to explore the idea of exploiting knowledge about 
previous repacks to help speed up future repacks.  

I had various ideas that seemed like they might be good 
places to start, but things quickly got away from me.  
Mainly I wanted to focus on reducing, and sometimes even 
eliminating, reachability calculations, since that seems to 
be the one major unsolved slow piece during repacking.

My first line of thinking goes like this:  "After a full 
repack, reachability of the current refs is known.  Exploit 
that knowledge for future repacks."  There are some very 
simple scenarios where if we could figure out how to 
identify them reliably, I think we could simply avoid 
reachability calculations entirely, and yet end up with the 
same repacked files as if we had done the reachability 
calculations.  Let me outline some to see if they make sense 
as a starting place for further discussion.

-------------

* Setup 1:  

  Do a full repack.  All loose and packed objects are added 
to a single pack file (assuming the git repack config 
options do not split objects across multiple packs).

* Scenario 1:

  Start with Setup 1.  Nothing in the repo contents has 
changed (no new objects/packs, refs all the same), but the 
repacking config options have changed (for example, the 
compression level).

* Scenario 2:

  Start with Setup 1.  One new pack file has been pushed to 
the repo by adding a new ref (existing refs did not 
change).

* Scenario 3: 

  Start with Setup 1.  One new pack file has been pushed to 
the repo by updating an existing ref with a fast forward.

* Scenario 4:

  Start with Setup 1.  Add some loose objects to the repo 
via a local fast-forward ref update (I am assuming this is 
possible without adding any new unreferenced objects?).


In all four scenarios, I believe we should be able to skip 
the history traversal entirely and simply grab all existing 
objects and repack them into a new pack file.  A rough 
sketch of what that might look like follows.
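
To make this concrete, here is a minimal sketch in Python 
(purely illustrative; repack_without_traversal is a made-up 
name, not a proposed API).  It relies on plumbing that 
already exists: git cat-file --batch-all-objects (git 2.6 
and later) enumerates every object, loose or packed, 
without walking history, and git pack-objects will pack an 
explicit object list fed to it on stdin.

    import subprocess

    def repack_without_traversal(repo, pack_base):
        # List every object, loose and packed, with no history walk.
        objs = subprocess.run(
            ["git", "-C", repo, "cat-file", "--batch-all-objects",
             "--batch-check=%(objectname)"],
            check=True, capture_output=True, text=True).stdout
        # Pack the explicit object list; no reachability walk happens.
        # pack-objects writes <pack_base>-<sha>.pack and prints the sha.
        out = subprocess.run(
            ["git", "-C", repo, "pack-objects", pack_base],
            input=objs, check=True, capture_output=True, text=True).stdout
        return out.strip()

The repack itself is the easy part; the hard part is 
proving cheaply that we really are in one of the four 
scenarios.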

-------------

Of the 4 scenarios above, it seems like #3 and #4 are very 
common operations (and #2 is perhaps even more common for 
Gerrit)?  If these scenarios can be reliably identified 
somehow, then perhaps that could be used to reduce 
repacking time for them, and later used as a building block 
to reduce repacking time for other related but slightly 
more complicated scenarios (with reduced history walking 
instead of none)?  A sketch of one way to check the 
fast-forward condition follows.
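
For the fast-forward part of scenarios 3 and 4, git already 
has a bounded check: git merge-base --is-ancestor exits 0 
when the first commit is reachable from the second.  That 
is still a walk, but one bounded by the distance between 
the two tips rather than by the whole history.  A sketch, 
assuming we saved each ref's old sha at the last repack 
(see the pack-<sha>.refs idea below):

    import subprocess

    def is_fast_forward(repo, old_sha, new_sha):
        # Exit status 0 means old_sha is an ancestor of new_sha,
        # i.e. the update from old_sha to new_sha was a fast forward.
        result = subprocess.run(
            ["git", "-C", repo, "merge-base", "--is-ancestor",
             old_sha, new_sha])
        return result.returncode == 0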

For example, to identify scenario 1, what if we kept a copy 
of all refs and their shas used during a full repack along 
with the newly repacked file?  A simplistic approach would 
store them in the same format as the packed-refs file, as 
pack-<sha>.refs.  During the next repack, if none of the 
refs have changed and there are no new objects, we know we 
are in scenario 1 and can skip the reachability calculation 
entirely.

Then, if none of the refs have changed but there are new 
objects, can we just throw the new objects away?
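
As a strawman, writing and checking such a snapshot could 
look like the sketch below (names and file location are 
illustrative only; a real version would need to worry about 
locking and races with concurrent ref updates):

    import subprocess

    def current_refs(repo):
        # packed-refs style lines: "<sha> <refname>"
        return subprocess.run(
            ["git", "-C", repo, "for-each-ref",
             "--format=%(objectname) %(refname)"],
            check=True, capture_output=True, text=True).stdout

    def snapshot_refs(repo, refs_path):
        # Written alongside the new pack, e.g. as pack-<sha>.refs.
        with open(refs_path, "w") as f:
            f.write(current_refs(repo))

    def refs_unchanged(repo, refs_path):
        # True when no ref has moved since the snapshot was taken.
        # Combined with "no new objects", this identifies scenario 1.
        try:
            with open(refs_path) as f:
                return f.read() == current_refs(repo)
        except FileNotFoundError:
            return False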

...

I am going to stop here because this email is long enough, 
and I wanted to get some feedback on these ideas before 
offering more solutions.

Thanks,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation
 