On 7/23/2020 4:59 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
>
>> Create a 'loose-objects' task for the 'git maintenance run' command.
>> This helps clean up loose objects without disrupting concurrent Git
>> commands using the following sequence of events:
>>
>>   1. Run 'git prune-packed' to delete any loose objects that exist
>>      in a pack-file. Concurrent commands will prefer the packed
>>      version of the object to the loose version. (Of course, there
>>      are exceptions for commands that specifically care about the
>>      location of an object. These are rare for a user to run on
>>      purpose, and we hope a user that has selected background
>>      maintenance will not be trying to do foreground maintenance.)
>
> OK. That would make sense.
>
>>   2. Run 'git pack-objects' on a batch of loose objects. These
>>      objects are grouped by scanning the loose object directories
>>      in lexicographic order until listing all loose objects -or-
>>      reaching 50,000 objects. This is more than enough if the
>>      loose objects are created only by a user doing normal
>>      development.
>
> I haven't seen this in action, but my gut feeling is that this would
> result in horrible locality and deltification in the resulting
> packfile. The order you feed the objects to pack-objects and the
> path hint you attach to each object matters quite a lot.
>
> I do agree that it would be useful to have a task to deal with only
> loose objects without touching existing packfiles. I just am not
> sure if 2. is a worthwhile thing to do. A poorly constructed pack
> will also contaminate later packfiles made without "-f" option to
> "git repack".

There are several factors going on here:

* In a partial clone, it is likely that we get loose objects only
  due to a command like "git log -p" that downloads blobs one-by-one.
  In such a case, this step coming in later and picking up those
  blobs _will_ find good deltas, because they are present in the
  same batch.

* (I know this case isn't important to core Git, but please indulge
  me.) In a VFS for Git repo, the loose objects correspond to blobs
  that were faulted in by a virtual filesystem read. In this case,
  the blobs are usually from a single commit in history, so good
  deltas between the blobs don't actually exist!

* My experience indicates that the packs created by the loose-objects
  task are rather small (when created daily). This means that they
  get selected by the incremental-repack task to be repacked into a
  new pack-file where deltas are recomputed with modest success. As
  mentioned in that task, we saw a significant compression factor
  using that step for users of the Windows OS repo, mostly due to
  recomputing tree deltas.

* Some amount of "extra" space is expected with this incremental
  repacking scheme. The most space-efficient thing to do is a full
  repack along with a tree walk that detects the paths used for each
  blob, allowing better hints for delta compression. However, that
  operation is very _time_ consuming.

The trade-off here is something I should make more explicit. In my
experience, disk space is cheap but CPU time is expensive. Most
repositories could probably do a daily repack without disrupting the
user. These steps enable maintenance for repositories where a full
repack is too disruptive.
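To make the shape of the task concrete, here is a rough shell
approximation of the sequence described above. This is only a sketch:
the real task is implemented in C as part of 'git maintenance run',
and the batch-size variable, the 'sed' munging, and the 'loose' pack
prefix below are stand-ins for illustration, not the exact
implementation.

  # Step 1: drop loose objects that already have a packed copy.
  git prune-packed --quiet

  # Step 2: collect loose object names by walking the two-character
  # fan-out directories in lexicographic order, cap the list at the
  # batch size, and hand the names to 'git pack-objects' on stdin.
  batch_size=50000
  find .git/objects/?? -type f 2>/dev/null |
    sed -e 's|.*objects/\(..\)/\(.*\)|\1\2|' |
    head -n "$batch_size" |
    git pack-objects --quiet .git/objects/pack/loose

The point of the sketch is only that step 2 sees whatever loose
objects step 1 left behind, so the delta quality of the resulting
pack depends almost entirely on how those objects became loose, which
is what the partial clone and VFS for Git cases above are describing.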
I hope this adds some context. I would love it if someone who knows
more about delta compression could challenge my assumptions. Sharing
that expertise can help create better maintenance strategies. Junio's
initial concern here is a good first step in that direction.

Thanks,
-Stolee