On 7/23/2020 4:59 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
>
>> Create a 'loose-objects' task for the 'git maintenance run' command.
>> This helps clean up loose objects without disrupting concurrent Git
>> commands using the following sequence of events:
>>
>>   1. Run 'git prune-packed' to delete any loose objects that exist
>>      in a pack-file. Concurrent commands will prefer the packed
>>      version of the object to the loose version. (Of course, there
>>      are exceptions for commands that specifically care about the
>>      location of an object. These are rare for a user to run on
>>      purpose, and we hope a user that has selected background
>>      maintenance will not be trying to do foreground maintenance.)
>
> OK. That would make sense.
>
>>   2. Run 'git pack-objects' on a batch of loose objects. These
>>      objects are grouped by scanning the loose object directories
>>      in lexicographic order until listing all loose objects -or-
>>      reaching 50,000 objects. This is more than enough if the
>>      loose objects are created only by a user doing normal
>>      development.
>
> I haven't seen this in action, but my gut feeling is that this would
> result in horrible locality and deltification in the resulting
> packfile. The order you feed the objects to pack-objects and the
> path hint you attach to each object matters quite a lot.
>
> I do agree that it would be useful to have a task to deal with only
> loose objects without touching existing packfiles. I just am not
> sure if 2. is a worthwhile thing to do. A poorly constructed pack
> will also contaminate later packfiles made without "-f" option to
> "git repack".

There are several factors going on here:

* In a partial clone, it is likely that we get loose objects only
  due to a command like "git log -p" that downloads blobs one-by-one.
  In such a case, this step coming in later and picking up those
  blobs _will_ find good deltas, because they are present in the
  same batch.

* (I know this case isn't important to core Git, but please indulge
  me.) In a VFS for Git repo, the loose objects correspond to blobs
  that were faulted in by a virtual filesystem read. In this case,
  the blobs are usually from a single commit in history, so good
  deltas between the blobs don't actually exist!

* My experience indicates that the packs created by the loose-objects
  task are rather small (when created daily). This means that they
  get selected by the incremental-repack task to be repacked into a
  new pack-file where deltas are recomputed with modest success. As
  mentioned in that task, we saw a significant compression factor
  using that step for users of the Windows OS repo, mostly due to
  recomputing tree deltas.

* Some amount of "extra" space is expected with this incremental
  repacking scheme. The most space-efficient thing to do is a full
  repack along with a tree walk that detects the paths used for each
  blob, allowing better hints for delta compression. However, that
  operation is very _time_ consuming.

The trade-off here is something I should make more explicit. In my
experience, disk space is cheap but CPU time is expensive. Most
repositories could probably do a daily repack without disrupting the
user. These steps enable maintenance for repositories where a full
repack is too disruptive.
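To make the shape of the task concrete, here is a rough shell
approximation of the sequence described above. This is only a sketch:
the real task is implemented in C as part of 'git maintenance run',
and the batch-size variable, the 'sed' munging, and the 'loose' pack
prefix below are stand-ins for illustration, not the exact
implementation.

  # Step 1: drop loose objects that already have a packed copy.
  git prune-packed --quiet

  # Step 2: collect loose object names by walking the two-character
  # fan-out directories in lexicographic order, cap the list at the
  # batch size, and hand the names to 'git pack-objects' on stdin.
  batch_size=50000
  find .git/objects/?? -type f 2>/dev/null |
    sed -e 's|.*objects/\(..\)/\(.*\)|\1\2|' |
    head -n "$batch_size" |
    git pack-objects --quiet .git/objects/pack/loose

The point of the sketch is only that step 2 sees whatever loose
objects step 1 left behind, so the delta quality of the resulting
pack depends almost entirely on how those objects became loose, which
is what the partial clone and VFS for Git cases above are describing.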
I hope this adds some context. I would love it if someone who knows
more about delta compression could challenge my assumptions. Sharing
that expertise can help create better maintenance strategies. Junio's
initial concern here is a good first step in that direction.

Thanks,
-Stolee