On 7/23/2020 6:00 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
>
>> 1. 'git multi-pack-index write' creates a multi-pack-index file if
>>    one did not exist, and otherwise will update the multi-pack-index
>>    with any new pack-files that appeared since the last write. This
>>    is particularly relevant with the background fetch job.
>>
>>    When the multi-pack-index sees two copies of the same object, it
>>    stores the offset data into the newer pack-file. This means that
>>    some old pack-files could become "unreferenced", which I will use
>>    to mean "a pack-file that is in the pack-file list of the
>>    multi-pack-index but none of the objects in the multi-pack-index
>>    reference a location inside that pack-file."
>
> An obvious alternative is to favor the copy in the older pack,
> right? Is the expectation that over time, most of the objects that
> are relevant would reappear in newer packs, so that eventually by
> favoring the copies in the newer packs, we can retire and remove the
> old pack, keeping only the newer ones?
>
> But would that assumption hold? The old packs hold objects that are
> necessary for the older parts of the history, so unless you are
> cauterizing away the old history, these objects in the older packs
> are likely to stay with us longer than those used by the newer parts
> of the history, some of which may not even have been pushed out yet
> and can be rebased away?

If we created a new pack-file containing an already-packed object,
then shouldn't we assume that the new pack-file does a _better_ job
of compressing that object? Or at least, doesn't make it worse?

For example, if we use 'git multi-pack-index repack --batch-size=0',
then this creates a new pack-file containing every previously-packed
object. This new pack-file should have better delta compression than
the previous setup across multiple pack-files. We want this new
pack-file to be used, not the old ones.

This "pick the newest pack" strategy is also what allows us to safely
use the 'expire' option to drop old pack-files. If we always keep the
old copies, then when we try to switch to the new pack-files, we
cannot delete the old packs safely, because a concurrent Git process
could be trying to reference them. One race is as follows:

 * Process A opens the multi-pack-index referring to old pack P. It
   doesn't open the pack-file, as it hasn't tried to parse objects
   yet.

 * Process B is 'git multi-pack-index expire'. It sees that pack P
   can be dropped because all of its objects appear in newer
   pack-files. It deletes P.

 * Process A tries to read from P, and this fails. A must reprepare
   its representation of the pack-files.

This is the less disruptive race, since A can recover with a small
cost to its performance. The worse race (on Windows) is this:

 * Process A loads the multi-pack-index and tries to parse an object
   by loading "old" pack P.

 * Process B tries to delete P. However, on Windows, the handle that
   A holds on P prevents the deletion.

At this point, there could be two resolutions. The first is to have
the 'expire' fail because we can't delete P. This means we might
never delete P in a busy repository. The second is that the 'expire'
command continues and drops P from the list in the multi-pack-index.
However, now all Git processes add P to the packed_git list because
it isn't referenced by the multi-pack-index.

In summary: the decision to pick the newer copies is a fundamental
part of how the write->expire->repack loop was designed.
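Here is a tiny standalone model of that interaction between the
"newest pack wins" rule and 'expire' (the pack names, mtimes, and
object ids are invented for illustration; this is not the code in
midx.c):

#include <stdio.h>
#include <string.h>

/*
 * Toy model: when the same object appears in two packs, record the
 * copy in the newer pack; a pack that ends up with no recorded
 * copies is "unreferenced" and safe for 'expire' to delete.
 */
#define NR_PACKS 2

static const char *pack_name[NR_PACKS] = {
	"pack-old.pack", "pack-new.pack"
};
static const long pack_mtime[NR_PACKS] = {
	1590000000, 1595000000
};

struct copy {
	const char *oid;
	int pack_id;
};

int main(void)
{
	/* both objects appear in the old pack and again in the new one */
	struct copy copies[] = {
		{ "abc123", 0 }, { "abc123", 1 },
		{ "def456", 0 }, { "def456", 1 },
	};
	size_t nr = sizeof(copies) / sizeof(copies[0]);
	int referenced[NR_PACKS] = { 0, 0 };
	size_t i, j;

	/* for each copy, record only the newest pack holding that object */
	for (i = 0; i < nr; i++) {
		int best = copies[i].pack_id;
		for (j = 0; j < nr; j++)
			if (!strcmp(copies[j].oid, copies[i].oid) &&
			    pack_mtime[copies[j].pack_id] > pack_mtime[best])
				best = copies[j].pack_id;
		referenced[best] = 1;
	}

	for (i = 0; i < NR_PACKS; i++)
		if (!referenced[i])
			printf("%s is unreferenced; 'expire' may delete it\n",
			       pack_name[i]);
	return 0;
}

Running it reports that pack-old.pack is unreferenced, which is
exactly the state that makes the 'expire' step safe.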
>> 2. 'git multi-pack-index expire' deletes any unreferenced pack-files
>>    and updates the multi-pack-index to drop those pack-files from
>>    the list. This is safe to do as concurrent Git processes will see
>>    the multi-pack-index and not open those packs when looking for
>>    object contents. (Similar to the 'loose-objects' job, there are
>>    some Git commands that open pack-files regardless of the
>>    multi-pack-index, but they are rarely used. Further, a user that
>>    self-selects to use background operations would likely refrain
>>    from using those commands.)
>
> OK.
>
>> 3. 'git multi-pack-index repack --batch-size=<size>' collects a set
>>    of pack-files that are listed in the multi-pack-index and creates
>>    a new pack-file containing the objects whose offsets are listed
>>    by the multi-pack-index to be in those pack-files. The set of
>>    pack-files is selected greedily by sorting the pack-files by
>>    modified time and adding a pack-file to the set if its "expected
>>    size" is smaller than the batch size until the total expected
>>    size of the selected pack-files is at least the batch size. The
>>    "expected size" is calculated by taking the size of the pack-file
>>    divided by the number of objects in the pack-file and multiplied
>>    by the number of objects from the multi-pack-index with offset in
>>    that pack-file. The expected size approximats how much data from that
>
> approximates.
>
>>    pack-file will contribute to the resulting pack-file size. The
>>    intention is that the resulting pack-file will be close in size
>>    to the provided batch size.
>
>> +static int maintenance_task_incremental_repack(void)
>> +{
>> +	if (multi_pack_index_write()) {
>> +		error(_("failed to write multi-pack-index"));
>> +		return 1;
>> +	}
>> +
>> +	if (multi_pack_index_verify()) {
>> +		warning(_("multi-pack-index verify failed after initial write"));
>> +		return rewrite_multi_pack_index();
>> +	}
>> +
>> +	if (multi_pack_index_expire()) {
>> +		error(_("multi-pack-index expire failed"));
>> +		return 1;
>> +	}
>> +
>> +	if (multi_pack_index_verify()) {
>> +		warning(_("multi-pack-index verify failed after expire"));
>> +		return rewrite_multi_pack_index();
>> +	}
>> +	if (multi_pack_index_repack()) {
>> +		error(_("multi-pack-index repack failed"));
>> +		return 1;
>> +	}
>
> Hmph, I wonder if these warnings should come from the helper
> functions that are static to this function anyway.
>
> It also makes it easier to reason about this function by eliminating
> the need for having a different pattern only for the verify helper.
> Instead, verify could call rewrite internally when it notices a
> breakage. I.e.
>
>     if (multi_pack_index_write())
>         return 1;
>     if (multi_pack_index_verify("after initial write"))
>         return 1;
>     if (multi_pack_index_expire())
>         return 1;
>     ...

This is a cleaner model. I'll work on that.

> Also, it feels odd, compared to our internal API convention, that
> positive non-zero is used as an error here.

...

> Exactly the same comment as 08/18 about natural/inherent ordering
> applies here as well.

I'll leave these to be resolved in the earlier messages.

Thanks,
-Stolee
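P.S. To make the "expected size" heuristic from point 3 above
concrete, here is a rough standalone sketch of the selection loop
(the pack names, sizes, and object counts are invented; this is not
the implementation in midx.c):

#include <stdio.h>
#include <stdint.h>

struct pack_info {
	const char *name;
	uint64_t pack_size;     /* on-disk size of the pack */
	uint32_t nr_objects;    /* objects stored in the pack */
	uint32_t nr_referenced; /* objects the midx points into this pack */
};

/* size-per-object times the number of still-referenced objects */
static uint64_t expected_size(const struct pack_info *p)
{
	return p->pack_size / p->nr_objects * p->nr_referenced;
}

int main(void)
{
	/* assume these are already sorted by modified time */
	struct pack_info packs[] = {
		{ "pack-a.pack", 40ull << 20, 10000,  2000 },
		{ "pack-b.pack", 15ull << 20,  6000,  6000 },
		{ "pack-c.pack", 90ull << 20, 30000, 29000 },
	};
	uint64_t batch_size = 30ull << 20; /* e.g. --batch-size=30m */
	uint64_t total = 0;
	size_t i;

	/* add packs until the total expected size reaches the batch size */
	for (i = 0; i < 3 && total < batch_size; i++) {
		uint64_t est = expected_size(&packs[i]);
		if (est >= batch_size)
			continue; /* this pack alone exceeds the batch */
		printf("select %s (expected %llu bytes)\n",
		       packs[i].name, (unsigned long long)est);
		total += est;
	}
	return 0;
}

With these numbers, pack-a and pack-b are selected (about 24 MB of
expected data), pack-c is skipped as too large, and the resulting
pack should land near the 30 MB batch size.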