On 2/10/2020 9:50 AM, Johannes Berg wrote:
> On Mon, 2020-02-10 at 09:46 -0500, Derrick Stolee wrote:
>
>> Part of my initial plan was to have this incremental file format.
>> The commit-graph uses a very similar mechanism. The difference may
>> be that you likely allow multiple .midx files found by scanning the
>> pack directory,
>
> Right, just scan and use any midx that exist, then compare the packs in
> there against all the packs found, and then remove any packs that
> actually *are* in an midx from the search list. That leaves you with all
> information, but optimised by midx where possible.
>
>> but I would expect something like the
>> "commit-graph-chain" file that provides an ordered list of the
>> incremental files. This can be important for deciding when to merge
>> layers or delete old files, and would be critical to the possibility
>> of converting reachability bitmaps to rely on a stable object order
>> stored in the multi-pack-index instead of pack-order.
>
> Right, if we delete then we have to also remove any midx covering the
> deleted pack, that's pretty rare in bup as a backup tool though.
>
>> The reason the multi-pack-index has not become incremental is that
>> VFS for Git no longer needs to write it very often. We write the
>> entire multi-pack-index during a background job that triggers once
>> per day. If we needed to write it more frequently, then the incremental
>> format would be more important to us.
>
> So, wait, what if a new pack is created? Does it just get used in
> addition to the multi-pack-index, if it's not covered by it, like I
> described above?
>
> If so, I guess it wouldn't actually really matter here. I was afraid
> (but didn't check yet) that git would always use only the single multi-
> pack-index file, and not also search additional packs, so that it always
> has to be maintained in "perfect order" ...

Git loads the multi-pack-index file, which includes a sorted list of the
packs it covers.
It then scans the "pack" directory for pack-indexes and checks if they
are covered by the multi-pack-index. If not, then Git will add them to
the packed_git struct and use them as normal. The hope is that this list
of "uncovered" packs is small compared to the data covered by the
multi-pack-index. This allows Git to continue functioning after an action
like "git fetch" that adds a new pack but may not want to rewrite the
multi-pack-index.

Our background maintenance essentially runs these commands:

 1. git multi-pack-index write
 2. git multi-pack-index expire
 3. git multi-pack-index repack

Step 1 ensures all packs are pulled into the multi-pack-index. Step 2
deletes any pack-files whose objects are contained in newer pack-files.
Step 3 creates a new pack-file containing all objects from a set of
small pack-files (using the --batch-size=X option).

This process helps incrementally reduce the size and number of packs.
That may be helpful for your backup tool, too.

Perhaps after an incremental multi-pack-index is added, then Git could
(optionally) have a mode that only checks the multi-pack-index to avoid
scanning the packs directory. It would require inserting a
multi-pack-index write into the index-pack logic so Git. I'm not sure if
that mode would be helpful, since the pack directory scan is typically
done once per command and is relatively fast.

>> That said: if someone wanted to contribute an incremental format,
>> then I would be happy to review it!
>
> I might still get motivated to do so :-)

YOU CAN DO IT! (Did that help?)

-Stolee
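
For illustration, the lookup described in this reply (load the sorted pack list from the multi-pack-index, scan the pack directory, keep whatever is not covered) can be sketched roughly as below. This is only a Python sketch under stated assumptions, not Git's actual C code: the function name `uncovered_packs`, the use of plain string lists, and the example file names are all hypothetical.

```python
from bisect import bisect_left


def uncovered_packs(midx_packs, pack_dir_indexes):
    """Return the pack-indexes that the multi-pack-index does not cover.

    midx_packs: sorted list of pack-index names recorded in the
        multi-pack-index (hypothetical representation).
    pack_dir_indexes: names found by scanning the "pack" directory.
    """
    uncovered = []
    for name in pack_dir_indexes:
        # Binary search works because the covered list is sorted.
        pos = bisect_left(midx_packs, name)
        covered = pos < len(midx_packs) and midx_packs[pos] == name
        if not covered:
            # These packs would be used "as normal" alongside the midx.
            uncovered.append(name)
    return uncovered


# Placeholder names: three packs covered by the midx, one new pack
# (for example, from a fetch) that has not been written into it yet.
midx = sorted(["pack-1a.idx", "pack-2b.idx", "pack-3c.idx"])
on_disk = ["pack-1a.idx", "pack-2b.idx", "pack-3c.idx", "pack-9f.idx"]
print(uncovered_packs(midx, on_disk))  # ['pack-9f.idx']
```

As long as the uncovered list stays short, most object lookups are served by the multi-pack-index, and only the few remaining packs are searched individually.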