# Intro

Last year, John Cai sent 2 versions of a patch series to implement
`git repack --filter=<filter-spec>` and later I sent 4 versions of a
patch series trying to do it a bit differently:

  - https://lore.kernel.org/git/pull.1206.git.git.1643248180.gitgitgadget@xxxxxxxxx/
  - https://lore.kernel.org/git/20221012135114.294680-1-christian.couder@xxxxxxxxx/

In these patch series, the `--filter=<filter-spec>` option removed the
filtered out objects altogether, which was considered very dangerous
even though we implemented different safety checks in some of the
latter series.

In some discussions, it was mentioned that such a feature, or a
similar feature in `git gc`, or in a new standalone command (perhaps
called `git prune-filtered`), should put the filtered out objects into
a new packfile instead of deleting them.

Recently there were internal discussions at GitLab about either moving
blobs from inactive repos onto cheaper storage, or moving large blobs
onto cheaper storage. This led us to rethink repacking using a filter,
but moving the filtered out objects into a separate packfile instead
of deleting them.

So here is a new patch series doing that while implementing the
`--filter=<filter-spec>` option in `git repack`.

# Use cases for the new feature

This could be useful for example for the following purposes:

  1) As a way for servers to save storage costs by for example moving
     large blobs, or all the blobs, or all the blobs in inactive
     repos, to separate storage (while still making them accessible
     using for example the alternates mechanism).

  2) As a way to use partial clone on a Git server to offload large
     blobs to, for example, an http server, while using multiple
     promisor remotes (to be able to access everything) on the client
     side. (In this case the packfile that contains the filtered out
     objects can be manually removed after checking that all the
     objects it contains are available through the promisor remote.)
  3) As a way for clients to reclaim some space when they cloned with
     a filter to save disk space but then fetched a lot of unwanted
     objects (for example when checking out old branches) and now want
     to remove these unwanted objects. (In this case they can first
     move the packfile that contains filtered out objects to a
     separate directory or storage, then check that everything works
     well, and then manually remove the packfile after some time.)

As the features and the code are quite different from those in the
previous series, I decided to start a new series instead of continuing
a previous one.

Also, since version 2 of this new series, the commit messages don't
mention use cases like 2) or 3) above, as people have different
opinions on how it should be done. How it should be done could depend
a lot on the way promisor remotes are used, the software and hardware
setups used, etc., so it seems more difficult to "sell" this series by
talking about such use cases. As use case 1) seems simpler and more
appealing, it makes more sense to only talk about it in the commit
messages.

# Changes since version 2

Thanks to Junio who reviewed both version 1 and 2, and to Taylor who
reviewed version 1! The changes are the following:

  - In patch 5/8, which introduces the `--filter=<filter-spec>`
    option, some explanations about how to find which new packfile
    contains the filtered out objects have been added to the commit
    message following Junio's comments.

  - In patch 5/8, it was clarified in the commit message that `git
    pack-objects` is run twice in a row (and not in parallel) to
    implement the new option according to Junio's comments.

  - In patch 5/8 also, the documentation of the new option says that
    `--no-write-bitmap-index` (or the `repack.writebitmaps` config
    option set to `false`) should be used along with the option as
    otherwise writing bitmap index will fail. And a corresponding new
    test called '--filter fails with --write-bitmap-index' has been
    added to t/t7700-repack.sh.
    This should address Taylor's comments about v1 that were not
    addressed by v2.

  - In patch 7/8, which implements the `--filter-to=<dir>` option, the
    commit message now recommends setting up the Git alternates
    mechanism before this option is used, to make sure the directory
    specified by the new option is accessible by the repo, as the repo
    could otherwise be corrupted. It also says that in some cases it
    might not be necessary to use such a mechanism, which is why the
    feature doesn't check that the specified directory is accessible.
    The documentation of the new option also loudly warns that the
    repo could be corrupted if the pack is not accessible, for example
    through the Git alternates mechanism, and has a new link to that
    mechanism's documentation. This is to address Junio's comments.

  - In patch 8/8, which implements the `gc.repackFilterTo` config
    option, a similar loud warning has been added, and similar doc
    changes have been made, to the documentation of the new config
    option (which corresponds to the `--filter-to=<dir>` command line
    option).

# Commit overview

* 1/8 pack-objects: allow `--filter` without `--stdout`

    This patch is the same as in v1 and v2. To be able to later repack
    with a filter we need `git pack-objects` to write packfiles when
    it's filtering instead of just writing the pack without the
    filtered out objects to stdout.

* 2/8 t/helper: add 'find-pack' test-tool

    No change in this patch compared to v1 and v2. For testing `git
    repack --filter=...` that we are going to implement, it's useful
    to have a test helper that can tell which packfiles contain a
    specific object.

* 3/8 repack: refactor finishing pack-objects command

    No change in this patch compared to v2. This is a small
    refactoring creating a new useful function, so that `git repack
    --filter=...` will be able to reuse it.

* 4/8 repack: refactor finding pack prefix

    No change in this patch compared to v2. This is another small
    refactoring creating a small function that will be reused in the
    next patch.

* 5/8 repack: add `--filter=<filter-spec>` option

    This actually adds the `--filter=<filter-spec>` option. It uses
    one `git pack-objects` process with the `--filter` option, and
    then another `git pack-objects` process with the `--stdin-packs`
    option. Only the commit message, documentation and tests have been
    changed a bit since v2.

* 6/8 gc: add `gc.repackFilter` config option

    No change in this patch compared to v2 and v1. This is a gc config
    option so that `git gc` can also repack using a filter and put the
    filtered out objects into a separate packfile.

* 7/8 repack: implement `--filter-to` for storing filtered out objects

    For some use cases, it's interesting to create the packfile that
    contains the filtered out objects in a separate location. This is
    similar to the `--expire-to` option for cruft packfiles. Only the
    commit message and the documentation have changed since version 2.
    They now explain and discuss the risks of using this option
    without making sure the specified directory is accessible by the
    repo.

* 8/8 gc: add `gc.repackFilterTo` config option

    This allows specifying the location of the packfile that contains
    the filtered out objects when using `gc.repackFilter`. As with the
    previous commit, since v2, the doc now explains and discusses the
    risks of using this option without making sure the specified
    directory is accessible by the repo.

# Range-diff since v2

1:  0bd1ad3071 = 1:  4d75a1d7c3 pack-objects: allow `--filter` without `--stdout`
2:  e49cd723c7 = 2:  fdf9b6e8cc t/helper: add 'find-pack' test-tool
3:  3f87772ea6 = 3:  e7cfdebc78 repack: refactor finishing pack-objects command
4:  9997efaf33 = 4:  9c51063795 repack: refactor finding pack prefix
5:  da27ecb91b ! 5:  a90e8045c3 repack: add `--filter=<filter-spec>` option
    @@ Commit message
         This new option puts the objects specified by `<filter-spec>` into a
         separate packfile.
     
    -    This could be useful if, for example, some large blobs take a lot of
    +    This could be useful if, for example, some large blobs take up a lot of
         precious space on fast storage while they are rarely accessed. It could
         make sense to move them into a separate cheaper, though slower, storage.
     
         In other use cases it might make sense to put all the blobs into
         separate storage.
     
    -    This is done by running two `git pack-objects` commands. The first one
    -    is run with `--filter=<filter-spec>`, using the specified filter. It
    -    packs objects while omitting the objects specified by the filter.
    -    Then another `git pack-objects` command is launched using
    +    It's possible to find which new packfile contains the filtered out
    +    objects using one of the following:
    +
    +      - `git verify-pack -v ...`,
    +      - `test-tool find-pack ...`, which a previous commit added,
    +      - `--filter-to=<dir>`, which a following commit will add to specify
    +        where the pack containing the filtered out objects will be.
    +
    +    This feature is implemented by running `git pack-objects` twice in a
    +    row. The first command is run with `--filter=<filter-spec>`, using the
    +    specified filter. It packs objects while omitting the objects specified
    +    by the filter. Then another `git pack-objects` command is launched using
         `--stdin-packs`. We pass it all the previously existing packs into its
         stdin, so that it will pack all the objects in the previously existing
         packs. But we also pass into its stdin, the pack created by the previous

    @@ Documentation/git-repack.txt: depth is 4095.
     +	that objects used in the working directory are not filtered
     +	out. So for the split to fully work, it's best to perform it
     +	in a bare repo and to use the `-a` and `-d` options along with
    -+	this option. See linkgit:git-rev-list[1] for valid
    -+	`<filter-spec>` forms.
    ++	this option. Also `--no-write-bitmap-index` (or the
    ++	`repack.writebitmaps` config option set to `false`) should be
    ++	used otherwise writing bitmap index will fail, as it supposes
    ++	a single packfile containing all the objects. See
    ++	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
     +
      -b::
      --write-bitmap-index::

    @@ t/t7700-repack.sh: test_expect_success 'auto-bitmaps do not complain if unavaila
     +	blob_pack2=$(test-tool -C bare.git find-pack HEAD:file2) &&
     +	test "$blob_pack2" = "$blob_pack"
     +'
    ++
    ++test_expect_success '--filter fails with --write-bitmap-index' '
    ++	test_must_fail git -C bare.git repack -a -d --write-bitmap-index \
    ++		--filter=blob:none &&
    ++
    ++	git -C bare.git repack -a -d --no-write-bitmap-index \
    ++		--filter=blob:none
    ++'
     +
      objdir=.git/objects
      midx=$objdir/pack/multi-pack-index

6:  49e4a184b4 = 6:  335b7f614d gc: add `gc.repackFilter` config option
7:  243c93aad3 ! 7:  b1be7f60b7 repack: implement `--filter-to` for storing filtered out objects
    @@ Commit message
         It would be nice if this new different pack could be created in a
         different directory than the regular pack. This would make it possible
         to move large blobs into a pack on a different kind of storage, for
    -    example cheaper storage. Even in a different directory this pack can be
    -    accessible if, for example, the Git alternates mechanism is used to
    -    point to it.
    +    example cheaper storage.
    +
    +    Even in a different directory, this pack can be accessible if, for
    +    example, the Git alternates mechanism is used to point to it. In fact
    +    not using the Git alternates mechanism can corrupt a repo as the
    +    generated pack containing the filtered objects might not be accessible
    +    from the repo any more. So setting up the Git alternates mechanism
    +    should be done before using this feature if the user wants the repo to
    +    be fully usable while this feature is used.
    +
    +    In some cases, like when a repo has just been cloned or when there is no
    +    other activity in the repo, it's Ok to setup the Git alternates
    +    mechanism afterwards though. It's also Ok to just inspect the generated
    +    packfile containing the filtered objects and then just move it into the
    +    '.git/objects/pack/' directory manually. That's why it's not necessary
    +    for this command to check that the Git alternates mechanism has been
    +    already setup.
     
         While at it, as an example to show that `--filter` and `--filter-to`
         work well with other options, let's also add a test to check that these

    @@ Commit message

      ## Documentation/git-repack.txt ##
     @@ Documentation/git-repack.txt: depth is 4095.
    - 	this option. See linkgit:git-rev-list[1] for valid
    - 	`<filter-spec>` forms.
    + 	a single packfile containing all the objects. See
    + 	linkgit:git-rev-list[1] for valid `<filter-spec>` forms.
      
     +--filter-to=<dir>::
     +	Write the pack containing filtered out objects to the
    -+	directory `<dir>`. This can be used for putting the pack on a
    -+	separate object directory that is accessed through the Git
    -+	alternates mechanism. Only useful with `--filter`.
    ++	directory `<dir>`. Only useful with `--filter`. This can be
    ++	used for putting the pack on a separate object directory that
    ++	is accessed through the Git alternates mechanism. **WARNING:**
    ++	If the packfile containing the filtered out objects is not
    ++	accessible, the repo could be considered corrupt by Git as it
    ++	migh not be able to access the objects in that packfile. See
    ++	the `objects` and `objects/info/alternates` sections of
    ++	linkgit:gitrepository-layout[5].
     +
      -b::
      --write-bitmap-index::

    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
      				&existing_nonkept_packs,

      ## t/t7700-repack.sh ##
    -@@ t/t7700-repack.sh: test_expect_success 'repacking with a filter works' '
    - 	test "$blob_pack2" = "$blob_pack"
    +@@ t/t7700-repack.sh: test_expect_success '--filter fails with --write-bitmap-index' '
    + 		--filter=blob:none
      '
      
     +test_expect_success '--filter-to stores filtered out objects' '

8:  8cb3faa74c ! 8:  ed66511823 gc: add `gc.repackFilterTo` config option
    @@ Documentation/config/gc.txt: gc.repackFilter::
     +gc.repackFilterTo::
     +	When repacking and using a filter, see `gc.repackFilter`, the
     +	specified location will be used to create the packfile
    -+	containing the filtered out objects. See the
    -+	`--filter-to=<dir>` option of linkgit:git-repack[1].
    ++	containing the filtered out objects. **WARNING:** The
    ++	specified location should be accessible, using for example the
    ++	Git alternates mechanism, otherwise the repo could be
    ++	considered corrupt by Git as it migh not be able to access the
    ++	objects in that packfile. See the `--filter-to=<dir>` option
    ++	of linkgit:git-repack[1] and the `objects/info/alternates`
    ++	section of linkgit:gitrepository-layout[5].
     +
      gc.rerereResolved::
      	Records of conflicted merge you resolved earlier are

Christian Couder (8):
  pack-objects: allow `--filter` without `--stdout`
  t/helper: add 'find-pack' test-tool
  repack: refactor finishing pack-objects command
  repack: refactor finding pack prefix
  repack: add `--filter=<filter-spec>` option
  gc: add `gc.repackFilter` config option
  repack: implement `--filter-to` for storing filtered out objects
  gc: add `gc.repackFilterTo` config option

 Documentation/config/gc.txt            |  16 +++
 Documentation/git-pack-objects.txt     |   4 +-
 Documentation/git-repack.txt           |  23 ++++
 Makefile                               |   1 +
 builtin/gc.c                           |  10 ++
 builtin/pack-objects.c                 |   8 +-
 builtin/repack.c                       | 162 ++++++++++++++++++-------
 t/helper/test-find-pack.c              |  35 ++++++
 t/helper/test-tool.c                   |   1 +
 t/helper/test-tool.h                   |   1 +
 t/t5317-pack-objects-filter-objects.sh |   8 ++
 t/t6500-gc.sh                          |  23 ++++
 t/t7700-repack.sh                      |  90 ++++++++++++++
 13 files changed, 332 insertions(+), 50 deletions(-)
 create mode 100644 t/helper/test-find-pack.c

-- 
2.41.0.384.ged66511823
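
For readers who want to try the two mechanisms the commit messages rely on, here is a minimal sketch using only long-standing Git commands. It finds which packfile contains a blob with `git verify-pack -v`, then registers a separate object directory in `objects/info/alternates` before moving a pack there. The throwaway repo, the `file1` name and the `cheap-storage` directory are made up for illustration, and the script moves a whole pack by hand. It does not use the `--filter` or `--filter-to` options from this series.

```shell
#!/bin/sh
# Sketch only: the repo, "file1" and "cheap-storage" are hypothetical names.
set -e

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Create a tiny repo whose objects all live in a single pack.
git init -q "$tmp/repo"
cd "$tmp/repo"
echo "some content" >file1
git add file1
git -c user.name=Example -c user.email=example@example.com \
	commit -q -m "initial"
git repack -a -d -q

# Find which pack contains the blob: `git verify-pack -v` lists one
# object per line, starting with the object ID.
blob=$(git rev-parse HEAD:file1)
found_pack=
for idx in .git/objects/pack/pack-*.idx
do
	if git verify-pack -v "$idx" | grep -q "^$blob "
	then
		found_pack=${idx%.idx}.pack
	fi
done
echo "blob $blob is in $found_pack"

# Register the separate object directory as an alternate *before*
# moving the pack, so the objects never become inaccessible.
alt="$tmp/cheap-storage"
mkdir -p "$alt/pack"
echo "$alt" >.git/objects/info/alternates
mv .git/objects/pack/pack-* "$alt/pack/"

# The blob is still readable, now through the alternate.
git cat-file -e "$blob" && echo "blob still reachable through the alternate"
```

Writing the alternates file first mirrors the recommendation in patch 7/8: if the pack were moved before the alternate was registered, Git could not find its objects in the meantime.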