On Wed, Jul 26, 2023 at 1:04 AM Taylor Blau <me@xxxxxxxxxxxx> wrote: > > On Mon, Jul 24, 2023 at 10:59:06AM +0200, Christian Couder wrote: > > + /* > > + * names has a confusing double use: it both provides the list > > + * of just-written new packs, and accepts the name of the > > + * filtered pack we are writing. > > + * > > + * By the time it is read here, it contains only the pack(s) > > + * that were just written, which is exactly the set of packs we > > + * want to consider kept. > > + */ > > I think that this comment partially comes from the cruft pack code, > where we use the `names` string list both to reference existing packs at > the start of the repack, and to keep track of the pack we just wrote (to > exclude its contents from the cruft pack). > > But I think we only write into "names" via finish_pack_objects_cmd() to > record the name of the pack we just wrote containing objects which > didn't meet the filter's conditions. > > So I think that leaving this comment in is OK, but TBH I was on the > fence when I wrote that back in f9825d1cf75 (builtin/repack.c: support > generating a cruft pack, 2022-05-20), so I would just as soon drop it. I made the comment smaller in version 4. > > + in = xfdopen(cmd.in, "w"); > > + for_each_string_list_item(item, names) > > + fprintf(in, "^%s-%s.pack\n", pack_prefix, item->string); > > + for_each_string_list_item(item, existing_packs) > > + fprintf(in, "%s.pack\n", item->string); > > > + for_each_string_list_item(item, existing_kept_packs) > > + fprintf(in, "^%s.pack\n", item->string); > > I think we may only want to do this if `honor_pack_keep` is zero. > Otherwise we'd avoid packing objects that appear in kept packs, even if > the caller told us to include objects found in kept packs. In version 4 I have made changes to better support kept packfiles and related options, including adding tests. > > @@ -858,6 +912,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix) > > N_("limits the maximum number of threads")), > > OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"), > > N_("maximum size of each packfile")), > > + OPT_STRING(0, "filter", &po_args.filter, N_("args"), > > + N_("object filtering")), > > I suppose we're storing the filter as a string here because we're just > going to pass it down to pack-objects directly. That part makes sense, > but I think we are producing subtly inconsistent behavior when > specifying multiple --filter options. > > IIRC, passing --filter more than once down to pack-objects produces a > filter whose objects match all of the individually specified > sub-filters. But IIUC, using OPT_STRING here means that later > `--filter`'s override earlier ones. > > So I think at minimum we'd want to store the filter arguments in a > strvec. But I would probably just as soon parse them into a bona-fide > list_objects_filter_options struct, and then reconstruct the arguments > to pack-objects based on that. In version 4 a `list_objects_filter_options` struct is now used, and there is a test to check that more than one `--filter=<filter-spec>` option is supported. > > + git -C bare.git -c repack.writebitmaps=false repack -a -d --filter=blob:none && > > + test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack && > > + commit_pack=$(test-tool -C bare.git find-pack HEAD) && > > + test -n "$commit_pack" && > > I wonder if the test-tool itself should exit with a non-zero code if it > can't find the given object in any pack. It would at least allow us to > drop the "test -n $foo" after every invocation of the test-helper in > this test. > > Arguably callers may want to ensure that an object doesn't exist in any > pack, and this would be inconvenient for them, since they'd have to > write something like: > > test_must_fail test-tool find-pack $obj > > but I think a more direct test like > > test_must_fail git cat-file -t $obj > > would do just as well. Thanks for these suggestions, but I prefered to add the `--check-count <n>` option to `test-tool find-pack` in version 4. This way `--check-count 0` or `-c 0` for short can be used to check that an object is in no packfile, though it could be for example in a promisor remote or a loose object file. It's also nice to be able to check that an object is in exactly 2 packfiles in some cases. > > + blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) && > > + test -n "$blob_pack" && > > + test "$commit_pack" != "$blob_pack" && > > + tree_pack=$(test-tool -C bare.git find-pack HEAD^{tree}) && > > + test "$tree_pack" = "$commit_pack" && > > + blob_pack2=$(test-tool -C bare.git find-pack HEAD:file2) && > > + test "$blob_pack2" = "$blob_pack" > > +' > > This all looks good, but I think there are a couple of more things that > we'd want to test for here: > > - That the list of all objects appears the same before and after all > of the repacking. I think that this is tested implicitly already in > your test, but having it written down explicitly would harden this > against regressions that cause us to inadvertently delete an object > we shouldn't have. I don't think we need to test this. `git pack-objects --filter=<filter-spec>` already existed before this series and is tested elsewhere. We can trust that command and its tests, and just check that we used it correctly by checking that only a few objects are in the right packfiles. > (FWIW, I think this would be limited to running something like "git > cat-file --batch-check='%(objectname)' --batch-all-objects" before > and after all of the repacking, and ensuring that the two test_cmp > without failure). I agree that it would not be difficult to do. I just think it's not necessary. > - Another thing that I don't think we're testing here is that objects > that *don't* match the filter don't appear in one of the filtered > packs. In version 4 we do test that for some objects, as `test-tool find-pack -c 1 $object` would error out if the object is in more than one packfile. > I think we'd probably want to assert on the exact contents of > the pack by dumping the list of objects into a file like "expect", > and then dumping the actual set of objects with "git show-index > <$idx | cut -d' ' -f2" or something. > > Another thought from the OPT_STRING business above is that we probably > want to test this with non-trivial filter arguments. There are probably > a handful of interesting cases here, like passing `--no-filter`, passing > `--filter` multiple times, passing invalid values for `--filter`, etc. In version 4 there is one test passing `--filter=...` multiple times. I think this is enough, as the `list_objects_filter_options` struct and related functions and mechanisms are tested elsewhere already. > > +test_expect_success '--filter fails with --write-bitmap-index' ' > > + test_must_fail git -C bare.git repack -a -d --write-bitmap-index \ > > + --filter=blob:none && > > Do we want to ensure that we get the exit code corresponding with > showing the usage text? I could go either way, but I do think that we > should grep through the output on stderr to ensure that we get the > appropriate error message. I am not sure that testing the exit code and the stderr output is always needed. Here I think that this test is more for documentation purposes than really enforcing something important. In fact if the behavior would change and `--write-bitmap-index` would understand that it should write an MIDX instead of a regular index, that behavior change could be considered in some ways as an improvement and we would only need to remove 'test_must_fail' here. > > + git -C bare.git repack -a -d --no-write-bitmap-index \ > > + --filter=blob:none > > I don't think that this test is adding anything that the above > "repacking with a filter works" test isn't covering already. Ok, I have removed it in version 4. Thanks!