Re: [PATCH v2 0/4] [RFC] repack: add --filter=

On Mon, Feb 21, 2022 at 03:38:03PM +0000, Robert Coup wrote:
> Hi,
>
> On Mon, 21 Feb 2022 at 03:11, Taylor Blau <me@xxxxxxxxxxxx> wrote:
> >
> > we would still be leaving repository
> > corruption on the table, just making it marginally more difficult to
> > achieve.
>
> While reviewing John's patch I initially wondered if a better approach
> might be something like `git repack -a -d --exclude-stdin`, passing a
> list of specific objects to exclude from the new pack (sourced from
> rev-list via a filter, etc). To me this seems like a less dangerous
> approach, but my concern was it doesn't use the existing filter
> capabilities of pack-objects, and we end up generating and passing
> around a huge list of oids. And of course any mistakes in the list
> generation aren't visible until it's too late.

Yeah; I think the most elegant approach would have pack-objects do as
much work as possible, with repack in charge of coordinating what each
pack-objects invocation has to do.

> I also wonder whether there's a race condition if the repository gets
> updated? If you're moving large objects out in advance, then filtering
> the remainder there's nothing to stop a new large object being pushed
> between those two steps and getting dropped.

Yeah, we will want to make sure that we're operating on a consistent
view of the repository. If this is all done in-process, it won't be a
problem since we'll capture an atomic snapshot of the reference states
once. If this is done across multiple processes, we'll need to make sure
we're passing around that snapshot where appropriate.

For an example of the latter, see the `--refs-snapshot`-related code in
git-repack, which we use when writing a multi-pack bitmap.
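
To make that idea concrete, here is a rough sketch of the shape of it
(simplified: the snapshot file repack actually writes has its own
format, e.g. markers for preferred refs, but the structure is the same):

    # capture the ref tips exactly once...
    git for-each-ref --format='%(objectname)' >refs.snapshot
    # ...then hand that same snapshot to the later step, so it sees the
    # same view of the refs even if they move in the meantime
    git multi-pack-index write --bitmap --refs-snapshot=refs.snapshot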

> My other idea, which is growing on me, is whether repack could
> generate two valid packs: one for the included objects via the filter
> (as John's change does now), and one containing the filtered-out
> objects. `git repack -a -d --split-filter=<filter>` Then a user could
> then move/extract the second packfile to object storage, but there'd
> be no way to *accidentally* corrupt the repository by using a bad
> option. With this approach the race condition above goes away too.
>
>     $ git repack -a -d -q --split-filter=blob:limit=1m
>     pack-37b7443e3123549a2ddee31f616ae272c51cae90
>     pack-10789d94fcd99ffe1403b63b167c181a9df493cd
>
> First pack identifier being the objects that match the filter (ie:
> commits/trees/blobs <1m), and the second pack identifier being the
> objects that are excluded by the filter (blobs >1m).

I like this idea quite a bit. We also have a lot of existing tools that
would make an implementation fairly lightweight, namely pack-objects'
`--stdin-packs` mode.

Using that would look something like having `repack` generate the
filtered pack first, remembering its name [1]. After that, we would run
`pack-objects` again, this time with `--stdin-packs`, where the positive
packs are the ones we're going to delete, and the negative pack(s) are
the filtered one(s) generated in the previous step.

The second invocation would leave us with a single pack which represents
all of the objects in the packs we are about to delete, skipping any
objects that are in the filtered pack we just generated. In other words,
it would leave the repository with two packs: one with all of the
objects that meet the filter criteria, and one with all of the objects
that don't.
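
For concreteness, here is roughly what those two steps could look like.
(This is only a sketch: repack would drive it itself, the
"pack-OLD*.pack" names stand in for the packs being replaced, and step
one assumes pack-objects can write a filtered pack directly to disk,
which is what this series is about enabling.)

    # step 1: pack everything reachable that meets the filter, and
    # remember the checksum of the resulting pack
    filtered=$(git pack-objects --all --filter=blob:limit=1m \
        .git/objects/pack/pack </dev/null)

    # step 2: repack the to-be-deleted packs (positive), excluding
    # anything that already landed in the filtered pack (negative)
    {
        echo "pack-OLD1.pack"         # hypothetical packs we will delete
        echo "pack-OLD2.pack"
        echo "^pack-$filtered.pack"   # the filtered pack from step 1
    } | git pack-objects --stdin-packs .git/objects/pack/pack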

A client could then upload the "doesn't meet the filter criteria" pack
elsewhere, and then delete it locally. (I'm assuming this last part in
particular is orchestrated by some other command, and that we aren't
encouraging users to run "rm" inside of .git/objects/pack!)
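
If the option existed, that orchestration might look something like the
following, where `--split-filter` is still only the proposed option and
"upload-to-object-storage" is a stand-in for whatever tool actually does
the transfer:

    # assumes --split-filter prints the "meets the filter" pack name
    # first and the "doesn't meet the filter" pack name second
    packs=$(git repack -a -d -q --split-filter=blob:limit=1m)
    big=$(echo "$packs" | sed -n '2p')
    # hand the filtered-out pack to object storage, then let a dedicated
    # tool (not a bare "rm") retire the local copy
    upload-to-object-storage ".git/objects/pack/$big.pack"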

> An astute --i-know-what-im-doing reader could spot that you could just
> delete the second packfile and achieve the same outcome as the current
> proposed patch, subject to being confident the race condition hadn't
> happened to you.

Yeah, and I think this goes back to my joking remark in the last
paragraph: if we allow users to delete packs at will, all bets are off
regardless of how safely we generate those packs. But splitting the
repository into two packs and _then_ dealing with one of them
separately, as opposed to immediately deleting the objects which don't
meet the filter criteria, is moving in a safer direction.

> Thanks,
> Rob :)

Thanks,
Taylor

[1]: Optionally "name_s_", if we passed the `--max-pack-size` option to
`git pack-objects`, which we can trigger via a `git repack` option of
the same name.
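
For example, something like this (size illustrative) could already
produce several pack names from the first invocation, one per resulting
pack:

    git repack -a -d -q --max-pack-size=100m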


