Re: [RFC] Extending git-replace

Elijah Newren <newren@xxxxxxxxx> · Tue, 14 Jan 2020 12:39:49 -0800

On Tue, Jan 14, 2020 at 11:05 AM Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote:
>
> > That is, would it be sufficient if every replaced file were replaced
> > with the exact text "me caga en la leche" instead of a custom hand-
> > crafted replacement?  I guess it's a bit complicated because while
> > that's a reasonable blob, it's not a valid commit.  So maybe this
> > mechanism would be limited to blobs.  I thought about whether we could
> > a different flavor of replacement for commits, but those generally have
> > to be custom because they each have different parents.
>
> Since the original email just discussed blobs, I'll confine myself to
> discussing blobs. (Commits are trickier, as you said.)
>
> > And if that would be sufficient, could promisors be used for this?  I
> > don't know how those interact with fsck and the other commands that
> > you're worried about.  Basically, the idea would be to use most of the
> > existing promisor code, and then have a mode where instead of visiting
> > the promisor, we just always return "me caga en la leche" (and this
> > does not have its SHA checked, of course).

Maybe; it doesn't necessarily need to be the same object returned, and
these replacements could be user-specified via replace refs...

> Missing promisor objects do not prevent fsck from passing - this is part
> of the original design (any packfiles we download from the specifically
> designated promisor remote are marked as such, and any objects that the
> objects in the packfile refer to are considered OK to be missing).

Is there ever a risk that objects in the downloaded packfile come
across as deltas against other objects that are missing/excluded, or
does the partial clone machinery ensure that doesn't happen?  (Because
this was certainly the biggest pain-point with my "fake cheap clone"
hacks.)

> Currently, when a missing object is read, it is first fetched (there are
> some more details that I can go over if you have any specific
> questions). What you're suggesting here is to return a fake blob with
> wrong hash - I haven't looked at all the callers of read-object
> functions in detail, but I don't think all of them are ready for such a
> behavioral change.

git-replace already took care of that for you and provides that
guarantee, modulo the --no-replace-objects & fsck & prune & fetch &
whatnot cases that ignore replace objects as Kaushik mentioned.  I
took advantage of this to great effect with my "fake cheap clone"
hacks.  Based in part on your other email where you made a suggestion
about promisors, I'm starting to think a pretty good first cut
solution might look like the following:

  * user manually adds a bunch of replace refs to map the unwanted big
blobs to something else (e.g. a README about how the files were
stripped, or something similar to this)
  * a partial clone specification that says "exclude objects that are
referenced by replace refs"
  * add a fake promisor to the downloaded promisor pack so that if
anyone runs with --no-replace-objects or similar then they get an
error saying the specified objects don't exist and can't be
downloaded.

Anyone see any obvious problems with this?

>  Maybe it would be sufficient to just make this work
> in a more limited scope (e.g. checkout only - and if we need different
> replacement blobs for different object IDs, maybe we could have
> something similar to the clean/smudge filters).

>
> > This could work together with some sort refs/blacklist mechanism to
> > enable the server to choose which objects the client replaces.
>
> In the original email, Kaushik mentioned objects larger than a certain
> size - we already have support for that (--filter=blob:limit=1000000,
> for example). Having said that, Git is already able to tolerate any
> exclusion (of tree or blob) from the server - we already need this in
> order to support changing of filters, for example.