Re: [RFC] Extending git-replace

Hi Kaushik,

On Mon, Jan 13, 2020 at 9:39 PM Kaushik Srenevasan <kaushik@xxxxxxxxxxx> wrote:
>
> We’ve been trying to get rid of objects larger than a certain size
> from one of our repositories that contains tens of thousands of
> branches and hundreds of thousands of commits. While we’re able to
> accomplish this using BFG[0], it results in ~90% of the repository’s
> history being rewritten. This presents the following problems:
> 1. There are various systems (Phabricator for one) that use the commit
> hash as a key in various databases. Rewriting history will require
> that we update all of these systems.

Not necessarily...

> 2. We’ll have to force everyone to reclone a copy of this repository.

True.

> I was looking through the git code base to see if there is a way
> around it when I chanced upon `git-replace`. While the basic idea of
> `git-replace` is what I am looking for, it doesn’t quite fit the bill
> due to the `--no-replace-objects` switch, the `GIT_NO_REPLACE_OBJECTS`
> environment variable, and `--no-replace-objects` being the default for
> certain git commands, namely fsck, upload-pack, pack/unpack-objects,
> prune, and index-pack. That Git may still try to load a replaced object
> when a git command is run with the `--no-replace-objects` option
> prevents me from removing it from the ODB permanently. Not being able
> to run prune and fsck on a repository where we’ve deleted the object
> that’s been replaced with git-replace effectively rules this option
> out for us.
>
> A feature that allowed such permanent replacement (say a
> `git-blacklist` or a `git-replace --blacklist`) might work as follows:
> 1. Blacklisted objects are stored as references under a new namespace
> -- `refs/blacklist`.
> 2. The object loader unconditionally translates a blacklisted OID into
> the OID it’s been replaced with.
> 3. The `+refs/blacklist/*:refs/blacklist/*` refspec is implicitly
> always a part of fetch and push transactions.
>
> This essentially turns the blacklist references namespace into an
> additional piece of metadata that gets transmitted to a client when a
> repository is cloned and is kept updated automatically.
>
> I’ve been playing around with a prototype I wrote and haven’t observed
> any breakage yet. I’m writing to seek advice on this approach and to
> understand if this is something (if not in its current form, some
> version of it) that has a chance of making it into the product if we
> were to implement it. Happy to write up a more detailed design and
> share my prototype as a starting point for discussion.

I'll get back to this in a minute, but wanted to point out a couple
other ideas for consideration:

1) You can rewrite history, and then use replace references to map old
commit IDs to new commit IDs.  This allows anyone to continue using
old commit IDs (which aren't even part of the new repository anymore)
in git commands and git automatically uses and shows the new commit
IDs.  No problems with fsck or prune or fetch either.  Creating these
replace refs is fairly simple if your repository rewriting program
(e.g. git-filter-repo or BFG Repo Cleaner) provides a mapping of old
IDs to new IDs, and if you are using git-filter-repo it even creates
the replace refs for you.  (One downside is that you can't use
abbreviated refs to refer to replace refs, so you can't use
abbreviated old commit IDs in this scheme.)
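
For other tools it's a short loop (a rough sketch -- the file name
here is from memory; BFG writes a two-column old->new map called
object-id-map.old-new in its report directory, so adjust the path for
whatever your tool produces):

    # Create a replace ref for every rewritten commit.  update-ref is
    # used instead of `git replace` because the old objects no longer
    # exist in the repository, and `git replace` would fail to look
    # them up.
    while read old new; do
        test "$old" = "$new" ||
            git update-ref "refs/replace/$old" "$new"
    done <object-id-map.old-new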

Another downside is that various repository hosting tools ignore
replace refs.  Thus if you try to browse to a commit in the web UI of
Gerrit or GitHub using an old commit ID, it'll just show you a
commit-not-found page.  Phabricator and GitLab may well be the same (I
haven't tried).  However, teaching these tools to pay attention to
replace refs would make this simple rewriting mechanism feel close to
seamless, other than asking people to reclone, and doing so might not
be too difficult, at least for the open source systems, though I admit
I haven't dug into it myself.

2) Some folks might be okay with a clone that won't pass fsck or
prune, at least in special circumstances.  We're actually doing that
on purpose to deal with one of our large repositories.  We don't
provide that to normal developers, but we do use "cheap, fake clones"
in our CI systems.  These slim clones have 99% of all objects, but
happen to be missing the really big ones, cutting download time to
about 1/7 of a full clone.  (And no, don't try to point out shallow
clones to me.  I hate those things, they're an awful hack, *and* they
don't work for us.  It's nice getting all commit history, all trees,
and most blobs, including everything from at least the last two years,
while still saving lots of space.)

[For the curious, I did make a simple script to create these "cheap,
fake clones" for repositories of interest.  See
https://github.com/newren/sequester-old-big-blobs.  But they are
definitely a hack with some sharp corners, with failing fsck and
prunes only being part of the story.]
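
[And if you're curious what such a script would sequester, here's an
illustrative one-liner -- not the actual script -- listing all blobs
of 10MB or more, with their sizes and paths:

    git rev-list --objects --all |
      git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
      awk '$1 == "blob" && $3 >= 10485760 { print $2, $3, $4 }'
]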


3) Back to your idea...

What you're proposing actually sounds very similar to partial clones,
whose idea is to make it okay to download a subset of history.  The
primary problems with partial clones are (a) they are still under
development and still experimental, and (b) they are currently
implemented with a "promisor" mode, meaning that if a command touches
any piece of missing data, it pauses while the missing objects are
downloaded from the server.  I want an offline mode (even
if I'm online) where only explicit downloading from the server (clone,
fetch, etc.) occurs.
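
(For reference, the existing blob-size filter is probably the closest
analogue to what you're after; the URL is just a placeholder:

    git clone --filter=blob:limit=1m https://git.example.com/repo.git

Any later command that needs one of the filtered-out blobs will fetch
it from the "promisor" remote on demand -- which is exactly the
behavior I'd rather be able to turn off.)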

Instead of inventing yet another partial-clone-like thing, it'd be
nice if your new mechanism could just be implemented in terms of
partial clones, extending them as you need.  I don't like the idea of
supporting multiple competing implementations of partial clones
within git.git, but if it's just some extensions of the existing
capability then it sounds great.  You may want to talk with Jonathan
Tan (cc'd) if you go this route, since he's the partial clone expert.



