Hi Damien, James, (and 4 other who voted for the topic) I had been thinking about 'missing' blobs for a long while as a earlier 'partial clone' concept (unpublished) On 15/03/2020 22:19, Damien Robert wrote: > From James Ramsay, Thu 12 Mar 2020 at 14:57:24 (+1100) : >> 6. Elijah: replace refs helps, but not supported by hosts like GitHub etc >> a. Stolee: breaks commit graph because of generation numbers. >> b. Replace refs for blobs, then special packfile, there were edge cases. > I am interested in more details on how to handle this using replace. > > My situation: coworkers push big files by mistake, I don't want to rewrite > history because they are not too well versed with git, but I want to keep > *my* repo clean. > > Partial solution: > - identify the large blobs (easy) > - write a replace ref (easy): > $ git replace b5f74037bb91 $(git hash-object -w -t blob /dev/null) > and replace the file (if it is still in the repo) by an empty file. Here, my idea was to create a deliberately malformed blob object that would allow self reference to say "this blob is deliberately missing". (i.e. the same content would exist under two oids, one valid, one invalid) The change would require extra code (more below). Managing the verification of the replacement is a bigger problem, especially if already pushed to a server. > Now the pain points start: > - first the index does not handle replace (I think), so the replaced file > appear as changed in git status, even through eg git diff shows nothing. > > => Solution: configure .git/info/sparse-checkout > > - secondly, I want to remove the large blob from my repo. > > Ideally I'd like to repack everything but filter this blob, except that > repack does not understand --filter. So I need to use `git pack-objects` > directly and then do the naming and clean up that repack usually does > manually, which is error prone. > > Furthermore, while `git pack-objects` accepts --filter, I can only filter on > blob size, not blob oid. (there is filter=sparse:oid where I could reuse my > sparse checkout file, but I would need to make a blob of it first). And if I > have one large file I want to keep, I cannot filter by blob size. > > Another solution would be to use `git unpack-objects` to unpack all objects > (except I would need to do that in an empty git dir), remove the blob, and > then repack everything. > > Am I missing a simpler solution? > > - finally, checkouting to a ref including the replaced (now missing) blob > gives error messages of the form: > error: invalid object 100644 b5f74037bb91c45606b233b0ad6aad86f8e3875e for 'Silverman-Height-NonTorsion.pdf' > > On the one hand it is reassuring that git checks that the real object > (rather than only the replaced object) is still there, on the other hand it > would be nice to ask git to completely forget about the original object > (except fsck of course). > > Thanks, > Damien My notes on the "13. Obliterate" ideas. 1. If the object is in the wild & is dangerous : Stop: Failed: Damage limitation. 2. If the object is external, but still tame : Seek and recapture; either treat as internal, or treat as wild [1]. 3. The object is in captivity, even if distributed around an enclosure. Proceed to vaccination [4]. 4. Create new blob object with exact content "Git revoke: <oid>" (or similar) This object includes the embedded object type coding as part of the object format. This object is/becomes part of the git signature/oid commit hierarchy. This should (ultimately) be on 'master' branch as it is the verifier for the obliteration. 5. In the old revoked object <oid>, replace the object content (after zlib etc) with the same content as created in step 4. This deliberately _malformed_ object would normally cause fsck to barf. see [6] 6. However here we/fsck would detect the length and prefix of the (barfed) object contents and so determine its oid (the oid of the content). This results in an oid equal to that found in 4. which can be looked up and determined to be a self referral to this obliterated oid, so an fsck 'pass:obliterated' result is returned. This content could be actually be stored in any removed file if checked out! Consequences: Packs and other served object contents no longer contain the revoked oid. Hygiene/vaccination needs applied to other distributed recipients of the former defective object. Possible attacks: Attacker removes other important commits/blobs/trees by adding a 'revoke' which propagates to other users: Separate the hygiene cycle from the initial server revocation. For trees(?) and commits the message is "revoke <oid> <use-oid>". But where to 'hold' the commit & tree (maybe require that tree revoke is treated as a commit revoke, so the the new tree is got for free). We still need the new commit to be walked by fsck/gc, and the old oid contents to be gc'd. For a 'commit' revocation it (the new msg/trees/revision) could maybe be a 2nd (or third parent after {0}) so a 'normal walk finds it, but probably that's just a recipe for disaster. Maybe a revocation reflog that doesn't expire? or can be rebuilt (fsck would extend it's lost/found to include a revoked list). The new (XY) problem is now one of tying in the new revoked blob to the 'old' commit/tree hierarchy which only handles tracked files! Maybe its a .revoked file (like a .gitignore) which has a list of the old oids and has actual blobs attached under a .revoked tree. Also need to make sure that re-packing is done if the blob/tree/commit was a delta-base at the point of obliteration. Also need to prompt the local user, just in case it's a spoof!. Plus need a way of 'sending' the revocation. (and flag for what to do about a fetch pack containing a revocation for which we have the original, esp if we have it as a pack that will take a long time to recreate. Need a way of writing the 'defective' object (more code). Newhash transition. When histories are rewritten, then the obliterated artefacts are truly removed. For new repos using the newhash then the revocation mechanism is essentially the same other than extending the nominal size of the revocation objects. Perhaps use the 'submodule' commit object type (i.e 'stuff held elsewhere') for the holder of the revoked ID (for commits & trees). This could be locked into the history (details not fully thought through..). If there is a design error within Git, its the lack of an 'after the fact' redaction mechanism (and how it is spread across branches and distributed users/servers) - not easy. Philip