Re: fsck option to remove corrupt objects - why/why not?

On Thu, Oct 16, 2014 at 2:13 AM, Ben Aveling <bena.001@xxxxxxxxxxxxxxx> wrote:
> On 14/10/2014 19:21, Jeff King wrote:
>> On Mon, Oct 13, 2014 at 09:37:27AM +1100, Ben Aveling wrote:
>>> A question about fsck - is there a reason it doesn't have an option to
>>> delete bad objects?
>>
>> If the objects are reachable, then deleting them would create other big
>> problems (i.e., we would be breaking the object graph!).
>
>
> The man page for fsck advises:
>
>    "Any corrupt objects you will have to find in backups or other
>    archives (i.e., you can just remove them and do an /rsync/ with some
>    other site in the hopes that somebody else has the object you have
>    corrupted)."
>
>
> And that seems sensible to me - the object is corrupt, it is unusable, the
> object graph is already broken, we already have big problems, removing the
> corrupt object(s) doesn't create any new problems, and it allows the
> possibility that the damaged objects can be restored.
>
> I ask because I have a corrupt repository, and every time I run fsck, it
> reports one corrupt object, then stops. I could write a script to repeatedly
> call fsck and then remove the next corrupt object, but it raises the
> question for me: could it make sense to extend fsck with the option to do
> the removes?

I am in favor of this idea. Yesterday a colleague of mine came to me
with a repo containing a single corrupt object (in a 1.2GB packfile).
We were lucky, since we had a copy of the repo with a good copy of the
same object. However, we were lucky in a couple of other respects, as
well:

I simply copied the packfile containing the good copy into the
corrupted repo, and then ran a "git gc", which "happened" to use the
good copy of the corrupted object and complete successfully (instead
of barfing on the bad copy). The GC then removed the old
(now-obsolete) packfiles, and thus the corruption was gone.
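
For the record, the recovery boiled down to something like the sketch
below (pack names and paths are placeholders; "good-repo" is the
pristine clone and "bad-repo" the corrupted one):

    # copy the pack (plus its .idx) that holds the good copy of the object
    cp good-repo/.git/objects/pack/pack-<sha1>.pack \
       good-repo/.git/objects/pack/pack-<sha1>.idx \
       bad-repo/.git/objects/pack/

    # repack; gc happened to pick the good copy and dropped the old packs
    cd bad-repo && git gc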

However, exactly _why_ git happened to prefer the good copy in my
copied packfile instead of the bad copy in the existing packfile, I do
not know. I suspect some amount of pure luck was involved. Indeed, I
feared I would have to explode the corrupt pack, then manually replace
the (now-loose) bad copy with a good copy (from a similarly exploded
pristine pack), and then finally repack everything again. That said,
I'm not at all sure that Git would be able to successfully explode a
pack containing corrupt objects...
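
For reference, that fallback would have looked roughly like this
(untested sketch; the object id and pack names are made up, and as said
I do not know whether unpack-objects copes with every kind of
corruption). Run from inside the corrupted repo:

    # move the corrupt pack out of the object database first, so that
    # unpack-objects does not think it already has the objects
    mv .git/objects/pack/pack-<bad>.pack .git/objects/pack/pack-<bad>.idx /tmp/

    # -r: keep going and recover as many objects as possible
    git unpack-objects -r < /tmp/pack-<bad>.pack

    # overwrite the bad (now-loose) object with the good copy taken from
    # a similarly exploded pristine repo
    cp ../good-repo/.git/objects/ab/cdef... .git/objects/ab/cdef...

    # finally repack everything again
    git repack -ad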

I think a better solution would be to tell fsck to remove the corrupt
object(s), as you suggest above, and then copy in the good pack. In
that case, there would be no question that the good copy would be used
in the subsequent GC.
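
Until then, the removal step itself has to be done by hand, e.g. for a
corrupt loose object reported by fsck (object id made up for the
example):

    # fsck names the corrupt object; if it lives as a loose object:
    rm .git/objects/ab/cdef0123...   # drop it so the good copy can take over
    git fsck --full                  # re-check; repeat if more corruption turns up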

> Or even better, do the removes and then do the necessary
> [r]sync, assuming the user has another repository that has a good copy of
> the bad objects, which in this case I do.

Hmm. I am not sure we want to automate the syncing step. First, git
cannot know _which_ remote is likely to have a good copy of the bad
object. Second, we do not necessarily know what caused the corruption
in the first place, and whether syncing with a remote (which will
create a certain amount of write activity on a possibly dying disk
drive) is a good idea at all. Finally, this syncing step would have to
bypass Git's usual reachability analysis (which happily skips fetching
a corrupt blob whose surrounding history is otherwise reachable), so it
is more involved than simply calling out to "git fetch"...
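
For completeness, the blunt rsync route that the fsck man page alludes
to would be something along these lines (host and path purely
illustrative, and it trusts the other repo completely):

    rsync -av other-host:/path/to/good-repo/.git/objects/ .git/objects/
    git fsck --full   # verify that the previously corrupt objects are now good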


...Johan

-- 
Johan Herland, <johan@xxxxxxxxxxx>
www.herland.net