Re: [PATCH 06/10] rev-list: add --allow-partial option to relax connectivity checks

Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> · Thu, 9 Mar 2017 13:38:35 -0500

On 3/9/2017 2:56 AM, Jeff King wrote:
On Wed, Mar 08, 2017 at 03:10:54PM -0500, Jeff Hostetler wrote:

Even though I do very much like the basic "high level" premise to
omit often useless large blobs that are buried deep in the history
we would not necessarily need from the initial cloning and
subsequent fetches, I find it somewhat disturbing that the code
"Assume"s that any missing blob is due to an previous partial clone.
Adding this option smells like telling the users that they are not
supposed to run "git fsck" because a partially cloned repository is
inherently a corrupt repository.

Can't we do a bit better?  If we want to make the world safer again,
what additional complexity is required to allow us to tell the
"missing by design" and "corrupt repository" apart?

I'm open to suggestions here.  It would be nice to extend the
fetch-pack/upload-pack protocol to return a list of the SHAa
(and maybe the sizes) of the omitted blobs, so that a partial
clone or fetch would still be able to be integrity checked.

Yeah, the early external-odb patches did this. It lets you do a more
accurate fsck, and it also helps diff avoid faulting in large-object
cases (because we can mark them as binary for "free" by comparing the
size to big_file_threshold).

So I think it makes a lot of sense in the large-blob case, where
transmitting a type/size/sha1 tuple is way more efficient than sending
the blob itself. But it's less clear for "sparse" cases where just
enumerating the set of blobs may be prohibitively large.

I have a feeling that the "sparse" thing needs to be handled separately
from "partial". IOW, the client needs to tell the server "I'm only
interested in the path foo/bar, so just send that". Then you don't find
out about the types and sizes outside of that path, but you don't need
to; the sparse path is stored locally and fsck knows to avoid looking
into it.

-Peff

That makes sense.  I'd like to get both concepts (by-size/special vs
sparse-file) in, but they don't really overlap that much (internally).
So I could see doing this in 2 separate efforts.

Thanks,
Jeff