On Wed, Mar 08, 2017 at 03:10:54PM -0500, Jeff Hostetler wrote: > > Even though I do very much like the basic "high level" premise to > > omit often useless large blobs that are buried deep in the history > > we would not necessarily need from the initial cloning and > > subsequent fetches, I find it somewhat disturbing that the code > > "Assume"s that any missing blob is due to an previous partial clone. > > Adding this option smells like telling the users that they are not > > supposed to run "git fsck" because a partially cloned repository is > > inherently a corrupt repository. > > > > Can't we do a bit better? If we want to make the world safer again, > > what additional complexity is required to allow us to tell the > > "missing by design" and "corrupt repository" apart? > > I'm open to suggestions here. It would be nice to extend the > fetch-pack/upload-pack protocol to return a list of the SHAa > (and maybe the sizes) of the omitted blobs, so that a partial > clone or fetch would still be able to be integrity checked. Yeah, the early external-odb patches did this. It lets you do a more accurate fsck, and it also helps diff avoid faulting in large-object cases (because we can mark them as binary for "free" by comparing the size to big_file_threshold). So I think it makes a lot of sense in the large-blob case, where transmitting a type/size/sha1 tuple is way more efficient than sending the blob itself. But it's less clear for "sparse" cases where just enumerating the set of blobs may be prohibitively large. I have a feeling that the "sparse" thing needs to be handled separately from "partial". IOW, the client needs to tell the server "I'm only interested in the path foo/bar, so just send that". Then you don't find out about the types and sizes outside of that path, but you don't need to; the sparse path is stored locally and fsck knows to avoid looking into it. -Peff