Re: Proposal for missing blob support in Git repos

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Mon, 1 May 2017 12:12:27 -0700

On 04/30/2017 08:57 PM, Junio C Hamano wrote:
One thing I wonder is what the performance impact of a change like
this to the codepath that wants to see if an object does _not_ exist
in the repository.  When creating a new object by hashing raw data,
we see if an object with the same name already exists before writing
the compressed loose object out (or comparing the payload to detect
hash collision).  With a "missing blob" support, we'd essentially
spawn an extra process every time we want to create a new blob
locally, and most of the time that is done only to hear the external
command to say "no, we've never heard of such an object", with a
possibly large latency.

If we do not have to worry about that (or if it is no use to worry
about it, because we cannot avoid it if we wanted to do the lazy
loading of objects from elsewhere), then the patch presented here
looked like a sensible first step towards the stated goal.

Thanks.

Thanks for your comments. If you're referring to the codepath involving 
write_sha1_file() (for example, builtin/hash-object -> index_fd or 
builtin/unpack-objects), that is fine because write_sha1_file() invokes 
freshen_packed_object() and freshen_loose_object() directly to check if 
the object already exists (and thus does not invoke the new mechanism in 
this patch).

Having said that, looking at other parts of the fetching mechanism, 
there are a few calls to has_sha1_file() and others that might need to 
be checked. (We have already discussed one - the one in rev-list when 
invoked to check connectivity.) I could take a look at that, but was 
hoping for discussion on what I've sent so far (so that I know that I'm 
on the right track, and because it somewhat works, albeit slowly).