Re: [PATCH v6 09/40] Add initial external odb support

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Fri, 29 Sep 2017 13:36:14 -0700

On Wed, 27 Sep 2017 18:46:30 +0200
Christian Couder <christian.couder@xxxxxxxxx> wrote:

> I am ok to split the patch series, but I am not sure that 01/40 to
> 09/40 is the right range for the first patch series.
> I would say that 01/40 to 07/40 is better as it can be seen as a
> separate refactoring.

I mentioned 09/40 because this is (as far as I can tell) the first one
that introduces a new design.

> I don't think single-shot processes would be a huge burden, because
> the code is simpler, and because for example for filters we already
> have single shot and long-running processes and no one complains about
> that. It's code that is useful as it makes it much easier for people
> to do some things (see the clone bundle example).
> 
> In fact in Git development we usually start to by first implementing
> simpler single-shot solutions, before thinking, when the need arise,
> to make it faster. So a perhaps an equally valid opinion could be to
> first only submit the patches for the single-shot protocol and later
> submit the rest of the series when we start getting feedback about how
> external odbs are used.

My concern is that, as far as I understand about the Microsoft use case,
we already know that we need the faster solution, so the need has
already arisen.

> And yeah I could change the order of the patch series to implement the
> long-running processes first and the single-shot process last, so that
> it could be possible to first get feedback about the long-running
> processes, before we decide to merge or not the single-shot stuff, but
> I don't think it would look like the most logical order.

My thinking was that we would just implement the long-running process
and not implement the single-shot process at all (besides maybe a script
in contrib/). If we are going to do both anyway, I agree that we should
do the single-shot process first.

> > And another possible issue is that we design ourselves into a corner.
> > Thinking about the use cases that I know about (the Android use case and
> > the Microsoft GVFS use case), I don't think we are doing that - for
> > Android, this means that large blob metadata needs to be part of the
> > design (and this patch series does provide for that), and for Microsoft
> > GVFS, "get" is relatively cheap, so a configuration option to not invoke
> > "have" first when loading a missing object might be sufficient.
> 
> If the helper does not advertise the "have" capability, the "have"
> instruction will not be sent to the helper, so the current design is
> already working for that case.

Ah, that's good to know.

> > And I think that my design can be extended to support a use case in
> > which, for example, blobs corresponding to a certain type of filename
> > (defined by a glob like in gitattributes) can be excluded during
> > fetch/clone, much like --blob-max-bytes, and they can be fetched either
> > through the built-in mechanism or through a custom hook.
> 
> Sure, we could probably rebuild something equivalent to what I did on
> top of your design.
> My opinion though is that if we want to eventually get to the same
> goal, it is better to first merge something that get us very close to
> the end goal and then add some improvements on top of it.

I agree - I mentioned that because I personally prefer to review smaller
patch sets at a time, and my patch set already includes a lot of the
same infrastructure needed by yours - for example, the places in the
code to dynamically fetch objects, exclusion of objects when fetching or
cloning, configuring the cloned repo when cloning, fsck, and gc.

> >  - I get compile errors when I "git am" these onto master. I think
> >    '#include "config.h"' is needed in some places.
> 
> It's strange because I get no compile errors even after a "make clean"
> from my branch.
> Could you show the actual errors?

I don't have the error messages with me now, but it was something about
a function being implicitly declared. You will probably get these errors
if you sync past commit e67a57f ("config: create config.h", 2017-06-15).

> > Any reason why you prefer to update the loose object functions than to
> > update the generic one (sha1_object_info_extended)? My concern with just
> > updating the loose object functions was that a caller might have
> > obtained the path by iterating through the loose object dirs, and in
> > that case we shouldn't query the external ODB for anything.
> 
> You are thinking about fsck or gc?
> Otherwise I don't think it would be clean to iterate through loose object dirs.

Yes, fsck and gc (well, prune, I think) do that. I agree that Git
typically doesn't do that (except for exceptional cases like fsck and
gc), but I was thinking about supporting existing code that does that
iteration, not introducing new code that does that.