Re: [PATCH v6 09/40] Add initial external odb support

On Tue, Sep 19, 2017 at 7:45 PM, Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote:
> I wonder if it's better to get a change like this (PATCH v6 09/40 and
> any of the previous patches that this depends on) in and then build on
> it rather than to review the whole patch set at a time.  This would
> reduce ripple effects (of needing to change later patches in a patch set
> multiple times unnecessarily) and help collaboration (in that multiple
> people can write patches, since the foundation would already be laid).

I am ok with splitting the patch series, but I am not sure that 01/40
to 09/40 is the right range for the first series.
I would say that 01/40 to 07/40 is better, as it can be seen as a
separate refactoring.

> The same concerns about fsck apply, but that shouldn't be a problem,
> since this patch provides the same internal API as mine ("get" function
> taking in a single hash, and "have" function taking in a single hash) so
> it shouldn't be too difficult to adapt my fsck and gc patches [1]. (I
> can do that if necessary.)
>
> [1] https://public-inbox.org/git/20170915134343.3814dc38@xxxxxxxxxxxxxxxxxxxxxxxxxxx/

Great! I would be happy with such an outcome.

> One possible issue (with committing something early) is that later work
> (for example, a fast long-running process protocol) will make the
> earlier work (for example, here, a simple single-shot protocol)
> obsolete, while saddling us with the necessity of maintaining the
> earlier one. To that end, if we want to start with the support for a
> hook, a better approach might be to only code the fast long-running
> process protocol, and put a wrapper script in contrib/ that can wrap a
> single-shot process in a long-running process.

I don't think single-shot processes would be a huge burden, because
the code is simpler, and because, for filters for example, we already
have both single-shot and long-running processes and no one complains
about that. This code is useful, as it makes it much easier for people
to do some things (see the clone bundle example).
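
For example, Git LFS configures both variants side by side today: Git
prefers the long-running "process" command when it is set, and the
single-shot clean/smudge commands remain as the simple fallback:

  # How Git LFS sets up both the single-shot and the long-running
  # filter variants (the "process" one is preferred when available).
  git config filter.lfs.clean "git-lfs clean -- %f"
  git config filter.lfs.smudge "git-lfs smudge -- %f"
  git config filter.lfs.process "git-lfs filter-process"
  git config filter.lfs.required true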

In fact, in Git development we usually start by implementing a simpler
single-shot solution and only think about making it faster later, when
the need arises. So perhaps an equally valid approach would be to
first submit only the patches for the single-shot protocol, and submit
the rest of the series later, once we start getting feedback about how
external odbs are used.

And yeah, I could reorder the patch series to implement the
long-running processes first and the single-shot process last, so that
we could get feedback about the long-running processes before deciding
whether or not to merge the single-shot stuff, but I don't think that
would be the most logical order.

> And another possible issue is that we design ourselves into a corner.
> Thinking about the use cases that I know about (the Android use case and
> the Microsoft GVFS use case), I don't think we are doing that - for
> Android, this means that large blob metadata needs to be part of the
> design (and this patch series does provide for that), and for Microsoft
> GVFS, "get" is relatively cheap, so a configuration option to not invoke
> "have" first when loading a missing object might be sufficient.

If the helper does not advertise the "have" capability, the "have"
instruction will not be sent to the helper, so the current design
already handles that case.
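
For illustration, here is a sketch of a helper that supports only
"get_git_obj" (it reuses the calling convention of the odb-helper
script from the test discussed below; how the helper's capabilities
are advertised to Git is not shown here):

  # Hypothetical get-only helper: since it does not advertise "have",
  # Git will only ever invoke it with "get_git_obj".
  write_script get-only-helper <<\EOF
  GIT_DIR=$ALT_SOURCE; export GIT_DIR
  case "$1" in
  get_git_obj)
          cat "$GIT_DIR"/objects/$(echo $2 | sed 's#..#&/#')
          ;;
  esac
  EOF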

> As for the design itself (including fetch and clone), it differs from my
> patches (linked above as [1]) in that mine is self-contained (requiring
> only an updated Git server and Git client) whereas this, as far as I can
> tell, requires an external process and some measure of coordination
> between the administrator of the server and the client user (for
> example, the client must have the same ODB mechanism as the server, if
> not, the server might omit certain blobs that the client does not know
> how to fetch).

Yeah, your design is more self-contained, but it doesn't handle as
many use cases.

> And I think that my design can be extended to support a use case in
> which, for example, blobs corresponding to a certain type of filename
> (defined by a glob like in gitattributes) can be excluded during
> fetch/clone, much like --blob-max-bytes, and they can be fetched either
> through the built-in mechanism or through a custom hook.

Sure, we could probably rebuild something equivalent to what I did on
top of your design.
My opinion, though, is that if we want to eventually get to the same
goal, it is better to first merge something that gets us very close to
the end goal and then add improvements on top of it.

> For those reasons, I still lean towards my design, but if we do want to
> go with this design, here are my comments about this patch...
>
> First of all:
>  - You'll probably need to add a repository extension.

I am ok with adding one if it is needed, but I am not sure it is, as
the repository format does not really change.
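
For the record, if one turned out to be needed, it would use the usual
extensions mechanism, something like this (where "externalOdb" is just
a placeholder name for illustration, not something the series
defines):

  # Repository extensions require repositoryformatversion 1.
  git config core.repositoryformatversion 1
  git config extensions.externalOdb true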

>  - I get compile errors when I "git am" these onto master. I think
>    '#include "config.h"' is needed in some places.

That's strange, because I get no compile errors on my branch, even
after a "make clean".
Could you show the actual errors?

> On Sat, 16 Sep 2017 10:07:00 +0200
> Christian Couder <christian.couder@xxxxxxxxx> wrote:
>
>> The external-odb.{c,h} files contains the functions that are
>> called by the rest of Git from "sha1_file.c".
>>
>> The odb-helper.{c,h} files contains the functions to
>> actually implement communication with the external scripts or
>> processes that will manage external git objects.
>>
>> For now only script mode is supported, and only the 'have' and
>> 'get_git_obj' instructions are supported.
>
> This "have", as I see from this commit, is more like a "list" command in
> that it lists all hashes that it knows about, and does not check if a
> given hash exists.

Yes.
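
As the test helper near the end of this mail shows, the output is one
line per object the helper can provide, in "<sha1> <size> <type>"
form. For example (hashes are illustrative):

  $ ./odb-helper have
  003b2c4d... 191 commit
  0231f2dd... 4 blob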

>> +static struct odb_helper *helpers;
>> +static struct odb_helper **helpers_tail = &helpers;
>
> This could be done with the helpers in list.h instead.

Yeah, but list.h is for a doubly-linked list and I am not sure we need that.

>> +int external_odb_get_object(const unsigned char *sha1)
>> +{
>> +     struct odb_helper *o;
>> +     const char *path;
>> +
>> +     if (!external_odb_has_object(sha1))
>> +             return -1;
>> +
>> +     path = sha1_file_name_alt(external_odb_root(), sha1);
>
> If the purpose of making these functions global in the previous patch is
> just for temporary names, I don't think it's necessary for them to be
> global. Just concatenate the hex SHA1 to external_odb_root()?

In my opinion it is cleaner to make the few functions that are needed
global rather than to add some custom code, even if the custom code is
small.
(And I guess that Peff, who wrote the above, agrees.)

>>  /* Returns 1 if we have successfully freshened the file, 0 otherwise. */
>> @@ -667,7 +684,7 @@ static int check_and_freshen_nonlocal(const unsigned char *sha1, int freshen)
>>               if (check_and_freshen_file(path, freshen))
>>                       return 1;
>>       }
>> -     return 0;
>> +     return external_odb_has_object(sha1);
>>  }
>>
>>  static int check_and_freshen(const unsigned char *sha1, int freshen)
>> @@ -824,6 +841,9 @@ static int stat_sha1_file(const unsigned char *sha1, struct stat *st,
>>                       return 0;
>>       }
>>
>> +     if (!external_odb_get_object(sha1) && !lstat(*path, st))
>> +             return 0;
>> +
>>       return -1;
>>  }
>>
>> @@ -859,7 +879,14 @@ static int open_sha1_file(const unsigned char *sha1, const char **path)
>>       if (fd >= 0)
>>               return fd;
>>
>> -     return open_sha1_file_alt(sha1, path);
>> +     fd = open_sha1_file_alt(sha1, path);
>> +     if (fd >= 0)
>> +             return fd;
>> +
>> +     if (!external_odb_get_object(sha1))
>> +             fd = open_sha1_file_alt(sha1, path);
>> +
>> +     return fd;
>>  }
>
> Any reason why you prefer to update the loose object functions than to
> update the generic one (sha1_object_info_extended)? My concern with just
> updating the loose object functions was that a caller might have
> obtained the path by iterating through the loose object dirs, and in
> that case we shouldn't query the external ODB for anything.

Are you thinking about fsck or gc?
Otherwise, I don't think it would be clean to iterate through the
loose object dirs.

>> +ALT_SOURCE="$PWD/alt-repo/.git"
>> +export ALT_SOURCE
>> +write_script odb-helper <<\EOF
>> +GIT_DIR=$ALT_SOURCE; export GIT_DIR
>> +case "$1" in
>> +have)
>> +     git cat-file --batch-check --batch-all-objects |
>> +     awk '{print $1 " " $3 " " $2}'
>> +     ;;
>> +get_git_obj)
>> +     cat "$GIT_DIR"/objects/$(echo $2 | sed 's#..#&/#')
>> +     ;;
>> +esac
>> +EOF
>> +HELPER="\"$PWD\"/odb-helper"
>
> Thanks for the clear test. It is very obvious that "have" returns a list
> of objects, and "get_git_obj" returns the compressed loose object with
> the Git loose object header included.

Happy that you like how simple helpers can be with single-shot processes ;-)

>> +test_expect_success 'setup alternate repo' '
>> +     git init alt-repo &&
>> +     (cd alt-repo &&
>> +      test_commit one &&
>
> Probably better written as "test_commit -C alt-repo one".

Ok, I will use that.

>> +      test_commit two
>> +     ) &&
>> +     alt_head=`cd alt-repo && git rev-parse HEAD`
>
> I think the style is to use $() and "git -C alt-repo rev-parse HEAD".

Ok, I will change that.
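
So the setup would become something like this (a sketch applying both
of your suggestions):

  test_expect_success 'setup alternate repo' '
          git init alt-repo &&
          test_commit -C alt-repo one &&
          test_commit -C alt-repo two &&
          alt_head=$(git -C alt-repo rev-parse HEAD)
  '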

Thanks,
Christian.


