Re: RFC: Design and code of partial clones (now, missing commits and trees OK) (part 2/3)

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Thu, 21 Sep 2017 15:51:50 -0700

On Thu, 21 Sep 2017 13:59:43 -0400
Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> wrote:

> (part 2)
> 
> Additional overall comments on:
> https://github.com/jonathantanmy/git/commits/partialclone2
> 
> {} I think it would help to split the blob-max-bytes filtering and the
>     promisor/promised concepts and discuss them independently.
> 
>     {} Then we can talk about about the promisor/promised
> functionality independent of any kind of filter.  The net-net is that
> the client has missing objects and it doesn't matter what filter
> criteria or mechanism caused that to happened.
> 
>     {} blob-max-bytes is but one such filter we should have.  This
> might be sufficient if the goal is replace LFS (where you rarely ever
>        need any given very very large object) and dynamically loading
>        them as needed is sufficient and the network round-trip isn't
>        too much of a perf penalty.
> 
>     {} But if we want to do things like a "sparse-enlistments" where
> the client only needs a small part of the tree using sparse-checkout.
>        For example, only populating 50,000 files from a tree of 3.5M
> files at HEAD, then we need a more general filtering.
> 
>     {} And as I said above, how we chose to filter should be
> independent of how the client handles promisor/promised objects.

I agree that they are independent. (I put everything together so that
people could see how they work together, but they can be changed
independently of each other.)

> {} Also, if we rely strictly on dynamic object fetching to fetch
> missing objects, we are effectively tethered to the server during
> operations (such as checkout) that the user might not think about as
> requiring a network connection.  And we are forced to keep the same
> limitations of LFS in that you can't prefetch and go offline (without
> actually checking out to your worktree first).  And we can't bulk or
> parallel fetch objects.

I don't think dynamic object fetching precludes any other more optimized
way of fetching or prefetching - I implemented dynamic object fetching
first so that we would have a fallback, but the others definitely can be
implemented too.

> {} I think it would also help to move the blob-max-bytes calculation
> out of pack-objects.c : add_object_entry() [1].  The current code
> isolates the computation there so that only pack-objects can do the
> filtering.
> 
>     Instead, put it in list-objects.c and traverse_commit_list() so
> that pack-objects and rev-list can share it (as Peff suggested [2] in
>     response to my first patch series in March).
> 
>     For example, this would let the client have a pre-checkout hook,
> use rev-list to compute the set of missing objects needed for that
> commit, and pipe that to a command to BULK fetch them from the server
> BEFORE starting the actual checkout.  This would allow the savy user
> to manually run a prefetch before going offline.
> 
> [1]
> https://github.com/jonathantanmy/git/commit/68e529484169f4800115c5a32e0904c25ad14bd8#diff-a8d2c9cf879e775d748056cfed48440cR1110
> 
> [2]
> https://public-inbox.org/git/20170309073117.g3br5btsfwntcdpe@xxxxxxxxxxxxxxxxxxxxx/

In your specific example, how would rev-list know, on the client, to
include (or exclude) a large blob in its output if it does not have it,
and thus does not know its size?

My reason for including it in pack-objects.c is because I only needed it
there and it is much simpler, but I agree that if it can be used
elsewhere, we can put it in a more general place.

> {} This also locks us into size-only filtering and makes it more
>     difficult to add other filters.  In that the add_object_entry()
>     code gets called on an object after the traversal has decided
>     what to do with it.  It would be difficult to add tree-trimming
>     at this level, for example.

That is true.

> {} An early draft of this type of filtering is here [3].  I hope to
> push up a revised draft of this shortly.
> 
> [3]
> https://public-inbox.org/git/20170713173459.3559-1-git@xxxxxxxxxxxxxxxxx/

OK - I'll take a look when that is done (I think I commented on an
earlier version on that).