Re: RFC: Design and code of partial clones (now, missing commits and trees OK) (part 2/3)

On 9/21/2017 6:51 PM, Jonathan Tan wrote:
> On Thu, 21 Sep 2017 13:59:43 -0400
> Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> wrote:

>> (part 2)
>>
>> Additional overall comments on:
>> https://github.com/jonathantanmy/git/commits/partialclone2
>>
>> {} I think it would help to split the blob-max-bytes filtering and
>>    the promisor/promised concepts and discuss them independently.
>>
>>    {} Then we can talk about the promisor/promised functionality
>>       independent of any kind of filter.  The net-net is that the
>>       client has missing objects, and it doesn't matter what filter
>>       criteria or mechanism caused them to be missing.
>>
>>    {} blob-max-bytes is but one such filter we should have.  It
>>       might be sufficient if the goal is to replace LFS (where you
>>       rarely ever need any given very large object), dynamically
>>       loading them as needed is acceptable, and the network
>>       round-trip isn't too much of a perf penalty.
>>
>>    {} But if we want to do things like "sparse enlistments", where
>>       the client only needs a small part of the tree via
>>       sparse-checkout (for example, populating only 50,000 files
>>       from a tree of 3.5M files at HEAD), then we need a more
>>       general filtering mechanism.
>>
>>    {} And as I said above, how we choose to filter should be
>>       independent of how the client handles promisor/promised
>>       objects.
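
To make the "more general filtering" bullet concrete, here is
roughly the interface shape I have in mind.  (The spellings are
illustrative only: --blob-max-bytes is what our series use today,
and the sparse form below is purely hypothetical.)

    # Size-based filter: enumerate reachable objects, omitting
    # blobs larger than 1 MB.
    git rev-list --objects --blob-max-bytes=1048576 HEAD

    # Hypothetical sparse filter: omit blobs (and whole subtrees)
    # that a sparse-checkout specification would not populate.
    git rev-list --objects --filter-sparse=<spec> HEAD

Either way, the downstream promisor/promised machinery sees the
same thing: some objects are missing, for whatever reason.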

> I agree that they are independent. (I put everything together so that
> people could see how they work together, but they can be changed
> independently of each other.)

>> {} Also, if we rely strictly on dynamic object fetching to fetch
>>    missing objects, we are effectively tethered to the server
>>    during operations (such as checkout) that the user might not
>>    think of as requiring a network connection.  And we inherit the
>>    same limitation as LFS: you can't prefetch and go offline
>>    (without actually checking out to your worktree first).  And we
>>    can't bulk- or parallel-fetch objects.

> I don't think dynamic object fetching precludes any other more
> optimized way of fetching or prefetching - I implemented dynamic
> object fetching first so that we would have a fallback, but the
> others definitely can be implemented too.

Yes, we need that as a fallback/default for the odd cases where we
can't predict perfectly, such as during a blame, a history walk, or
a merge.

I didn't mean to say we didn't need it, but rather that we should
try to minimize how often we have to rely on it.
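
For reference, here is the on-demand behavior I'm talking about,
end to end.  (The clone option spelling is illustrative and assumes
a finished partial-clone implementation, not what either of our
series has today.)

    # Clone, omitting blobs over 1 MB; the server promises to
    # serve them later on demand (hypothetical clone option).
    git clone --blob-max-bytes=1048576 https://example.com/repo.git
    cd repo

    # A blame may touch historical blobs we never fetched; each
    # one comes back via a dynamic fetch, one round-trip at a time.
    git blame -- some/large/file.bin

Every one of those round-trips is a place where a smarter bulk
prefetch could have helped, which is all I'm arguing for.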


>> {} I think it would also help to move the blob-max-bytes
>>    calculation out of pack-objects.c : add_object_entry() [1].
>>    The current code isolates the computation there so that only
>>    pack-objects can do the filtering.
>>
>>    Instead, put it in list-objects.c and traverse_commit_list() so
>>    that pack-objects and rev-list can share it (as Peff suggested
>>    [2] in response to my first patch series in March).
>>
>>    For example, this would let the client have a pre-checkout
>>    hook, use rev-list to compute the set of missing objects needed
>>    for that commit, and pipe that to a command to BULK fetch them
>>    from the server BEFORE starting the actual checkout.  This
>>    would allow the savvy user to manually run a prefetch before
>>    going offline.
>>
>> [1] https://github.com/jonathantanmy/git/commit/68e529484169f4800115c5a32e0904c25ad14bd8#diff-a8d2c9cf879e775d748056cfed48440cR1110
>>
>> [2] https://public-inbox.org/git/20170309073117.g3br5btsfwntcdpe@xxxxxxxxxxxxxxxxxxxxx/

> In your specific example, how would rev-list know, on the client, to
> include (or exclude) a large blob in its output if it does not have
> it, and thus does not know its size?


The client doesn't have the size.  It just knows the object is
missing and that it needs it; it doesn't matter why it is missing.
(But I guess the client could assume it is because it is large.)

So rev-list on the client could filter the objects it does have by
size and simply report the missing ones as missing.

I added that to rev-list primarily to demonstrate and debug the
filtering concept (it's easier than playing with packfiles).  But
it can also be used to drive client-side queries and bulk requests,
for example:
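
(A sketch of the pre-checkout prefetch I described above.  It
assumes rev-list can report missing objects, e.g. a
"--missing=print" style option that prefixes missing OIDs with "?",
and a server willing to let fetch-pack request arbitrary OIDs, e.g.
via uploadpack.allowAnySHA1InWant; neither detail is settled yet.)

    #!/bin/sh
    # Bulk-prefetch the objects needed for a checkout while we are
    # still online, instead of faulting them in one at a time.
    commit=$1
    url=$2   # repository URL the promises came from

    git rev-list --objects --missing=print "$commit" |
    sed -n 's/^?//p' |             # keep only the missing OIDs
    git fetch-pack --stdin "$url"  # fetch them in one round-trip

    git checkout "$commit"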

> My reason for including it in pack-objects.c is because I only needed
> it there and it is much simpler, but I agree that if it can be used
> elsewhere, we can put it in a more general place.

>> {} This also locks us into size-only filtering and makes it more
>>    difficult to add other filters.  The add_object_entry() code
>>    gets called on an object only after the traversal has already
>>    decided what to do with it, so it would be difficult to add
>>    tree-trimming at this level, for example.

> That is true.
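
To illustrate why the traversal is the right layer: a tree-trimming
filter has to stop the walk from ever descending into an unwanted
subtree, so the decision must happen inside traverse_commit_list(),
before add_object_entry() sees any of that subtree's objects.
(The option below is hypothetical; nothing like it exists in either
series yet.)

    # Walk HEAD but skip everything below unwanted directories;
    # the trimmed trees and blobs are never enumerated at all, so
    # pack-objects never even sees them (hypothetical option).
    git rev-list --objects --filter-trees=<pathspec> HEAD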

>> {} An early draft of this type of filtering is here [3].  I hope
>>    to push up a revised draft of this shortly.
>>
>> [3] https://public-inbox.org/git/20170713173459.3559-1-git@xxxxxxxxxxxxxxxxx/

> OK - I'll take a look when that is done (I think I commented on an
> earlier version of that).


FYI, I just posted my RFC this afternoon:
https://public-inbox.org/git/20170922204211.GA24036@xxxxxxxxxx/T/


Thanks
Jeff



