On 6/14/2022 12:33 PM, Richard Oliver wrote:
> On 14/06/2022 15:14, Derrick Stolee wrote:
>> On 6/14/2022 9:36 AM, Richard Oliver wrote:
>>> Do not use oid_object_info() to determine object type in mktree_line()
>>> as this can cause promised objects to be dynamically faulted-in one at a
>>> time, which has poor performance. Instead, use a combination of
>>> oid_object_info_extended() (with the OBJECT_INFO_SKIP_FETCH_OBJECT
>>> option) and the newly introduced promisor_object_type() to determine
>>> object type before defaulting to fetching from the remote.
>>
>> Have you run some performance tests on this? It seems like this will
>> scan all packed objects, which is probably much slower than just asking
>> the remote for the object in most cases.
>>
>> Thanks,
>> -Stolee
>
> Hi Stolee,
>
> I've put together a synthetic experiment below (adding a new blob to an
> existing tree) to show you the behaviour that we've been seeing. Our
> actual use-case (where we first encountered this behaviour) is updating
> submodules to a known hash. As you can see, the round-trip time of
> fetching objects one-by-one is very expensive.
>
> Before, using system git (git version 2.32.0 (Apple Git-132)):
>
>> $ git init
>> # Fetch a recent tree
>> $ git fetch --filter=tree:0 --depth 1 https://github.com/git/git cdb48006b0ec7fe19794daf7b5363ab42d9d9371
>> remote: Enumerating objects: 1, done.
>> remote: Counting objects: 100% (1/1), done.
>> remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
>> Receiving objects: 100% (1/1), 13.12 KiB | 13.12 MiB/s, done.
>> From https://github.com/git/git
>>  * branch            cdb48006b0ec7fe19794daf7b5363ab42d9d9371 -> FETCH_HEAD
>>
>> $ NEW_BLOB=$(echo zzz | git hash-object --stdin -w)
>>
>> $ cat <(git ls-tree FETCH_HEAD) <(printf "100644 blob ${NEW_BLOB}\tzzz") | time git mktree
>> remote: Enumerating objects: 1, done.
>> remote: Counting objects: 100% (1/1), done.
>> remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
>> Receiving objects: 100% (1/1), 334 bytes | 334.00 KiB/s, done.
>> remote: Enumerating objects: 1, done.
>> remote: Counting objects: 100% (1/1), done.
>> remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
>> Receiving objects: 100% (1/1), 2.01 KiB | 2.01 MiB/s, done.
>> remote: Enumerating objects: 1, done.
>> remote: Counting objects: 100% (1/1), done.
>> remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
>> Receiving objects: 100% (1/1), 256 bytes | 256.00 KiB/s, done.
>> # ...
>> # SOME TIME LATER
>> # ...
>> e26c7ce7357b1649da7b4200d4e80d0b668db8d4
>>       286.49 real        15.66 user        15.59 sys

I see. The problem here is that we are making _many requests_ for the
same tree, so maybe it would be better to introduce a batched download
for the list of missing objects. This would require iterating over the
objects for the tree to check existence (in quick mode), adding the
missing ones to a list, and then requesting that set all together in a
single request.

That probably won't be as fast as your modified mktree experiment below,
but it would match my expectations of "probably faster, assuming the
repo is big enough".

> Repeated experiment, but using modified mktree:
>
>> $ cat <(git ls-tree FETCH_HEAD) <(printf "100644 blob ${NEW_BLOB}\tzzz") | time git mktree
>> e26c7ce7357b1649da7b4200d4e80d0b668db8d4
>>         0.06 real         0.01 user         0.03 sys
>
> Did you have any other sort of performance test in mind? The remotes we
> typically deal with are geographically far away and deal with a high
> volume of traffic, so we're keen to move behaviour to the client where
> it makes sense to do so.

I guess I wonder how large your promisor pack-files are in this test,
since your implementation depends on for_each_packed_object(), which
should be really inefficient if you're actually dealing with a large
partial clone.

Thanks,
-Stolee
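For readers following along: the batched-download idea above could be
sketched roughly as below. This is a hypothetical model in Python, not
git's actual C code; the names batched_prefetch, local_store, and
fetch_batch are made up for illustration. The local existence check
stands in for oid_object_info_extended() with
OBJECT_INFO_SKIP_FETCH_OBJECT, and fetch_batch models one round-trip to
the promisor remote.

```python
# Hypothetical sketch (not git source): check each object referenced by
# the tree locally first, collect the misses, then issue ONE request for
# the whole set instead of one lazy fetch per missing object.

def batched_prefetch(entries, local_store, fetch_batch):
    """entries: object ids referenced by the tree being built.
    local_store: set of object ids already available locally.
    fetch_batch: callable taking a list of ids; models a single
    round-trip to the promisor remote."""
    missing = [oid for oid in entries if oid not in local_store]
    if missing:
        fetch_batch(missing)          # one round-trip for all misses
        local_store.update(missing)   # objects are now available locally
    return len(missing)

# Usage: three entries, one already present locally, so the two misses
# are fetched in a single batched request rather than two round-trips.
requests = []
store = {"aaa"}
n = batched_prefetch(["aaa", "bbb", "ccc"], store, requests.append)
```

The point of the sketch is only the shape of the control flow: N missing
objects cost one network round-trip instead of N, which is why it should
beat the one-by-one faulting shown in the 286-second run above even
though it still contacts the remote.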