Re: [PATCH] promisor-remote.c: use oidset for deduplication

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Fri, 14 Jan 2022 13:11:57 +0100

On Thu, Jan 13 2022, John Cai via GitGitGadget wrote:

> From: John Cai <johncai86@xxxxxxxxx>
>
> swap out oid_array for oidset in promisor-remote.c since we get
> a deduplicated set of oids to operate on.
>
> swap our calls for oid_array for oidset in callers of
> promisor_remote_get_direct().
>
> Helped-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>

FWIW we had an off-list discussion about oidset v.s. a custom hashmap to
do the same off-list, not about s/oid_array/oidset/ in this area.

> Signed-off-by: John Cai <johncai86@xxxxxxxxx>
> ---
>     promisor-remote.c: use oidset for deduplication
>     
>     

Extra \n between $subject and $body.

>     swap out oid_array for oidset in promisor-remote.c since we get a

Nit: s/swap/Swap/, i.e. start the sentence with a capital letter...

>     deduplicated set of oids to operate on. Any direct fetch we can save
>     will be well worth it.
>     
>     oidset does use a slightly larger memory footprint than oid_array, so
>     this change is a tradeoff between memory footprint and network calls.
>     
>     What I'm not sure about is if it's worth swapping out all the calls of
>     oid_array for oidset in callers of promisor_remote_get_direct().

>From what I understand of GGG I think you updated only the summary on
the PR but not the commit itself, the latter is what would go into
git.git. Here the commit should be updated so we get this message.

The part after the "---" is usually just used in this project for ad-hoc
list-only comment.

>  builtin/index-pack.c   |  9 +++---
>  builtin/pack-objects.c |  9 +++---
>  diff.c                 | 13 +++-----
>  diffcore-rename.c      | 12 +++----
>  diffcore.h             |  3 +-
>  merge-ort.c            |  8 ++---
>  object-file.c          |  4 ++-
>  promisor-remote.c      | 72 ++++++++++++++----------------------------
>  promisor-remote.h      |  6 ++--
>  read-cache.c           |  9 +++---
>  10 files changed, 57 insertions(+), 88 deletions(-)

Rather than inline comments, a comment that applies to all of these:

The difference between these two APIs is thaht oidset is hash-backed,
and you'd insert into it and we de-duplicate any duplicate OIDs on-the-fly.

The oid_array is just an realloc()'d "struct object_id *oid". On
insertion you can insert duplicates, but it has the ability to track
"I've sorted these", and "let's iterate over this sorted, and de-dup any
duplicates".

We have the two APIs for a reason, but I don't know in any of these
cases whether this change is safe.

Does e.g. index-pack.c always receive de-duplicated OIDs and we were
wasting CPU cycles using an oidset?

Do some of these like pack-objects.c receive de-duplicated OIDs from
e.g. "git repack" *now*, but we just lack test coverage to see that
they're happy to get duplicate OIDs on stdin (e.g. manually from a
user), and this would introduce a bug?

But most importantly is it worth it? What's the rationale for the
change? Less CPU/memory use? Getting e.g. "hyperfine" or "/usr/bin/time
-v" output for those (if so) would be valuable.