On 14 Jan 2022, at 7:11, Ævar Arnfjörð Bjarmason wrote: > On Thu, Jan 13 2022, John Cai via GitGitGadget wrote: > >> From: John Cai <johncai86@xxxxxxxxx> >> >> swap out oid_array for oidset in promisor-remote.c since we get >> a deduplicated set of oids to operate on. >> >> swap our calls for oid_array for oidset in callers of >> promisor_remote_get_direct(). >> >> Helped-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> > > FWIW we had an off-list discussion about oidset v.s. a custom hashmap to > do the same off-list, not about s/oid_array/oidset/ in this area. > >> Signed-off-by: John Cai <johncai86@xxxxxxxxx> >> --- >> promisor-remote.c: use oidset for deduplication >> >> > > Extra \n between $subject and $body. > >> swap out oid_array for oidset in promisor-remote.c since we get a > > Nit: s/swap/Swap/, i.e. start the sentence with a capital letter... > >> deduplicated set of oids to operate on. Any direct fetch we can save >> will be well worth it. >> >> oidset does use a slightly larger memory footprint than oid_array, so >> this change is a tradeoff between memory footprint and network calls. >> >> What I'm not sure about is if it's worth swapping out all the calls of >> oid_array for oidset in callers of promisor_remote_get_direct(). > > From what I understand of GGG I think you updated only the summary on > the PR but not the commit itself, the latter is what would go into > git.git. Here the commit should be updated so we get this message. > > The part after the "---" is usually just used in this project for ad-hoc > list-only comment. > >> builtin/index-pack.c | 9 +++--- >> builtin/pack-objects.c | 9 +++--- >> diff.c | 13 +++----- >> diffcore-rename.c | 12 +++---- >> diffcore.h | 3 +- >> merge-ort.c | 8 ++--- >> object-file.c | 4 ++- >> promisor-remote.c | 72 ++++++++++++++---------------------------- >> promisor-remote.h | 6 ++-- >> read-cache.c | 9 +++--- >> 10 files changed, 57 insertions(+), 88 deletions(-) > > Rather than inline comments, a comment that applies to all of these: > > The difference between these two APIs is thaht oidset is hash-backed, > and you'd insert into it and we de-duplicate any duplicate OIDs on-the-fly. > > The oid_array is just an realloc()'d "struct object_id *oid". On > insertion you can insert duplicates, but it has the ability to track > "I've sorted these", and "let's iterate over this sorted, and de-dup any > duplicates". > > We have the two APIs for a reason, but I don't know in any of these > cases whether this change is safe. To reduce the footprint of this change, maybe we can just keep oid_array and sort it in diff.c before passing it off to promisor_remote_get_direct(), in which we then iterate over unique entries. > > Does e.g. index-pack.c always receive de-duplicated OIDs and we were > wasting CPU cycles using an oidset? Good question. In fact a more immediate question is how often does diff.c receive dupe oids, which was the original motivation for the change. cc’ing Jonathan as he was involved with 95acf11a that added the NEEDSWORK block. Jonathan, how often do we expect diff to pass duplicate oids to promisor_remote_get_direct()? > > Do some of these like pack-objects.c receive de-duplicated OIDs from > e.g. "git repack" *now*, but we just lack test coverage to see that > they're happy to get duplicate OIDs on stdin (e.g. manually from a > user), and this would introduce a bug? > > But most importantly is it worth it? What's the rationale for the > change? Less CPU/memory use? Getting e.g. "hyperfine" or "/usr/bin/time > -v" output for those (if so) would be valuable. I agree this would be very valuable, but I’m thinking more so once I get a better idea of what kind of situations would lead to duplicate oids in diff.c