René Scharfe <l.s.r@xxxxxx> wrote:
> On 29.06.21 at 22:53, Eric Wong wrote:
> > --- a/alloc.c
> > +++ b/alloc.c
> > @@ -14,6 +14,7 @@
> >  #include "tree.h"
> >  #include "commit.h"
> >  #include "tag.h"
> > +#include "oidtree.h"
> >  #include "alloc.h"
> >
> >  #define BLOCKING 1024
> > @@ -123,6 +124,11 @@ void *alloc_commit_node(struct repository *r)
> >  	return c;
> >  }
> >
> > +void *alloc_from_state(struct alloc_state *alloc_state, size_t n)
> > +{
> > +	return alloc_node(alloc_state, n);
> > +}
> > +
>
> Why extend alloc.c instead of using mem-pool.c?  (I don't know which
> fits better, but when you say "memory pool" and don't use mem-pool.c
> I just have to ask..)

I didn't know mem-pool.c existed :x  (and I've always known about
alloc.c).  Perhaps we could merge them in another series to avoid
further confusion.

> > +void oidtree_insert(struct oidtree *ot, const struct object_id *oid)
> > +{
> > +	struct oidtree_node *on;
> > +
> > +	if (!ot->mempool)
> > +		ot->mempool = allocate_alloc_state();
> > +	if (!oid->algo)
> > +		BUG("oidtree_insert requires oid->algo");
> > +
> > +	on = alloc_from_state(ot->mempool, sizeof(*on) + sizeof(*oid));
> > +	oidcpy_with_padding((struct object_id *)on->n.k, oid);
> > +
> > +	/*
> > +	 * n.b. we shouldn't get duplicates, here, but we'll have
> > +	 * a small leak that won't be freed until oidtree_destroy
> > +	 */
>
> Why shouldn't we get duplicates?  That depends on the usage of oidtree,
> right?  The current user is fine because we avoid reading the same
> loose object directory twice using the loose_objects_subdir_seen
> bitmap.

Yes, it reflects the current caller.

> The leak comes from the allocation above, which is not used in case we
> already have the key in the oidtree.  So we need memory for all
> candidates, not just the inserted candidates.  That's probably
> acceptable in most use cases.

Yes, I think the small, impossible-due-to-current-usage leak is an
acceptable trade-off.

> We can do better by keeping track of the unnecessary allocation in
> struct oidtree and recycling it at the next insert attempt, however.
> That way we'd only waste at most one slot.

It'd involve maintaining a free list, which may be better suited to
living in alloc_state or mem_pool.  That would also increase the size
of a struct *somewhere* and add a small amount of code complexity.

> > +	cb_insert(&ot->t, &on->n, sizeof(*oid));
> > +}
> > +
> > +int oidtree_contains(struct oidtree *ot, const struct object_id *oid)
> > +{
> > +	struct object_id k = { 0 };
> > +	size_t klen = sizeof(k);
> > +	oidcpy_with_padding(&k, oid);
>
> Why initialize k; isn't oidcpy_with_padding() supposed to overwrite it
> completely?

Ah, I only added oidcpy_with_padding later in the development of this
patch, so the "{ 0 }" initialization is a leftover and can be dropped.

> > +
> > +	if (oid->algo == GIT_HASH_UNKNOWN) {
> > +		k.algo = hash_algo_by_ptr(the_hash_algo);
> > +		klen -= sizeof(oid->algo);
> > +	}
>
> This relies on the order of the members hash and algo in struct
> object_id to find a matching hash if we don't actually know algo.  It
> also relies on the absence of padding after algo.  Would something
> like this make sense?
>
> 	BUILD_ASSERT_OR_ZERO(offsetof(struct object_id, algo) + sizeof(k.algo) == sizeof(k));

Maybe...  I think a static assertion that object_id.hash is the first
member of "struct object_id" is definitely needed, at least.

> And why set k.algo to some arbitrary value if we ignore it anyway?
> I.e. why not keep it GIT_HASH_UNKNOWN, as set by oidcpy_with_padding()?

Good point, shortening klen would've been all that was needed.
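Roughly what I have in mind (untested sketch, only the key-setup part
of oidtree_contains; the actual critbit lookup using &k and klen stays
as it is in the patch, and BUILD_ASSERT_OR_ZERO/offsetof come from
git-compat-util.h as in your suggestion):

	struct object_id k;
	size_t klen = sizeof(k);

	/*
	 * comparing &k as a raw byte string relies on the hash bytes
	 * being the first member of struct object_id
	 */
	klen += BUILD_ASSERT_OR_ZERO(offsetof(struct object_id, hash) == 0);

	oidcpy_with_padding(&k, oid);
	if (oid->algo == GIT_HASH_UNKNOWN)
		/*
		 * keep k.algo as GIT_HASH_UNKNOWN (what
		 * oidcpy_with_padding left there) and just exclude
		 * the algo field from the key comparison
		 */
		klen -= sizeof(oid->algo);

Folding the assertion into the klen arithmetic keeps it from being a
statement with no effect; a standalone assert would work just as well
if we don't mind the extra line.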
> > +void oidtree_each(struct oidtree *ot, const struct object_id *oid,
> > +		  size_t oidhexlen, oidtree_iter fn, void *arg)
> > +{
> > +	size_t klen = oidhexlen / 2;
> > +	struct oidtree_iter_data x = { 0 };
> > +
> > +	x.fn = fn;
> > +	x.arg = arg;
> > +	x.algo = oid->algo;
> > +	if (oidhexlen & 1) {
> > +		x.last_byte = oid->hash[klen];
> > +		x.last_nibble_at = &klen;
> > +	}
> > +	cb_each(&ot->t, (const uint8_t *)oid, klen, iter, &x);
> > +}
>
> Clamp oidhexlen at GIT_MAX_HEXSZ?  Or die?

I think an assertion would be enough here;
init_object_disambiguation already clamps to the_hash_algo->hexsz.
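That is, something along these lines at the top of oidtree_each()
(untested sketch; BUG() is how we usually spell assertions on internal
invariants in this codebase):

	if (oidhexlen > GIT_MAX_HEXSZ)
		BUG("oidhexlen too large: %"PRIuMAX" (max: %d)",
		    (uintmax_t)oidhexlen, GIT_MAX_HEXSZ);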