Re: fast-import's hash table is slow

Jeff King <peff@xxxxxxxx> · Fri, 3 Apr 2020 08:14:17 -0400

On Thu, Apr 02, 2020 at 08:40:58PM +0200, René Scharfe wrote:

> >> Storing the objects directly in the set and getting rid of new_object()
> >> could improve performance due to reduced dereferencing overhead -- or
> >> make it slower because more data has to be copied when the hashmap needs
> >> to grow.  Worth a shot.  Later.
> >
> > Yeah. I tried that, too, but it caused tons of test failures. Quite
> > possibly just a bug in my patch, which I did as quickly as possible. But
> > I think it performed about the same. Here it is for reference (though
> > you may be better off to start from scratch).
> 
> Tried that earlier, ran into failures as well.  I think the pointer
> returned from insert_object() is stored somewhere and still used after
> the next call, which could have invalidated it by a rehash.  E.g.
> insert_mark() seems to do that.

That doesn't surprise me. I didn't look very hard, but mostly just did
the minimum to see if it would work (and it didn't).

It could perhaps be overcome, and I certainly don't mind if you want to
dig further, but I'm happy enough with the hashmap solution.

> > Note the "this is OK to cast from oid to object_entry" comment is
> > leftover from the earlier attempt, but it probably _isn't_ OK. Even
> > though we don't do anything actionable on the non-oid bytes, they do get
> > passed by value, which could mean reading random bytes.
> 
> "Value" meaning pointer value when KHASH_INIT is give a pointer type,
> as was the case in the patch with that comment (unlike in the patch
> below).  So it should be OK there.

Right, I meant the comment no longer applies in the patch below, because
we're not using a pointer type anymore.

> > -	for (o = blocks; o; o = o->next_pool)
> > -		for (e = o->next_free; e-- != o->entries;)
> > +	for (iter = kh_begin(&object_table); iter != kh_end(&object_table); iter++) {
> > +		if (kh_exist(&object_table, iter)) {
> > +			e = &kh_key(&object_table, iter);
> >  			if (pack_id == e->pack_id)
> >  				*c++ = &e->idx;
> > +		}
> > +	}
> 
> The original code writes the objects in reverse order of their creation
> (LIFO), right?  Is that relevant?

Yeah, agreed this is another weakness of this approach.

> But anyway,  we need stable object locations anyway if their addresses are
> stored in other structs, as mentioned above.  Those pointers would have to
> be replaced by object_ids and pointer derefs by hashmap lookups.  Not a
> promising direction.

Agreed.

-Peff