On Mon, Nov 22, 2010 at 11:53 PM, Jonathan Nieder <jrnieder@xxxxxxxxx> wrote:
> More often than not, find_object is called for recently inserted
> objects. Optimise for this case by inserting new entries at the
> start of the chain. This doesn't affect the cost of new inserts but
> reduces the cost of find and insert for existing object entries.

This is a good change. I totally agree with it.

> I would guess other workflows would display the same tendency if
> any; at least, the same blob is likely to come up repeatedly in a
> few revs and then not be mentioned for a while. Though maybe there
> is some reason to make the opposite assumption that I have not
> thought of. Shawn?

The new entry was put onto the end of the hash chain because I wasn't
thinking when I wrote that code. For some reason I assumed it made
sense to put it onto the end of the chain: I had already walked the
chain to see if the object existed, and by the time I reached the end
I had the tail, so I might as well hang the new item off it.

Moving the new item to the head works because the new item is fairly
likely to be asked for again in the near future, while older items
are less likely to be. So I can see how this offers a small CPU time
improvement for a very big import.

> A more dramatic improvement comes from increasing the size of the
> hash table; in particular, David noticed it allows the CPU usage to
> increase from ~60% to 100% of one core. So presumably we should
> make the size of the hash table dynamic.

Uhm. Wait, the table isn't already dynamic? I cheated that badly and
made the table a fixed size?

I can tell you why it's fixed. When I wrote fast-import to support
the Mozilla repository import, we had an estimate of the number of
objects fast-import needed to handle in a given run. From that
estimate we concluded that a table of 2^16 entries was sufficiently
large, as the hash chains would only be some small N long given the
total number of objects we needed to handle for Mozilla. Doubling
that to 2^17 or larger wasn't useful, and a smaller table like 2^14
produced chains that were too long. Once this code was working we
moved on to other features and never reconsidered the table size.

This table should grow if the initial 2^16 isn't big enough. It's
just that nobody has ever noticed that I hardcoded the size. :-)

> Other aspects to investigate: choice of hash function;

Why? SHA-1 is pretty uniform in its distribution. The object_table is
taking the first 16 bits and using them to index the object; that's
the first 4 hex digits. The Linux kernel uniquely identifies an
object with under 10 hex digits, or 40 leading bits.

The table just needs to double or quadruple its size (and have the
hash function take in that many more bits) when an arbitrary bucket's
chain length gets over some reasonable constant (e.g. 16). We
probably do need to cap the table size though, especially on 32-bit
builds of Git. There is a rough sketch of what I mean below.

> is it worth moving accessed entries to the front of hash chains
> when they already exist?

Maybe, maybe not? It depends on what the front end is doing. If the
front end is feeding hex SHA-1s and frequently reuses the same blobs,
yes. Otherwise, maybe not.

The only times we look for an object in the object_table are when we
are about to insert an object and need to avoid a duplicate (that's
the insert_object code path), and when we have to page a tree into
the branch cache.
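To make that concrete, here is a rough, untested sketch of head
insertion plus growing the table when a chain gets too long. The
names and layout are made up for illustration; this is not the real
fast-import code:

  /*
   * Rough sketch, not the real fast-import code: names and layout
   * are invented. New entries go on the head of the chain, and the
   * table doubles when any chain gets too long.
   */
  #include <stdlib.h>
  #include <string.h>

  #define MAX_CHAIN 16	/* grow when a chain gets longer than this */
  #define MAX_BITS  27	/* cap the table, especially for 32-bit builds */

  struct object_entry {
  	struct object_entry *next;
  	unsigned char sha1[20];
  };

  static struct object_entry **table;
  static unsigned int table_bits = 16;	/* start at 2^16 buckets */

  /* SHA-1 is already uniform, so the leading bits make a fine index. */
  static unsigned int bucket_of(const unsigned char *sha1, unsigned int bits)
  {
  	unsigned int h = ((unsigned int)sha1[0] << 24)
  		       | ((unsigned int)sha1[1] << 16)
  		       | ((unsigned int)sha1[2] << 8)
  		       | sha1[3];
  	return h >> (32 - bits);
  }

  static void grow_table(void)
  {
  	unsigned int new_bits = table_bits + 1;
  	struct object_entry **new_table;
  	size_t i;

  	if (new_bits > MAX_BITS)
  		return;
  	new_table = calloc((size_t)1 << new_bits, sizeof(*new_table));
  	if (!new_table)
  		return;	/* out of memory: live with the longer chains */

  	/* Rehash everything; head insertion makes this O(1) per entry. */
  	for (i = 0; i < ((size_t)1 << table_bits); i++) {
  		struct object_entry *e = table[i];
  		while (e) {
  			struct object_entry *next = e->next;
  			unsigned int b = bucket_of(e->sha1, new_bits);
  			e->next = new_table[b];
  			new_table[b] = e;
  			e = next;
  		}
  	}
  	free(table);
  	table = new_table;
  	table_bits = new_bits;
  }

  static struct object_entry *insert_object(const unsigned char *sha1)
  {
  	unsigned int b, depth = 0;
  	struct object_entry *e;

  	if (!table)
  		table = calloc((size_t)1 << table_bits, sizeof(*table));
  	b = bucket_of(sha1, table_bits);

  	for (e = table[b]; e; e = e->next, depth++)
  		if (!memcmp(sha1, e->sha1, 20))
  			return e;	/* recent inserts hit early here */

  	/* New entry goes on the head: it is the entry most likely
  	 * to be asked for again soon. */
  	e = calloc(1, sizeof(*e));
  	memcpy(e->sha1, sha1, 20);
  	e->next = table[b];
  	table[b] = e;

  	if (depth >= MAX_CHAIN)
  		grow_table();
  	return e;
  }

Growing on chain length rather than on a total object count keeps the
trigger cheap, since insert_object is walking the chain anyway.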
If the branch cache is spilling, you probably should be increasing
its size so that it doesn't spill as often.

The code within store_tree() is really inefficient. It looks up the
old tree to find out how it is stored... we could store that struct
object_entry* inside of struct tree_content. We could also avoid
calling store_object() if the tree hasn't been modified. The way I
read the store_tree() code, every subdirectory is recursed into even
if no modifications were made inside of that subdirectory during the
current commit. We just need to format the tree, and if the SHA-1 is
the same as the old SHA-1, skip back out without calling
store_object(). That should give you a pretty dramatic reduction in
the number of times the object_table is accessed during an import.
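In sketch form, that short-circuit could look something like the
following. The types, the dirty flag, and the stub functions are all
invented to show the shape of the idea; they are not the real
fast-import structures:

  /*
   * Sketch only: invented types and stub functions to show the
   * shape of the short-circuit, not the real fast-import code.
   */
  #include <string.h>

  struct tree_content {
  	int dirty;	/* set when the current commit touched this subtree */
  	/* ... entries, subtree pointers ... */
  };

  struct tree_entry {
  	struct tree_content *tree;
  	unsigned char old_sha1[20];	/* SHA-1 from the previous commit */
  	unsigned char new_sha1[20];
  };

  /* Stand-ins: the real code would serialize the tree, hash it,
   * and write the object out through the object_table. */
  static void format_tree(struct tree_content *t, unsigned char *sha1)
  {
  	(void)t;
  	memset(sha1, 0, 20);
  }
  static void store_object(const unsigned char *sha1)
  {
  	(void)sha1;
  }

  static void store_tree(struct tree_entry *e)
  {
  	unsigned char sha1[20];

  	/* Untouched subtree: the old SHA-1 is still good, so do not
  	 * recurse, format, or touch the object_table at all. */
  	if (!e->tree->dirty) {
  		memcpy(e->new_sha1, e->old_sha1, 20);
  		return;
  	}

  	/* ... recurse into the dirty children here ... */

  	format_tree(e->tree, sha1);
  	if (!memcmp(sha1, e->old_sha1, 20)) {
  		/* Same bytes as last time; the object already exists. */
  		memcpy(e->new_sha1, sha1, 20);
  		return;
  	}
  	store_object(sha1);
  	memcpy(e->new_sha1, sha1, 20);
  }

The dirty flag is the part I am hand-waving here; keeping the struct
object_entry* inside struct tree_content, as mentioned above, would
give you the old SHA-1 cheaply.

--
Shawn.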