Re: git-blame extremely slow in partial clones due to serial object fetching

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Wed, 20 Nov 2024 10:52:28 -0800

Burke Libbey <burke.libbey@xxxxxxxxxxx> writes:
> The core issue appears to be in fill_origin_blob(), which is called
> individually for each blob needed during the blame process. While the blame
> algorithm does need blob contents to make detailed line-matching decisions,
> it seems like we don't necessarily need the contents just to determine which
> blobs we'll examine.

Technically, we do need the contents, because the contents determine
whether we are done with the blame (all lines are accounted for)
and whether we need to start looking at the blob at a different path
(because there was a rename).

> It seems like this could be optimized by batch-fetching the needed objects
> upfront, rather than fetching them one at a time. This would convert O(n)
> round-trips into a small number of batch fetches.

That is one possible way (assuming you mean that whenever "git blame"
notices that a blob is missing, it should walk the commits until a
certain depth, collecting all the object IDs for a given path, and
prefetching all of them). This runs the risk of overfetching, since,
as I stated above, only the contents tell us when the blame is done
or has moved to a different path, but perhaps overfetching is an
acceptable tradeoff for speed.
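
To make the client-side idea concrete, here is a rough sketch of the
"walk to a certain depth, collect, prefetch" step, written as a
standalone Python script rather than as a patch to blame. Everything
in it is an assumption for illustration: the helper names
(collect_blob_ids, prefetch), the depth cutoff, the remote name
"origin", a clone made with a blob:none filter (so trees are local),
and a server that accepts fetching blobs by bare object ID. It does
not follow renames, and it does not skip blobs that are already
present locally, so it can overfetch.

```python
import subprocess
from typing import Callable, List


def run_git(args: List[str]) -> str:
    """Run a git command in the current repository and return its stdout."""
    result = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def collect_blob_ids(
    path: str, depth: int, git: Callable[[List[str]], str] = run_git
) -> List[str]:
    """Collect the distinct blob IDs of `path` in up to `depth` commits.

    Hypothetical helper: only commits that touch `path` are visited, and
    renames are not followed.
    """
    commits = git(["rev-list", f"--max-count={depth}", "HEAD", "--", path]).split()
    seen = set()
    oids: List[str] = []
    for commit in commits:
        # ls-tree output is "<mode> <type> <oid>\t<name>"; it reads tree
        # objects only, so it should not need the blob contents.
        for line in git(["ls-tree", commit, "--", path]).splitlines():
            meta, _name = line.split("\t", 1)
            _mode, kind, oid = meta.split()
            if kind == "blob" and oid not in seen:
                seen.add(oid)
                oids.append(oid)
    return oids


def prefetch(
    oids: List[str],
    remote: str = "origin",
    git: Callable[[List[str]], str] = run_git,
) -> None:
    """Ask for all collected blobs in one fetch instead of one per blob."""
    if oids:
        # Assumption: the server allows wants for arbitrary object IDs.
        git(["fetch", "--no-tags", "--no-write-fetch-head", remote, *oids])
```

One would call something like prefetch(collect_blob_ids("some/file.c",
depth=50)) before running "git blame", trading one larger download for
many small round trips. The `git` parameter exists only so that the
logic can be exercised without a real repository.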

There are other ways:

 - If we can teach the client to collect object IDs for prefetching,
   perhaps it would be just as easy to teach the server. We could
   instead make filter-by-path an acceptable argument to pass to "fetch
   --filter", then teach the lazy fetch to use that argument. This also
   opens the door to future performance improvements - since the server
   has all the objects, it can give us precisely the objects that we
   need, and not just give us a quantity of objects based on a heuristic
   (so the client does not need to say "give me 10, and if I need more,
   I'll ask you again", but can say "give me all I need to complete
   the blame"). This, however, relies on server implementers to implement
   and turn on such a feature.

 - We could also teach the server to "blame" a file for us and then
   teach the client to stitch together the server's result with the
   local findings, but this is more complicated.

It may also be possible that even if we fix this issue, the scale of the
repos involved might be such that a user would rather "blame" over the
network (e.g. using a web UI) than download all the relevant blobs (even
if the blobs were batched into one download).

So...there are ideas for solutions, but I don't think anyone has
analyzed them (or tried them) yet.