Re: [External] Re: git-blame extremely slow in partial clones due to serial object fetching

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Fri, 22 Nov 2024 09:55:35 -0800

Shubham Kanodia <shubham.kanodia10@xxxxxxxxx> writes:
> 
> 
> On 22/11/24 1:59 pm, Junio C Hamano wrote:
> > Shubham Kanodia <shubham.kanodia10@xxxxxxxxx> writes:
> > 
> >> Junio — would it make sense to add an option (and config) for `git
> >> blame` that limits how far back it looks for fetching blobs?
> > 
> > No, I do not think it would.
> > 
> > What would our workaround for the next one when people say "oh, 'git
> > log -p' fetches blobs on demand and latency kills me"?  Yet another
> > such an option only for 'git log'?
> > 
> 
> I'm guessing `git log` already provides options to limit history using 
> `-n` or `--since` so ideally it's not unbounded if you use those, unlike 
> with `git blame`?

`git blame` also has options. See "SPECIFYING RANGES" in its man page,
which teaches you how to specify revision ranges (and also line ranges,
but that is not relevant here).
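
As a minimal sketch of what that looks like, the script below builds a
throwaway repository (the file name, commit messages, and identity are
made up for illustration) and runs `git blame` with a revision range.
Lines that have not changed since the range boundary are attributed to
the boundary commit, shown with a leading `^`, and blame does not walk
past it.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Three commits, each changing one more line of the same file.
printf 'one\ntwo\nthree\n' >file.txt
git add file.txt
git commit -q -m "initial"
printf 'one\nTWO\nthree\n' >file.txt
git commit -q -am "change line 2"
printf 'one\nTWO\nTHREE\n' >file.txt
git commit -q -am "change line 3"

# Only consider commits after HEAD~1; older history is not walked.
git blame HEAD~1.. -- file.txt

# Count the lines attributed to the boundary commit ('^' prefix).
boundary_lines=$(git blame HEAD~1.. -- file.txt | grep -c '^\^' || true)
echo "lines attributed to the boundary commit: $boundary_lines"
```

The same idea works with a date instead of a revision, e.g.
`git blame --since=3.weeks -- file.txt`.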

> I understand our concerns regarding adding new config options though. 
> Between the solutions discussed in this thread — batching, adding server 
> side support, (or another) — what do you think could be a good track to 
> pursue here because this makes using `git blame` on larger partially 
> cloned repos a possible footgun.

Typically questions like this should be answered by the person who is
actually going to pursue the track. If you'd like to pursue a track but
don't know which to pursue, maybe start with what you believe the best
solution to be. It seems that you think that limiting either the number
of blobs fetched or the time range of the commits to be considered is
best, so maybe you could try one of them.

Limiting the time range is already possible, so I'll provide my ideas
about limiting the number of blobs fetched. You can detect when
a blob is missing (and therefore needs to be fetched) by a flag in
oid_object_info_extended() (or use has_object()), so you can count the
number of blobs fetched as the blame is being run. My biggest concern
is that there is no good limit - I suspect that for a file that is
extensively changed, 10 blobs is too few and you'll need something like
50 blobs. But since the blobs are fetched one at a time, 50 blobs means
50 RTTs, which might also be too much for an end user. In any case, you
know your users' needs better than we do.
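
To get a feel for the numbers before picking a limit, here is a rough
sketch that estimates, from outside the blame process, how many versions
of a file are missing locally. This is not the in-process counter
described above; it uses `git rev-list --objects --missing=print`, which
lists absent objects with a leading `?` without fetching them. The
repository layout and file name are made up for illustration, and the
count is only an upper bound on what a full blame might need to fetch.

```shell
set -e
work=$(mktemp -d)

# Throwaway "server" repository with three versions of one file.
git init -q "$work/src"
cd "$work/src"
git config user.name "Example"
git config user.email "example@example.com"
git config uploadpack.allowFilter true   # allow clients to use --filter
for text in one two three; do
    echo "$text" >file.txt
    git add file.txt
    git commit -q -m "$text"
done

# Blobless partial clone; --no-checkout leaves even the tip blob unfetched.
git clone -q --no-checkout --filter=blob:none "file://$work/src" "$work/dst"
cd "$work/dst"

# Missing objects are printed as "?<oid>"; count them for this path.
missing=$(git rev-list --objects --missing=print HEAD -- file.txt |
    grep -c '^?' || true)
echo "blobs a full blame of file.txt might need to fetch: $missing"
```

With serial fetching, each of those blobs costs roughly one round trip,
so this count multiplied by your users' typical RTT gives a ballpark for
the worst-case latency.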