Re: Questions about partial clone with '--filter=tree:0'

Alexandr Miloslavskiy <alexandr.miloslavskiy@xxxxxxxxxxx> · Mon, 26 Oct 2020 21:08:54 +0100

On 26.10.2020 20:46, Jonathan Tan wrote:
> No - I did talk about prefetching earlier, but here I mean having
> Git on the server perform the "blame" computation itself.

Oh! That's an interesting twist. Unfortunately for us, we are
implementing our own Blame logic. Come to think of it, I'm now becoming
more convinced that graph walking could be the best solution for us,
because it allows any logic, including custom file rename detection.

> For example, let's say I want to run "blame" on foo.txt at HEAD. HEAD
> and HEAD^ are commits that only the local client has, whereas HEAD^^ was
> fetched from the remote. By comparing HEAD, HEAD^, and HEAD^^, Git knows
> which lines come from HEAD and HEAD^. For the rest, Git would make a
> request to the server, passing the commit ID and the path, and would get
> back a list of line numbers and commits.

Sounds quite involved indeed! It's curious how git kind of shifts
towards a classic server-side VCS such as SVN, at least when partial
clones are involved.

> Yes, prefetching will require graph walking with large OID requests but
> will not require protocol changes, as you say. I'm not too worried about
> the large numbers of OIDs - Git servers already have to support
> relatively large numbers of OIDs to support the bulk prefetch we do
> during things like checkout and diff.

Hmm, let's take the Linux repository for the sake of concrete numbers.
The number of commits is ~1M. For a typical Blame (without rename
detection), every request will traverse the trees one level deeper, and
for just one blamed file, that means 0 or 1 trees per commit
(depending on whether the commit modified that tree). The first
request, which discovers the root trees, is going to be the largest and
will ask for (1*numCommits) OIDs. That makes 1M OIDs in the worst case,
with subsequent requests probably at ~0.1M, and there will be one
request per path component in the blamed path.
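
To make the estimate above concrete, here is a rough sketch of the
arithmetic in Python. It is only a model of the numbers in this mail,
not of any real git code: the function name, the example path
"drivers/net/foo.c" and the 0.1 "changed fraction" are all assumptions
chosen to match the ~1M / ~0.1M guesses.

```python
def estimate_tree_requests(num_commits, path, changed_fraction=0.1):
    """Estimate how many tree OIDs each level-by-level request asks for.

    One request is made per path component. The first request fetches
    the root tree of every commit; deeper requests are assumed to need
    only the trees that actually changed (a guessed fraction).
    """
    components = [c for c in path.split("/") if c]
    requests = []
    for depth, component in enumerate(components):
        if depth == 0:
            # Root trees: worst case is one distinct tree per commit.
            oids = num_commits
        else:
            # Deeper levels: assumed fraction of commits touched the tree.
            oids = round(num_commits * changed_fraction)
        # `component` is the path component this request lets us resolve.
        requests.append((component, oids))
    return requests


if __name__ == "__main__":
    # Hypothetical path, Linux-sized history (~1M commits).
    for component, oids in estimate_tree_requests(1_000_000, "drivers/net/foo.c"):
        print(f"resolve {component!r}: ~{oids} OIDs")
```

For a three-component path this gives three requests, the first one
being by far the largest.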

So the question is, will a git server (or git hosting service) become
upset about requests for 1M OIDs? I have never actually measured the
cost of such a request. What do you think?
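
For the wire size alone, a back-of-the-envelope sketch follows. It
assumes each OID is sent as one pkt-line of the form "want <oid>" with
a trailing newline (as in a protocol v2 fetch request) and SHA-1 hex
OIDs. It ignores the rest of the request and says nothing about the
server-side cost of looking the objects up, which is the part I really
don't know.

```python
PKT_LEN_PREFIX = 4  # pkt-line: 4 hex digits of length, counting themselves


def want_line_size(oid_hex_len=40):
    """Bytes for one 'want <oid>\\n' pkt-line (assumed request format)."""
    return PKT_LEN_PREFIX + len("want ") + oid_hex_len + 1  # +1 for '\n'


def request_bytes(num_oids, oid_hex_len=40):
    """Approximate size of the 'want' lines for num_oids objects."""
    return num_oids * want_line_size(oid_hex_len)


if __name__ == "__main__":
    for n in (1_000_000, 100_000):
        print(f"{n} OIDs: ~{request_bytes(n) / 1e6:.0f} MB of want lines")
```

Under these assumptions the 1M-OID request is on the order of 50 MB
before any response is sent.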