On 26.10.2020 19:24, Jonathan Tan wrote:
Sorry for the late reply - I have been out of office for a while.
I'm quite happy to get the replies at all, even if late. Thanks!
As Taylor said in another email, it's good for some use cases but perhaps not for the "blame" one that you describe later.
OK, so our expectations seem to match yours; that's good.
That's true. I made some progress with cbe566a071 ("negotiator/noop: add noop fetch negotiator", 2020-08-18), which adds a no-op negotiator so the client never reports its own commits as "have". But as you said in another email, we still run into the problem that if we already have the commit that we're fetching, we won't fetch it.
Right, I already discovered 'fetch.negotiationAlgorithm=noop' and gave it a quick try, but it didn't seem to help at all.
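For the record, this is what I tried:

    git -c fetch.negotiationAlgorithm=noop fetch origin

(or persistently via "git config fetch.negotiationAlgorithm noop".)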
To clarify: we partially support the last point - "git clone" now supports "--sparse". When used with "--filter", only the blobs in the sparse checkout specification will be fetched, so users are already able to download only the objects in a specific path.
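For example, something along these lines works today (the URL is just a placeholder):

    # clone with no blobs and a minimal sparse checkout
    git clone --filter=blob:none --sparse https://example.com/repo.git
    cd repo
    # populate only the path of interest; its blobs are fetched on demand
    git sparse-checkout set path/of/interest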
I see. Still, it seems that this would solve two of the other problems.
Having said that, I think you also want the histories of these objects, so admittedly this is not complete for your use case.
Right.
Having such an option (and teaching "blame" to use it to prefetch) would indeed speed up "blame". But if we implement this, what would happen if the user ran "blame" on the same file twice? I can't think of a way of preventing the same fetch from happening twice except by checking for the existence of, say, the last 10 OIDs corresponding to that path. But if we have the list of those 10 OIDs, we could just prefetch those 10 OIDs without needing a new filter.
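As a rough sketch - assuming a blob:none clone (so the trees are already local) and a server that allows fetching arbitrary OIDs (uploadpack.allowAnySHA1InWant) - that prefetch could look like:

    # collect the blob OIDs of the path in the last 10 commits that
    # touched it, then fetch them all in a single request
    git rev-list -10 HEAD -- path/to/file |
    while read commit
    do
            git rev-parse "$commit:path/to/file"
    done |
    xargs git fetch origin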
I must admit that I didn't notice this problem. Still, it seems easy enough to solve with this approach:
1) Estimate the number of missing objects.
2) If "many", just download everything for <path> as described before and consider it done.
3) If "not so many", assemble a list of OIDs on the boundary of the unknown (for example, the root tree OIDs of all commits that are missing any trees) and use the usual fetch to download all of those OIDs in one go.
4) Repeat step 3 until nothing is missing. Only N = <maximum tree depth> requests are needed, regardless of the number of commits.
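To illustrate, the loop in steps 3) and 4) might look roughly like this in shell (again assuming the promisor remote permits fetching arbitrary OIDs, i.e. uploadpack.allowAnySHA1InWant on the server):

    # repeat until nothing along the path is missing; at most
    # <maximum tree depth> rounds are needed
    while :
    do
            missing=$(git rev-list --objects --missing=print HEAD -- path/to/file |
                      sed -n 's/^?//p')
            test -z "$missing" && break
            # one fetch per round, requesting all boundary OIDs at once;
            # noop negotiation keeps the request small
            echo "$missing" | xargs git -c fetch.negotiationAlgorithm=noop fetch origin
    done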
Another issue (but a smaller one) is that this does not fetch all the objects necessary if the file being "blame"d has been renamed, but that is probably solvable - we can just refetch with the old name.
Right, we also discussed this and figured that we'd just query more things as needed. Maybe also individual other blobs for rename detection.
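Something like this could give us the candidate old names whose blobs to prefetch:

    # list earlier names of the file across renames; "--format="
    # suppresses the commit headers, leaving only the path names
    git log --follow --name-only --format= HEAD -- path/to/file | sort -u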
Another possible solution that has been discussed before (but a much more involved one) is to teach Git to serve the results of computations, and then have "blame" stitch those results together with local data. (For example, "blame" could check the history of a certain path to find the commit(s) that the remote has information about, query the remote for those commits, and then stitch the results together with local history.) This scheme would work not only for "blame" but also for things like "grep" (with history) and "log -S", whereas "--filter=sparse:pathlist" would only work for "blame". But admittedly, this solution is more involved.
I understand that you're basically talking about implementing prefetching in git itself? To my understanding, this would still need either the command I suggested, or graph walking with massive OID requests as described in steps 1)-4) above. The latter would not require protocol changes, but would involve sending quite a few OIDs around.