On Tue, Jun 1, 2021 at 3:40 PM Derrick Stolee <stolee@xxxxxxxxx> wrote:
> > you want to be able to get something useful to the user as fast as
> > possible [...] but where a user might later (eg overnight) want to get the
> > rest of the repo, to avoid history inconsistency issues.
>
> As you describe below, the inconsistency is in terms of performance,
> not correctness. I thought it was worth a clarification.

Sorry I was not clear here - I did not mean formal correctness or performance when referring to the incentive to get the rest of the repo. I was referring to the fact that a medium-shallow clone (eg 15 months of a 20-year project) provides an inconsistent perspective on the code history:

* On the one hand, most of the time you have everything you need, and when you bump up against *available* history limits in a file or branch history view, it's reasonably clear that's what's happening (some UI tools make this more explicit than others).

* On the other hand, when you happen to look at something older, it is easy for the history to seem to "lie", showing changes made in a file by a person who really *didn't* make those changes. Their commit just happened to be selected as the shallow graft, and so appears to have "added" all the files in the project. This is reasonably intelligible when looking at file history, but extremely non-obvious when looking at git blame (in a medium-shallow clone).

> I'm aware that the first 'git blame' on a file is a bit slow in the
> partial clone case.

Without wanting to harp on about it, it can easily be pathologically slow: in my case a random well-trafficked file has 300 in-scope commits, at roughly 10 seconds per independent blob fetch, so the first git blame on such a file (as you noted, it's the first one that hurts) ends up taking close to an hour.

> It's been on my list for improvement whenever I
> get the "spare" time to do it. However, if someone else wants to work
> on it I will briefly outline the approach I was going to investigate:

One reason I wasn't asking about / angling for this, particularly, is that I expect there will be other tools doing their own versions of this. I haven't tested "tig" on this, for example, but I suspect it doesn't do a plain git blame, given what I've seen of it instantly showing the file contents and "gradually" filling in the authorship data. I for one rarely use plain git blame, and I don't know much about the usage patterns of other users. Most of "my" users will be using IntelliJ IDEA, which seems to have a surprisingly solid/scalable git integration (though I have not yet tested this case there).

There are also other related reasons to go for a "get most of the relevant blobs across history" approach, specifically around tooling: there are lots of tools & integrations that use git libraries (or even homebrew implementations) rather than the git binaries / IPC, and many of those tend to lag *far* behind in support for things like shallow clone, partial clone, mailmap, core.splitindex, replace refs, etc. My current beef is with Sublime Merge, which is as snappy as one could wish for and really lovely to use within its scope, but doesn't have any idea what a promisor is, and simply says "nah, no content here" when you look at a missing blob (for the moment).

> > the most "blameable"
> > files will tend to be the larger ones... :)
>
> I'm interested in this claim that 'the most "blameable" files will
> tend to be the larger ones.' I typically expect blame to be used on
> human-readable text files, and my initial reaction is that larger
> files are harder to use with 'git blame'.

Absolutely - I meant "the larger text/code files", not including other stuff that tends to accumulate in the higher filesize brackets. I meant that I, for one, in this project at least, often find myself using git blame (or equivalent) to "spelunk" into who touched a specific line, in cases where looking at the plain history is useless because there have been many hundreds or thousands of changes - and in my limited experience, files with that many reasons to change tend to be large.

> Your concern about slow commands is noted, but also blindly
> downloading every file in history will slow the repo due to the
> full size of the objects on disk.

I have in the past claimed that a "larger repo" (specifically, a deeper clone that gets many larger blobs) is slower, but haven't actually found any significant evidence to back my claim. Obviously something like "git gc" will be slower, but is there anything in the practical day-to-day that cares whether the commit depth is 10,000 commits or 200,000 commits for a given branch, or whether you only have the blobs at the "tip" of the branch/project rather than all the blobs in history? (besides GC, specifically)

> it would be good to design such a feature to have other
> custom knobs, such as:
> * Get only "recent" history, perhaps with a "--since=<date>"
>   kind of flag. This would walk commits only to a certain date,
>   then find all missing blobs reachable from their root trees.

As long as you know at initial clone time that this is what you want, combining a shallow clone with a partial-clone filter already enables this today (shallow clone, set up the filter, unshallow, and potentially remove the filter). You can even do more complicated things like unshallowing in multiple steps/fetches over different time periods, with increasingly aggressive filters.
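For concreteness, something like the following rough sequence - the URL, date and filter values are just placeholders, and exact behavior varies a bit across git versions (older gits may require setting extensions.partialClone / remote.origin.promisor before they accept a filtered fetch):

  # 1. fast initial clone: shallow, tip commit only
  git clone --depth=1 https://example.com/big-repo.git
  cd big-repo

  # 2. deepen to cover "recent" history, with a moderate blob filter
  git fetch --shallow-since=2020-03-01 --filter=blob:limit=200k origin

  # 3. unshallow the rest under a stricter filter (no blobs at all)
  git fetch --unshallow --filter=blob:none origin

  # 4. later, if a filter got recorded for the remote, optionally drop
  #    it so future fetches are unfiltered again
  git config --unset remote.origin.partialclonefilter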
The main challenge that I perceive at the moment is that you're effectively locked into "one shot": as soon as you've retrieved the commits with blobs missing, "filling them in" at scale seems to be orders of magnitude more expensive than an equivalent clone would have been.

> If we had a refiltering feature, then you could even
> start with a blobless clone to have an extremely fast initial
> clone, followed by a background job that downloads the remaining
> objects.

Yes please!

I think one thing that I'm not clearly understanding yet in this conversation is whether the tax on explicit and specialized blob-list fetching could be made much lower. As far as I can tell, in a blobless clone with full trees we have most of the data one could want to decide which blobs to request - paths, filetypes, and commit dates. This leaves three pain points that I am aware of:

* Filesizes are not (afaik) available in a blobless clone. This sounds like a pretty deep limitation, which I'll gloss over.

* Blob paths are available in trees, but not trivially exposed by git rev-list - could a new "--missing" option value make sense? Or does it make just as much sense to expect the caller/scripter to iterate over ls-tree outputs? (I assume doing so would be much slower, but have not tested.)

* Something about the "git fetch <remote> blob-hash ..." pattern seems to scale very poorly - is that something that might see change in future, or is it a fundamental issue?

Thanks again for the detailed feedback!

Tao
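P.S. In case it helps make the last bullet concrete, the kind of scripted blob-list fetch I have in mind looks roughly like this (a sketch only - it assumes "origin" is the promisor remote, and the batch size is arbitrary):

  # list missing (filtered-out) objects reachable from HEAD, strip the
  # '?' prefix that --missing=print adds, and fetch them in batches
  git rev-list --objects --missing=print HEAD \
      | sed -n 's/^?//p' \
      | xargs -n 512 git fetch origin

Each xargs batch is a separate fetch, with its own connection and negotiation, which I assume is part of the cost.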