On Wed, Jun 21, 2023 at 12:10:33PM +0200, Tao Klerks wrote:

> > This is not very efficient, but:
> >
> >   git cat-file --batch-check='%(objectname)' --batch-all-objects --unordered |
> >   grep $some_sha1
> >
> > will tell you whether we have the object locally.
> >
>
> Thanks so much for your help!
>
> in Windows (msys or git bash) this is still very slow in my repo with
> 6,500,000 local objects - around 60s - but in linux on the same repo
> it's quite a lot faster, at 5s. A large proportion of my users are on
> Windows though, so I don't think this will be "good enough" for my
> purposes, when I often need to check for the existence of dozens or
> even hundreds of commits.

Yeah, it's just a lot of object names to print, most of which you don't
care about. :)

The more efficient thing would be to open the actual pack .idx files and
look for the names via binary search. I don't think you can convince git
to do that, though I suspect you could write a trivial libgit2 program
that does.

> > I don't work with partial clones often, but it feels like being able to
> > say:
> >
> >   git --no-partial-fetch cat-file ...
> >
> > would be a useful primitive to have.
>
> It feels that way to me, yes!
>
> On the other hand, I find very little demand for it when I search "the
> internet" - or I don't know how to search for it.

I think partial clones are still new enough that not many people are
using them heavily. And when they do, they are not managing the partial
state at a very advanced level; I think tools for pruning locally cached
objects (which you could refetch) are only just being worked on now.

> > It does seem like you might be able to bend it to
> > your will here, though. I think without any patches that:
> >
> >   git rev-list --objects --exclude-promisor-objects $oid
> >
> > will tell you whether we have the object or not (since it turns off
> > fetch_if_missing, and thus will either succeed, printing nothing, or
> > bail if the object can't be found).
>
> This behaves in a way that I don't understand:
>
> In the repo that I'm working in, this command runs successfully
> *without fetching*, but it takes a *very* long time - 300+ seconds -
> much longer than even the "inefficient" 'cat-file'-based printing of
> all (6.5M) local object ids that you proposed above. I haven't
> attempted to understand what's going on in there (besides running with
> GIT_TRACE2_PERF, which showed nothing interesting), but the idea that
> git would have to work super-hard to find an object by its ID seems
> counter to everything I know about it. Would there be value in my
> trying to understand & reproduce this in a shareable repo, or is there
> already an explanation as to why this command could/should ever do
> non-trivial work, even in the largest partial repos?

I think it's actually doing the gigantic traversal (and just limiting it
when it sees objects that are not available). You probably want
"--no-walk" at least, but really you don't even want to walk the trees
of any commits you specify (so you'd want to omit "--objects" if you are
asking about a commit, and otherwise include it, which is slightly
awkward).

> > It feels like --missing=error should
> > function similarly, but it seems to still lazy-fetch (I guess since it's
> > the default, the point is to just find truly unavailable objects). Using
> > --missing=print disables the lazy-fetch, but it seems to bail
> > immediately if you ask it about a missing object (I didn't dig, but my
> > guess is that --missing is mostly about objects we traverse, not the
> > initial tips).
>
> Woah, "--missing=print" seems to work!!!
>
> The following gives me the commit hash if I have it locally, and an
> error otherwise - consistently across linux and windows, git versions
> 2.41, 2.39, 2.38, and 2.36 - without fetching, and without crazy
> CPU-churning:
>
>   git rev-list --missing=print -1 $oid
>
> Thank you thank you thank you!

Hmph, I thought I tried that before and it didn't work, but it seems to
work for me now. I guess I was hoping to have it print the missing
object rather than exiting with an error, but if you do one object at a
time then the error is sufficient signal. :)

You might want "--objects" if you're going to ask about non-commits.
Though it might not be necessary. I suspect Git would bail trying to
look up the object in the first place if we don't have it, and if we do
have it then it just becomes a silent noop.

> I feel like I should try to work something into the doc about this,
> but I'm not sure how to express this: "--missing=error is the default,
> but it doesn't actually error out when you're explicitly asking about
> a missing commit, it fetches it instead - but --missing=print actually
> *does* error out if you explicitly ask about a missing commit" seems
> like a strange thing to be saying.

I think we are relying on the side effect that everything except
--missing=error will turn off auto-fetching. I don't know if that's
something we'd want to document. It seems reasonable to me that we might
later change the implementation so that we kick in the --missing
behavior only after parsing the initial list of traversal tips (I mean,
I don't know why we would do that in particular, but it seems like the
kind of thing we'd want to reserve as an implementation detail subject
to change).

I do think in the long run that a big "--do-not-lazy-fetch" flag would
be the right solution to let the user tell us what they want.

> Thanks again for finding me an efficient working strategy here!

I'm glad it worked. I was mostly just thinking out loud. ;)

-Peff
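
For checking dozens or hundreds of commits, the one-object-at-a-time
approach discussed above could be wrapped roughly as follows. This is
only a sketch: the function names have_object and check_objects are made
up for illustration, it assumes full (unabbreviated) commit IDs, and it
leans on the undocumented side effect that --missing=print turns off
lazy fetching. Because that behavior is an implementation detail, the
sketch treats either a non-zero exit or a "?"-prefixed line (the marker
--missing=print uses for missing objects) as "not present".

```shell
#!/bin/sh

# have_object OID (hypothetical helper): succeed only if OID looks
# present in the local object store. Assumes OID is a full commit ID.
have_object() {
    # A failing rev-list (e.g. "bad object") means we do not have it.
    out=$(git rev-list --missing=print -1 "$1" 2>/dev/null) || return 1
    # Defensive assumption: if git reports the tip as "?<oid>" instead
    # of failing, treat that as missing too.
    case "$out" in
        *"?$1"*) return 1 ;;
    esac
    return 0
}

# check_objects (hypothetical helper): read one object ID per line on
# stdin and print "present <oid>" or "missing <oid>" for each.
check_objects() {
    while read -r oid; do
        if have_object "$oid"; then
            printf 'present %s\n' "$oid"
        else
            printf 'missing %s\n' "$oid"
        fi
    done
}
```

With the functions loaded, something like "check_objects <oids.txt" run
inside the repository would report on each ID in turn. It still spawns
one git process per object, so it trades the cost of one big object
listing for many small lookups.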