On Mon, Mar 01, 2021 at 01:20:26PM +0100, Patrick Steinhardt wrote:

> Altogether, this ends up with the following queries, both of which have
> been executed in a well-packed linux.git repository:
>
>     # Previous query which uses object names as a heuristic to filter
>     # non-blob objects, which bars us from using bitmap indices because
>     # they cannot print paths.
>     $ time git rev-list --objects --filter=blob:limit=200 \
>         --object-names --all | sed -r '/^.{,41}$/d' | wc -l
>     4502300
>
>     real    1m23.872s
>     user    1m30.076s
>     sys     0m6.002s
>
>     # New query.
>     $ time git rev-list --objects --filter-provided \
>         --filter=object:type=blob --filter=blob:limit=200 \
>         --use-bitmap-index --all | wc -l
>     22585
>
>     real    0m19.216s
>     user    0m16.768s
>     sys     0m2.450s

Those produce very different answers. I guess because in the first one,
you still have a bunch of tree objects, too. You'd do much better to get
the actual types from cat-file, and filter on that. That also lets you
use bitmaps for the traversal portion. E.g.:

  $ time git rev-list --use-bitmap-index --objects --filter=blob:limit=200 --all |
         git cat-file --buffer --batch-check='%(objecttype) %(objectname)' |
         perl -lne 'print $1 if /^blob (.*)/' | wc -l
  14966

  real    0m6.248s
  user    0m7.810s
  sys     0m0.440s

which is faster than what you showed above (this is on linux.git, but my
result is different; maybe you have more refs than me?). But we should
be able to do better purely internally, so I suspect my computer is just
faster (or maybe your extra refs just aren't well-covered by bitmaps).
Running with your patches I get:

  $ time git rev-list --objects --use-bitmap-index --all \
         --filter-provided --filter=object:type=blob \
         --filter=blob:limit=200 | wc -l
  16339

  real    0m1.309s
  user    0m1.234s
  sys     0m0.079s

which is indeed faster. It's quite curious that the answer is not the
same, though! I think yours has some bugs. If I sort and diff the
results, I see some commits mentioned in the output. Perhaps this is
--filter-provided not working, as they all seem to be ref tips.

> To be able to more efficiently answer this query, I've implemented
> multiple things:
>
>     - A new object type filter `--filter=object:type=<type>` for
>       git-rev-list(1), which is implemented both for normal graph walks and
>       for the packfile bitmap index.
>
>     - Given that above usecase requires two filters (the object type
>       and blob size filters), bitmap filters were extended to support
>       combined filters.

That's probably reasonable, especially because it lets us use bitmaps. I
do have a dream that we'll eventually be able to support more extensive
formatting via log/rev-list, which would allow:

  git rev-list --use-bitmap-index --objects --all \
               --format='%(objecttype) %(objectname)' |
  perl -lne 'print $1 if /^blob (.*)/'

That should be faster than the separate cat-file (which has to re-lookup
each object, in addition to the extra pipe overhead), but I expect the
--filter solution should always be faster still, as it can very quickly
eliminate the majority of the objects at the bitmap level.

>     - git-rev-list(1) doesn't filter user-provided objects and always prints
>       them. I don't want the listed commits though and only their referenced
>       potential LFS blobs. So I've added a new flag `--filter-provided`
>       which marks all provided objects as not-user-provided such that they
>       get filtered the same as all the other objects.

Yeah, this "user-provided" behavior was quite a surprise to me when I
started implementing the bitmap versions of the existing filters. It's
nice to have the option to specify which you want.

-Peff
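
For anyone reproducing the sort-and-diff comparison mentioned above, here
is a minimal sketch of one way to do it. The function name only_in_first
and the file names in the usage comment are hypothetical. It assumes each
input file holds one object ID per line, optionally followed by a space
and a path, as rev-list --objects prints them.

```shell
# only_in_first FILE_A FILE_B
# Print the object IDs that appear in FILE_A but not in FILE_B.
# Runs in a subshell so the temporary variables do not leak.
only_in_first() (
	a=$(mktemp) || exit 1
	b=$(mktemp) || { rm -f "$a"; exit 1; }

	# Keep only the object ID column; sort with the same collation
	# that comm uses, since comm requires sorted input.
	cut -d' ' -f1 "$1" | LC_ALL=C sort -u >"$a"
	cut -d' ' -f1 "$2" | LC_ALL=C sort -u >"$b"

	# -23 suppresses lines unique to FILE_B and lines common to both.
	LC_ALL=C comm -23 "$a" "$b"

	rm -f "$a" "$b"
)

# Usage (hypothetical file names):
#   git rev-list ... --filter-provided ... >filtered.out
#   git rev-list ... | git cat-file ... | perl ... >catfile.out
#   only_in_first filtered.out catfile.out | git cat-file --batch-check
```

Piping the leftover IDs into git cat-file --batch-check prints each
object's type, which shows whether the extras are commits.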