On Fri, Mar 16, 2018 at 08:48:49PM +0100, SZEDER Gábor wrote:

> I came up with a different explanation back then: we are only
> interested in commit objects when creating the commit graph, and only
> a small-ish fraction of all objects are commit objects, so the
> "enumerate objects in packfiles" approach has to look at a lot more
> objects:
>
>   # in my git fork
>   $ git rev-list --all --objects |cut -d' ' -f1 |\
>     git cat-file --batch-check='%(objecttype) %(objectsize)' >type-size
>   $ grep -c ^commit type-size
>   53754
>   $ wc -l type-size
>   244723 type-size
>
> I.e. only about 20% of all objects are commit objects.
>
> Furthermore, in order to look at an object it has to be zlib inflated
> first, and since commit objects tend to be much smaller than trees and
> especially blobs, there are a lot fewer bytes to inflate:
>
>   $ grep ^commit type-size |cut -d' ' -f2 |avg
>   34395730 / 53754 = 639
>   $ cat type-size |cut -d' ' -f2 |avg
>   3866685744 / 244723 = 15800
>
> So a simple revision walk inflates less than 1% of the bytes that the
> "enumerate objects in packfiles" approach has to inflate.

I don't think this is quite accurate. It's true that we have to
_consider_ every object, but Git is smart enough not to inflate each
one to find its type. For loose objects we just inflate the header. For
packed objects, we either pick the type directly out of the packfile
header (for a non-delta) or walk the delta chain (without actually
looking at the data bytes!) until we hit the base.

So starting from scratch:

  git cat-file --batch-all-objects \
               --batch-check='%(objecttype) %(objectname)' |
  grep ^commit |
  cut -d' ' -f2 |
  git cat-file --batch

is in the same ballpark for most repos as:

  git rev-list --all |
  git cat-file --batch

though in my timings the traversal is a little faster (and I'd expect
that to remain true when doing it all in a single process, since the
traversal only follows commit links, whereas processing the object list
has to do the type lookup for each object before deciding whether to
inflate it).

I'm not sure, though, whether that edge would remain for incremental
updates. For instance, after we take in some new objects via "fetch",
the traversal strategy would want to do something like:

  git rev-list $new_tips --not --all |
  git cat-file --batch

whose performance depends on the refs _currently_ in the repository,
since we load them as UNINTERESTING tips for the walk. Whereas doing:

  git show-index <.git/objects/pack/the-one-new-packfile.idx |
  cut -d' ' -f2 |
  git cat-file --batch-check='%(objecttype) %(objectname)' |
  grep ^commit |
  cut -d' ' -f2 |
  git cat-file --batch

always scales exactly with the size of the new objects (obviously
that's kind of baroque and this would all be done internally, but I'm
trying to demonstrate the algorithmic complexity).

I'm not sure what the plan would be if we explode loose objects,
though. ;)
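If you want to convince yourself that the type lookups are cheap
compared to actual inflation, one rough check (the numbers will of
course vary with your repository and pack layout) is to time a
header-only type query against a full content dump:

  # types come from the loose-object/pack headers; contents are
  # never fully inflated
  time git cat-file --batch-all-objects \
                    --batch-check='%(objecttype)' >/dev/null

  # this one inflates every object's full contents
  time git cat-file --batch-all-objects --batch >/dev/null

On any repo with sizable blobs the first should finish dramatically
faster.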
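As an aside, for anyone trying to reproduce the numbers above: "avg" is
not a standard tool, and I'm only guessing at your definition, but a
shell function along these lines produces output of the same
"sum / count = average" shape:

  avg () {
    awk '{ sum += $1 } END { print sum " / " NR " = " int(sum / NR) }'
  }

-Peff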