On Tue, Apr 11, 2023 at 10:09:33PM +0800, ZheNing Hu wrote:

> > So reducing the size of the actual --batch printing may make the
> > relative cost of using multiple processes much higher (I didn't apply
> > your --type-filter patches to test myself).
>
> You are right. Adding the --unordered option keeps the time-consuming
> sorting process from affecting the test results.

Just to be clear: it's not the cost of sorting, but rather that
accessing the object contents in a sub-optimal order is much worse (and
that sub-optimal order happens to be "sorted by sha1", since that is
effectively random with respect to the contents).

> time git cat-file --unordered --batch-all-objects \
>   --batch-check="%(objectname) %(objecttype)" | \
>   awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch >/dev/null
>
> 4.17s user 0.23s system 109% cpu 4.025 total
>
> time git cat-file --unordered --batch-all-objects --batch \
>   --type-filter=blob >/dev/null
>
> 3.84s user 0.17s system 97% cpu 4.099 total
>
> It looks like the difference is not significant either.

OK, good, that means we can probably not worry about it. :)

> > In general, I do think having a processing pipeline like this is OK,
> > as it's pretty flexible. But especially for smaller queries (even
> > ones that don't ask for the whole object contents), the per-object
> > lookup costs can start to dominate (especially in a repository that
> > hasn't been recently packed). Right now, even your "--batch
> > --type-filter" example is probably making at least two lookups per
> > object, because we don't have a way to open a "handle" to an object
> > to check its type, and then extract the contents conditionally. And
> > of course with multiple processes, we're naturally doing a separate
> > lookup in each one.
>
> Yes, the type of the object is encapsulated in the header of the loose
> object file or the object entry header of the pack file. We have to
> read it to get the object type. This is a lingering question I have
> had: why does git put the type/size in the file data instead of
> storing it as some kind of metadata elsewhere?

It's not just metadata; it's actually part of what we hash to get the
object id (though of course it doesn't _have_ to be stored in a linear
buffer, as the pack storage shows).

But for loose objects, where would such metadata be? And accessing it
isn't too expensive; we only zlib inflate the first few bytes (the main
cost is in the syscalls to find and open the file).
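To illustrate how cheap that is, here's a minimal standalone sketch (my
illustration, not git's code; it assumes zlib is available and takes
the path of a loose object file as its argument) that inflates just
enough bytes to recover the header:

  #include <stdio.h>
  #include <string.h>
  #include <zlib.h>

  int main(int argc, char **argv)
  {
      unsigned char in[4096], hdr[32];
      z_stream z;
      size_t n;
      FILE *f = argc > 1 ? fopen(argv[1], "rb") : NULL;

      if (!f)
          return 1; /* usage: peek-header .git/objects/xx/yyyy... */

      /* the header is always near the front; 4k of compressed
       * input is far more than we need */
      n = fread(in, 1, sizeof(in), f);
      fclose(f);

      memset(&z, 0, sizeof(z));
      memset(hdr, 0, sizeof(hdr));
      if (inflateInit(&z) != Z_OK)
          return 1;
      z.next_in = in;
      z.avail_in = n;
      z.next_out = hdr;
      z.avail_out = sizeof(hdr); /* inflate only the first few bytes */
      inflate(&z, Z_NO_FLUSH);
      inflateEnd(&z);

      /* a loose object starts with "<type> <size>" and a NUL byte */
      printf("%s\n", (char *)hdr);
      return 0;
  }

Pointing it at any file under .git/objects/ prints something like
"blob 1234", i.e., the type/size your question was about, without ever
inflating the rest of the contents.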
For packed objects, it effectively is metadata, just stuck at the front
of the object contents rather than in a separate table. That lets us
use the same .idx file for finding that metadata as we do for the
contents themselves (at the slight cost that if you're _just_ accessing
metadata, the reads are sparser within the file, which behaves worse on
cold-cache disks).

But when I say that lookup costs dominate, what I mean is that we'd
spend a lot of our time binary searching within the pack .idx file, or
falling back to syscalls to look for loose objects.

> > So a nice thing about being able to do the filtering in one process
> > is that we could _eventually_ do it all with one object lookup. But
> > I'd probably wait on adding something like --type-filter until we
> > have an internal single-lookup API, and then we could time it to see
> > how much speedup we can get.
>
> I am highly skeptical of this "internal single-lookup API". Do we
> really need an extra metadata table to record all objects?
> Something like: metadata: {oid: type, size}?

No, I don't mean changing the storage at all. I mean that rather than
doing this:

  /* get type, size, etc, for --batch format */
  type = oid_object_info(&oid, &size);

  /* now get the contents for --batch to write them itself; but note
   * that this searches for the entry again within all packs, etc */
  contents = read_object_file(oid, &type, &size);

as the cat-file code now does (because the first call is in
batch_object_write(), and the latter in print_object_or_die()), they
could be a single call that does the lookup once.

We could actually do that today, since the object contents are
eventually fed from oid_object_info_extended(), and we know ahead of
time that we want both the metadata and the contents. But that wouldn't
work if we filtered by type, etc.

I'm not sure how much of a speedup it would yield in practice, though.
If you're printing the object contents, then the extra lookup is
probably not that expensive by comparison.
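Just to sketch what that single call might look like (this is
illustrative, not the actual cat-file code; it leans on the internal
object_info API from object-store.h, where filling in the contentp
field requests the payload in the same lookup that finds the type and
size, and it assumes an `oid` already in scope):

  struct object_info oi = OBJECT_INFO_INIT;
  enum object_type type;
  unsigned long size;
  void *contents;

  oi.typep = &type;
  oi.sizep = &size;
  oi.contentp = &contents; /* ask for the contents up front */

  /* one pack-idx search (or loose-object open) serves both the
   * --batch metadata and the bytes we're about to print */
  if (oid_object_info_extended(the_repository, &oid, &oi, 0) < 0)
      die("unable to read %s", oid_to_hex(&oid));

-Peff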