On Sun, Apr 09, 2023 at 02:47:30PM +0800, ZheNing Hu wrote:

> > Perhaps slightly so, since there is naturally going to be some
> > duplicated effort spawning processes, loading any shared libraries,
> > initializing the repository and reading its configuration, etc.
> >
> > But I'd wager that these are all a negligible cost when compared to
> > the time we'll have to spend reading, inflating, and printing out
> > all of the objects in your repository.
>
> What you said makes sense. I implemented the --type-filter option for
> git cat-file and compared the performance of outputting all blobs in
> the git repository with and without the type filter. I found that the
> difference was not significant.
>
>   time git cat-file --batch-all-objects \
>       --batch-check="%(objectname) %(objecttype)" |
>     awk '{ if ($2 == "blob") print $1 }' |
>     git cat-file --batch > /dev/null
>   17.10s user 0.27s system 102% cpu 16.987 total
>
>   time git cat-file --batch-all-objects --batch --type-filter=blob >/dev/null
>   16.74s user 0.19s system 95% cpu 17.655 total
>
> At first, I thought the processes that provide all the blob oids by
> using git rev-list or git cat-file --batch-all-objects --batch-check
> might waste cpu, io, and memory resources, because they need to read
> a large number of objects which are then read again by git cat-file
> --batch. However, it seems that this is not actually the performance
> bottleneck.

Yeah, I think most of your time there is spent on the --batch command
itself, which is just pushing through a lot of bytes.

You might also try with "--unordered". The default ordering for
--batch-all-objects is sha1 order, which has pretty bad locality
characteristics for delta caching. Using --unordered goes in pack
order, which should be optimal. E.g., in git.git, running:

  time \
    git cat-file --batch-all-objects \
        --batch-check='%(objecttype) %(objectname)' |
    perl -lne 'print $1 if /^blob (.*)/' |
    git cat-file --batch >/dev/null

takes:

  real    0m29.961s
  user    0m29.128s
  sys     0m1.461s

Adding "--unordered" to the initial cat-file gives:

  real    0m1.970s
  user    0m2.170s
  sys     0m0.126s
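That is, the same pipeline with the flag added to the first cat-file:

  time \
    git cat-file --batch-all-objects --unordered \
        --batch-check='%(objecttype) %(objectname)' |
    perl -lne 'print $1 if /^blob (.*)/' |
    git cat-file --batch >/dev/null

The second cat-file then sees its input in pack order as well, which
is presumably where most of the win is, since that is the process
doing the expensive inflation.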
So reducing the time spent on the actual --batch printing may make the
relative cost of using multiple processes much higher (I didn't apply
your --type-filter patches to test myself).

In general, I do think having a processing pipeline like this is OK,
as it's pretty flexible. But especially for smaller queries (even ones
that don't ask for the whole object contents), the per-object lookup
costs can start to dominate (especially in a repository that hasn't
been recently packed). Right now, even your "--batch --type-filter"
example is probably making at least two lookups per object, because we
don't have a way to open a "handle" to an object to check its type and
then extract the contents conditionally. And of course with multiple
processes, we're naturally doing a separate lookup in each one.
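As a sketch of such a small query (I haven't timed this one): asking
only for blob sizes never inflates any contents, so the two rounds of
per-object lookup are most of the work the pipeline does:

  # First cat-file: one lookup per object to learn its type. Second
  # cat-file: another lookup for each oid fed to it, just to report a
  # size. No object contents are printed at any point.
  git cat-file --batch-all-objects --unordered \
      --batch-check='%(objectname) %(objecttype)' |
    awk '$2 == "blob" { print $1 }' |
    git cat-file --batch-check='%(objectsize)' >/dev/null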
So a nice thing about being able to do the filtering in one process is
that we could _eventually_ do it all with one object lookup. But I'd
probably wait on adding something like --type-filter until we have an
internal single-lookup API, and then we could time it to see how much
speedup we can get.

-Peff