On Tue, Apr 11, 2023 at 04:14, Jeff King <peff@xxxxxxxx> wrote:
>
> On Sun, Apr 09, 2023 at 02:47:30PM +0800, ZheNing Hu wrote:
>
> > > Perhaps slightly so, since there is naturally going to be some
> > > duplicated effort spawning processes, loading any shared libraries,
> > > initializing the repository and reading its configuration, etc.
> > >
> > > But I'd wager that these are all a negligible cost when compared to the
> > > time we'll have to spend reading, inflating, and printing out all of the
> > > objects in your repository.
> >
> > What you said makes sense. I implemented the --type-filter option for
> > git cat-file and compared the performance of outputting all blobs in the
> > git repository with and without the type filter. I found that the
> > difference was not significant.
> >
> >   time git cat-file --batch-all-objects \
> >     --batch-check="%(objectname) %(objecttype)" |
> >     awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch >/dev/null
> >   17.10s user 0.27s system 102% cpu 16.987 total
> >
> >   time git cat-file --batch-all-objects --batch --type-filter=blob >/dev/null
> >   16.74s user 0.19s system 95% cpu 17.655 total
> >
> > At first, I thought the processes that provide all blob oids by using
> > git rev-list or git cat-file --batch-all-objects --batch-check might waste
> > CPU, I/O, and memory resources, because they need to read a large number
> > of objects which are then read again by git cat-file --batch. However, it
> > seems that this is not actually the bottleneck in performance.
>
> Yeah, I think most of your time there is spent on the --batch command
> itself, which is just pushing through a lot of bytes. You might also try
> with "--unordered". The default ordering for --batch-all-objects is in
> sha1 order, which has pretty bad locality characteristics for delta
> caching. Using --unordered goes in pack order, which should be optimal.
>
> E.g., in git.git, running:
>
>   time \
>     git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
>     perl -lne 'print $1 if /^blob (.*)/' |
>     git cat-file --batch >/dev/null
>
> takes:
>
>   real    0m29.961s
>   user    0m29.128s
>   sys     0m1.461s
>
> Adding "--unordered" to the initial cat-file gives:
>
>   real    0m1.970s
>   user    0m2.170s
>   sys     0m0.126s
>
> So reducing the size of the actual --batch printing may make the
> relative cost of using multiple processes much higher (I didn't apply
> your --type-filter patches to test myself).

You are right. Adding the --unordered option keeps the poor access
locality of sha1 ordering from skewing the test results:

  time git cat-file --unordered --batch-all-objects \
    --batch-check="%(objectname) %(objecttype)" | \
    awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch >/dev/null
  4.17s user 0.23s system 109% cpu 4.025 total

  time git cat-file --unordered --batch-all-objects --batch --type-filter=blob >/dev/null
  3.84s user 0.17s system 97% cpu 4.099 total

Here, too, the difference is not significant. After all, the truly
time-consuming part is reading the entire contents of each blob; by
comparison, git cat-file --batch-check only reads the first few bytes
of each object.

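To make "the first few bytes" concrete: a loose object is zlib-deflated
data that begins with a "<type> <size>\0" header, and a type/size query
can stop right after that header. A minimal sketch of peeking at it,
assuming the object still exists in loose (unpacked) form; the path
HEAD:README is only an illustration:

  oid=$(git rev-parse HEAD:README)         # any blob oid will do
  path=.git/objects/${oid:0:2}/${oid:2}    # where a loose object lives
  test -f "$path" &&                       # only present if still loose
  python3 -c '
  import sys, zlib
  raw = zlib.decompress(open(sys.argv[1], "rb").read())
  # the header is everything up to the first NUL, e.g. "blob 1234"
  print(raw.split(b"\0", 1)[0].decode())
  ' "$path"
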
> In general, I do think having a processing pipeline like this is OK, as
> it's pretty flexible. But especially for smaller queries (even ones that
> don't ask for the whole object contents), the per-object lookup costs
> can start to dominate (especially in a repository that hasn't been
> recently packed). Right now, even your "--batch --type-filter" example
> is probably making at least two lookups per object, because we don't
> have a way to open a "handle" to an object to check its type, and then
> extract the contents conditionally. And of course with multiple
> processes, we're naturally doing a separate lookup in each one.

Yes, the type of an object is encapsulated in the header of its loose
object file, or in the object entry header of a pack file, so we have to
read that data just to learn the object's type. This has been a
lingering question of mine: why does git put the type/size in the object
data itself instead of storing them as some kind of metadata elsewhere?

> So a nice thing about being able to do the filtering in one process is
> that we could _eventually_ do it all with one object lookup. But I'd
> probably wait on adding something like --type-filter until we have an
> internal single-lookup API, and then we could time it to see how much
> speedup we can get.

I am highly skeptical of this "internal single-lookup API". Do we really
need an extra metadata table recording all objects, something like
{oid: type, size}?

> -Peff

ZheNing Hu
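
As a footnote to the {oid: type, size} question: existing plumbing can
already materialize such a table in a single pass, which gives a rough
sense of how large it would be and what it would cost to build (a
sketch; the file name metadata.txt is only illustrative):

  # one line per object: oid, type, size, in cheap pack order
  git cat-file --batch-all-objects --unordered \
      --batch-check='%(objectname) %(objecttype) %(objectsize)' >metadata.txt

Whether keeping such an index up to date would pay for itself is exactly
the kind of question the timing experiments above could answer.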