On Tue, Dec 21, 2021 at 11:51 AM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>
> On 12/20/2021 10:01 AM, Jeff Hostetler via GitGitGadget wrote:
> >
> > 3. Rationale
> >
> > Timers and counters are an alternative to the existing "region" and
> > "data" events. The latter are intended to trace the major flow (or
> > phases) of the program and possibly capture the amount of work
> > performed within a loop, for example. The former are offered as a way
> > to measure activity that is not localized, such as the time spent in
> > zlib or lstat, which may be called from many different parts of the
> > program.
>
> I'm excited for these API features.

Me too! This would have been very useful on some experiments I had to
run in the past. Thanks for working on it, Jeff :)

> I also like your attention to thread contexts. I think these timers
> would be very interesting to use in parallel checkout. CC'ing Matheus
> for his thoughts on where he would want timer summaries for that
> feature.

For parallel checkout, I think it would be interesting to have timer
summaries for open/close, fstat/lstat, write, and
inflation/delta-reconstruction. Perhaps pkt-line routines too, so that
we can see how much time we spend in inter-process communication.

It would be nice to have timer information for disk reading as well
(more on that below), but I don't think it is possible since we read
the objects through mmap() and thus, we cannot easily isolate the
actual reading time from the decompression time :(

> I would probably want the per-thread summary to know if we
> are blocked on one really long thread while the others finish quickly.

That would be interesting. Parallel checkout actually uses
subprocesses, but I can see the per-thread summary being useful in
grep, for example. (Nevertheless, the use case you mentioned for the
timers -- evaluating the work balance on parallel checkout -- seems
very interesting.)

> Within that: what are the things causing us to be slow? Is it zlib?
> Is it lstat()?
In my tests, the checkout bottleneck depended heavily on the underlying
storage type. On HDDs, the bottleneck was object reading (i.e. page
faults on mmap()-ed files), at about 70% to 80% of the checkout
runtime. On SSDs, reading was much faster, so CPU (i.e. inflation)
became the bottleneck, at about 50% of the runtime. (Inflation only
lost to reading when checking out from *many* loose objects.) Finally,
on NFS, file creation with open(O_CREAT | O_EXCL) and fstat() (which
makes the NFS client flush previously cached writes to the server) were
the bottlenecks, at about 40% of the total runtime each.

These numbers come from a (sequential) `git checkout .` execution on an
empty working tree of the Linux kernel (v5.12), and they were gathered
using eBPF-based profilers. For other operations, especially ones that
require many file removals or more laborious tree merging in
unpack_trees(), I suspect the bottlenecks may change.

If anyone is interested in seeing the flamegraphs and other plots for
these profiling numbers, I have them at:
https://matheustavares.gitlab.io/annexes/parallel-checkout/profiling

And there is a bit more context at:
https://matheustavares.gitlab.io/posts/parallel-checkout