On Fri, Jun 17, 2022 at 11:32 AM Marco Elver <elver@xxxxxxxxxx> wrote:
>
> > The disadvantage:
> >
> > - If the affected object was allocated/freed long before the bug happened
> >   and the stack trace events were purged from the stack ring, the report
> >   will have no stack traces.
>
> Do you have statistics on how likely this is? Maybe through
> identifying what the average lifetime of an entry in the stack ring is?
>
> How bad is this for very long lived objects (e.g. pagecache)?

I ran a test on Pixel 6: the stack ring of size (32 << 10) gets fully
rewritten every ~2.7 seconds during boot. Any buggy object that is
allocated/freed and then accessed after a longer time span than that will
not have stack traces.

This can be dealt with by increasing the stack ring size, but this comes
down to how much memory one is willing to allocate for the stack ring. If
we decide to use sampling (saving stack traces only for every Nth
object), that will affect this too.

But any object that is allocated once during boot will be purged out of
the stack ring sooner or later. One could argue that such objects are
usually allocated at a single known place, so having a stack trace won't
considerably improve the report.

I would say that we need to deploy some solution, study the reports, and
adjust the implementation based on that.

> > Discussion
> > ==========
> >
> > The current implementation of the stack ring uses a single ring buffer for
> > the whole kernel. This might lead to contention due to atomic accesses to
> > the ring buffer index on multicore systems.
> >
> > It is unclear to me whether the performance impact from this contention
> > is significant compared to the slowdown introduced by collecting stack
> > traces.
>
> I agree, but once stack trace collection becomes faster (per your future
> plans below), this might need to be revisited.

Ack.

Thanks!
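
To make the numbers above concrete, here is a rough userspace sketch of a
single global stack ring with one atomic index, plus the lifetime
arithmetic. This is not the kernel implementation: the entry layout, the
names (ring_record, ring_lookup, ring_lifetime_seconds) and the stack_id
handle are all hypothetical and only serve as an illustration. A full
rewrite of 32 << 10 = 32768 entries every ~2.7 seconds implies roughly
12,000 recorded alloc/free events per second during boot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_RING_SIZE (32u << 10) /* 32768 entries, as in the Pixel 6 test */

/* Hypothetical entry layout; the real one may differ. */
struct ring_entry {
	uintptr_t object;  /* address of the tracked object */
	uint32_t stack_id; /* assumed handle of a saved stack trace */
	uint64_t seq;      /* write sequence number + 1; 0 means "empty" */
};

/* One ring for the whole program, mirroring the single-ring design. */
static struct {
	atomic_uint_fast64_t pos; /* shared index: the contention point */
	struct ring_entry entries[STACK_RING_SIZE];
} ring;

/* Record an alloc/free event; the oldest entry is silently overwritten. */
void ring_record(uintptr_t object, uint32_t stack_id)
{
	/* Every CPU increments the same counter, hence the contention. */
	uint64_t seq = atomic_fetch_add(&ring.pos, 1);
	struct ring_entry *e = &ring.entries[seq % STACK_RING_SIZE];

	/* Sketch only: the entry itself is not protected against readers. */
	e->object = object;
	e->stack_id = stack_id;
	e->seq = seq + 1;
}

/* Return the newest stack_id recorded for object, or -1 if it was purged. */
int64_t ring_lookup(uintptr_t object)
{
	int64_t found = -1;
	uint64_t best = 0;

	for (size_t i = 0; i < STACK_RING_SIZE; i++) {
		const struct ring_entry *e = &ring.entries[i];

		if (e->seq > best && e->object == object) {
			best = e->seq;
			found = e->stack_id;
		}
	}
	return found;
}

/* Time until an entry is overwritten, for a given event rate. */
double ring_lifetime_seconds(size_t entries, double events_per_second)
{
	assert(events_per_second > 0.0);
	return (double)entries / events_per_second;
}
```

The lifetime grows linearly with the ring size, so doubling the ring to
(64 << 10) entries would roughly double the window at the same event rate.
Likewise, sampling every Nth object would stretch it by about a factor of N,
at the cost of having no stack traces for the objects that are skipped.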