Re: [PATCH v2] Add /proc/pid_gen

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Wed, 21 Nov 2018 16:57:41 -0800

On Wed, 21 Nov 2018 16:28:56 -0800 Daniel Colascione <dancol@xxxxxxxxxx> wrote:

> > > The problem here is the possibility of confusion, even if it's rare.
> > > Does the naive approach of just walking /proc and ignoring the
> > > possibility of PID reuse races work most of the time? Sure. But "most
> > > of the time" isn't good enough. It's not that there are tons of sob
> > > stories: it's that without completely robust reporting, we can't rule
> > > out the possibility that weirdness we observe in a given trace is
> > > actually just an artifact from a kinda-sorta-working best-effort trace
> > > collection system instead of a real anomaly in behavior. Tracing,
> > > essentially, gives us deltas for system state, and without an accurate
> > > baseline, collected via some kind of scan on trace startup, it's
> > > impossible to use these deltas to robustly reconstruct total system
> > > state at a given time. And this matters, because errors in
> > > reconstruction (e.g., assigning a thread to the wrong process because
> > > the IDs happen to be reused) can affect processing of the whole trace.
> > > If it's 3am and I'm analyzing the lone trace from a dogfooder
> > > demonstrating a particularly nasty problem, I don't want to find out
> > > that the trace I'm analyzing ended up being useless because the
> > > kernel's trace system is merely best effort. It's very cheap to be
> > > 100% reliable here, so let's be reliable and rule out sources of
> > > error.
> >
> > So we're solving a problem which isn't known to occur, but solving it
> > provides some peace-of-mind?  Sounds thin!
> 
> So you want to reject a cheap fix for a problem that you know occurs
> at some non-zero frequency? There's a big difference between "may or
> may not occur" and "will occur eventually, given enough time, and so
> must be taken into account in analysis". Would you fix a refcount race
> that you knew was possible, but didn't observe? What, exactly, is your
> threshold for accepting a fix that makes tracing more reliable?

Well for a start I'm looking for a complete patch changelog.  One which
permits readers to fully understand the user-visible impact of the
problem.

If it is revealed that this is a theoretical problem which has negligible
end-user impact then sure, it is rational to leave things as they are.
That's what "negligible" means!