On Wed, Nov 21, 2018 at 6:36 PM Tim Murray <timmurray@xxxxxxxxxx> wrote:
>
> On Wed, Nov 21, 2018 at 5:29 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Wed, 21 Nov 2018 17:08:08 -0800 Daniel Colascione <dancol@xxxxxxxxxx> wrote:
> >
> > > Have you done much
> > > retrospective long trace analysis?
> >
> > No. Have you?
> >
> > Of course you have, which is why I and others are dependent upon you to
> > explain why this change is worth adding to Linux. If this thing solves
> > a problem which we expect will not occur for anyone between now and the
> > heat death of the universe then this impacts our decisions.
>
> I use ftrace the most on Android, so let me take a shot.
>
> In addition to the normal "debug a slow thing" use cases for ftrace,
> Android has started exploring two other ways of using ftrace:
>
> 1. "Flight recorder" mode: trigger ftrace for some amount of time when
> a particular anomaly is detected to make debugging those cases easier.
>
> 2. Long traces: let a trace stream to disk for hours or days, then
> postprocess it to get some deeper insights about system behavior.
> We've used this very successfully to debug and optimize power
> consumption.
>
> Knowing the initial state of the system is a pain for both of these
> cases. For example, one of the things I'd like to know in some of my
> current use cases for long traces is the current oom_score_adj of
> every process in the system, but similar to PID reuse, that can change
> very quickly due to userspace behavior. There's also a race between
> reading that value in userspace and writing it to trace_marker:
>
> 1. Userspace daemon X reads oom_score_adj for a process Y.
> 2. Process Y gets a new oom_score_adj value, triggering the
> oom/oom_score_adj_update tracepoint.
> 3. Daemon X writes the old oom_score_adj value to trace_marker.
>
> As I was writing this, though, I realized that the race doesn't matter
> so long as our tools follow the same basic practice (for PID reuse,
> oom_score_adj, or anything else we need):
>
> 1. Daemon enables all requested tracepoints and resets the trace clock.
> 2. Daemon enables tracing.
> 3. Daemon dumps initial state for any tracepoint we care about.
> 4. When postprocessing, a tool must consider the initial state of a
> value (eg, oom_score_adj of pid X) to be either the initial state as
> reported by the daemon or the first ftrace event reporting that value.
> If there is an ftrace event in the trace before the report from the
> daemon, the report from the daemon should be ignored.

I was imagining periodic scans being the moral equivalent of I-frames
in an MPEG stream: known-good synchronization points that bound the
propagation of errors. To prevent loss of events in the *middle* of
one of these samples from affecting the validity of the sample, I
wanted each one to be atomic and self-contained. I was initially
imagining a single proc file containing a thread ID list and a TGID
mapping for each thread (as well as assorted other information),
generated atomically under tasklist_lock and stamped with the trace
timestamp. The /proc/pid_gen thing was an effort to do most of that in
userspace. But maybe that's not really that important.

Another thing to consider: there are some properties which we can
enumerate *only* through /proc and that we can't yet receive in a
streaming, event-driven fashion. We read those by periodic sampling,
but the sampling isn't atomic.
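To make Tim's rule #4 concrete (it's also the resolution rule an
I-frame-style scan would need), here's roughly the step I'd expect
postprocessing tools to implement. This is just an illustrative Python
sketch, not our actual tooling, and all the names are made up:

from typing import NamedTuple, Optional

class Report(NamedTuple):
    timestamp: float   # trace-clock time at which the value hit the trace
    pid: int
    value: int         # e.g., oom_score_adj

def initial_state(daemon_dump: Report,
                  first_event: Optional[Report]) -> Report:
    # Rule #4: if any ftrace event reported this value before the
    # daemon's initial-state dump landed in the trace, the dump may
    # describe a value that was already stale when the daemon wrote it
    # to trace_marker, so the dump must be ignored in favor of the event.
    if first_event is not None and first_event.timestamp < daemon_dump.timestamp:
        return first_event
    return daemon_dump

The point of the rule is that it makes the daemon-side race benign: the
daemon can lose the race in step 3 of Tim's example and the analysis
still converges on the right value.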
And as Tim notes, even properties that we ordinarily receive in
streaming form via trace events must be sampled at trace start. But
trace start isn't the only case where it matters.

Suppose at T+50 we have a TID 601 (which we'll call TID 601_1) with a
steady-state resource-consumption counter $COUNTER of 100MB. Now
suppose we start sampling at T+50. At T+55, TID 601_1 dies. At T+60,
due to unlikely but possible TID reuse, we see a new task with TID 601,
which we'll call TID 601_2. At T+65, the sample reads TID 601_2's
$COUNTER as 1GB. At T+70, we finish the sample.

From the perspective of trace analysis, all the sample tells us is "at
some point between T+50 and T+70, TID 601's $COUNTER was 1GB". We also
know when TID 601_1 died and when TID 601_2 started. We don't know
whether the sample's "TID 601" refers to TID 601_1 or TID 601_2. So
what do we do?

1) We could attribute the 1GB value of $COUNTER to TID 601_1, which is
what would happen by default if we deemed the sample to be a virtual
event with timestamp T+50, the time we started the sampling process.
But that would be incorrect here. In analysis, we'd see this
attribution as a mysterious spike in TID 601_1's resource consumption
immediately before its death. (The spike would be to a level
suspiciously similar to TID 601_2's next sample.) Imagine spending
hours trying to track down a mystery death-rattle resource spike
(something that could plausibly happen in the real world) that ended up
being a tracing artifact.

2) We could attribute the 1GB read of $COUNTER to TID 601_2. In the
timeline I gave above, this attribution would be correct. But what if
TID 601_1's $COUNTER really _had_ spiked to 1GB immediately before its
death? (What if its death had been caused by that spike?) Then, under
the "tag last" approach, we'd falsely tag TID 601_2 with TID 601_1's
excessive resource use, and this false attribution would last until the
next sample of TID 601_2, which might be quite far in the future. And
we might find ourselves wondering why TID 601_1 died, since (if we
blame 601_2 for the new $COUNTER value) TID 601_1 would have done
nothing wrong. And in the meantime, we'd see 1GB of $COUNTER
consumption that would appear to come out of nowhere.

3) We could discard the ambiguous sample. But in this case, we'd leave
TID 601_2's $COUNTER value completely unknown until the next sample,
which could be quite far in the future. And if TID 601_2's $COUNTER
value really were 1GB, we might leave significant resource consumption
unaccounted-for until that next sample.

With my pid_gen patch, we could have used strategy #3 without having to
wait for the system to get around to sampling TID 601_2 again on its
own, because we'd have been able to detect the ambiguity and collect a
new, hopefully-unambiguous sample on the spot. Yes, we wait a sampling
period between samples anyway, and it's possible for a process to
start, run, and die without ever having been sampled, with or without
pid_gen, but at least we wouldn't have missed a perfectly good sampling
opportunity due to PID ambiguity.

You might say, "dancol@, you sound excessively fussy here. That's never
going to matter under any realistic workload." And maybe that'd be
right. But like I said above, I think it's important to have
deterministic bounds on uncertainty even in the case of pathological
behavior, and I'd rather guarantee correctness structurally than wonder
whether unlikely events are one-in-a-trace unlikely or
heat-death-of-the-universe unlikely.
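For what it's worth, in postprocessing code the three options come down
to something like this (again an illustrative Python sketch with
made-up names, simplified to this one TID-reuse scenario):

from enum import Enum

class Strategy(Enum):
    TAG_FIRST = 1   # option 1: attribute to the task alive at sample start
    TAG_LAST = 2    # option 2: attribute to the task alive at sample end
    DISCARD = 3     # option 3: throw the ambiguous sample away

def attribute(sample_start, sample_end, death, rebirth, strategy):
    # The sample is ambiguous only if the TID died *and* was reused
    # inside the sampling window.
    ambiguous = sample_start <= death <= rebirth <= sample_end
    if not ambiguous:
        return "601_1" if death > sample_end else "601_2"
    if strategy is Strategy.TAG_FIRST:
        return "601_1"   # risks a phantom death-rattle spike on 601_1
    if strategy is Strategy.TAG_LAST:
        return "601_2"   # risks blaming 601_2 for 601_1's real spike
    return None          # DISCARD: value unknown until the next sample

# For the timeline above: attribute(50, 70, 55, 60, Strategy.TAG_LAST)
# returns "601_2", which happens to be correct --- but only by luck.

None of the three branches is correct by construction; that's the
structural problem, not any particular choice of strategy.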
To improve on the sampling model, we could have the kernel stamp
$COUNTER reads with the current trace timestamp. Then, instead of
telling us "TID 601's $COUNTER was 1GB between T+50 and T+70", the
sample would tell us "TID 601's $COUNTER was 1GB at exactly T+65". (The
sampling program itself might get around to logging this bit of
information at T+70 or T+700, but the embedded timestamp would be
accurate.) We could then look up the inferred process table at T+65 and
attribute that 1GB to TID 601_2. (There's a sketch of that lookup at
the end of this mail.)

Ideally, we wouldn't bother with this sort of sampling at all, instead
relying on streaming $COUNTER updates as events. But we're not quite
there yet. I have a larger proposal in the works for that one. Under
that proposal, we wouldn't log every change to $COUNTER, but we would
log $COUNTER at process death (as well as at other times --- basically,
after large changes and after a maximum time after any change), so
trace analysis would still be able to correctly determine TID 601_1's
time-of-death resource use.

In another approach, the trace system *itself* would generate the
"I-frames" by automatically emitting synthetic trace events at trace
startup to reflect pre-trace system state. (For example, if you enabled
sched_switch and started tracing, the ftrace ring buffer would begin
life with one synthetic sched_switch per CPU describing each CPU's
running task at the instant of trace startup, maybe collected with
on_each_cpu.)

But anyway, I didn't expect the pid_gen change to be controversial.
Given that we can eventually do something different altogether and that
it probably *is* excessively fussy, consider the patch withdrawn.
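P.S. Here's the sketch of timestamp-based attribution promised above;
illustrative Python with made-up names, not real tooling. With an
accurate per-read timestamp, attribution reduces to an interval lookup
in the per-TID lifetimes we infer from fork/exit events in the trace:

import bisect

def task_at(tid_lifetimes, ts):
    # tid_lifetimes: sorted list of (start_ts, end_ts, unique_task_id)
    # intervals for one TID, inferred from fork/exit events.
    starts = [start for start, _end, _task in tid_lifetimes]
    i = bisect.bisect_right(starts, ts) - 1
    if i >= 0:
        start, end, task = tid_lifetimes[i]
        if start <= ts <= end:
            return task
    return None   # TID wasn't alive at ts

# With a kernel-stamped read at T+65, the example becomes unambiguous:
lifetimes_601 = [(0, 55, "601_1"), (60, float("inf"), "601_2")]
assert task_at(lifetimes_601, 65) == "601_2"

No discarding, no re-sampling, no luck required.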