On Mon, 2020-10-05 at 01:14 +0200, Frederic Weisbecker wrote:
> Speaking of which, I agree with Thomas that it's unnecessary.
>
> > > It's too much code and complexity. We can use the existing trace
> > > events and perform the analysis from userspace to find the source
> > > of the disturbance.
> >
> > The idea behind this is that isolation breaking events are supposed
> > to be known to the applications while applications run normally, and
> > they should not require any analysis or human intervention to be
> > handled.
>
> Sure but you can use trace events for that. Just trace interrupts,
> workqueues, timers, syscalls, exceptions and scheduler events and you
> get all the local disturbance. You might want to tune a few filters
> but that's pretty much it.

And keep all tracing enabled all the time, just to be able to figure out that a disturbance happened at all? Or do you mean that we can use the kernel entry mechanism to reliably determine that an isolation-breaking event happened (so the isolation-breaking procedure can be triggered as early as possible), yet avoid trying to determine why exactly it happened, and use tracing only if we want to know?

The original patch did the opposite: it triggered the isolation-breaking procedure only once it was known specifically what kind of event happened -- a hardware interrupt, IPI, syscall, page fault, or any other kind of exception, possibly something architecture-specific. This, of course, always had a potential coverage problem -- if handling of something is missing, that isolation breaking is not handled at all, and there is no obvious way of verifying that we covered everything. It also made the patch large and somewhat ugly.

When I added a mechanism for low-level isolation-breaking handling on kernel entry, it partially solved the completeness problem. Only partially, because I have not yet added handling of an "unknown cause" before returning to userspace, though that would be the logical thing to do. Then, if we entered the kernel from isolation, did something, and are returning to userspace still not knowing what kind of isolation-breaking event happened, we can still trigger isolation breaking.

Did I get it right that you mean we can remove all specific handling of isolation-breaking causes, except for the syscall that exits isolation, and report isolation breaking instead of normally returning to userspace? Then isolation breaking would be handled reliably without knowing the cause, and we could leave determining the cause to the tracing mechanism (if enabled)? This does make sense.

However, to me it looks somewhat strange, because I regard isolation breaking as a kind of runtime error that userspace software is supposed to get some basic information about -- like signals distinguishing between, say, SIGSEGV and SIGPIPE, or write() being able to set errno to ENOSPC or EIO. Userspace then receives basic information about the cause of the exception or error, and can do some meaningful reporting, or decide whether the error should be fatal for the application or handled differently, based on its internal logic. To get those distinctions, the application does not have to be aware of anything internal to the kernel.

Similarly, distinguishing between, say, a page fault, a device interrupt and a timer may be important for logic implemented in userspace, and I think it would be nice to allow userspace to get this information immediately, without being aware of any additional details of the kernel implementation.
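To make the analogy concrete, this is the kind of distinction I mean, using nothing but ordinary write() and errno (the function name and the reactions are just an illustration, nothing here is specific to the isolation patches):

	/* Illustration only: reacting to well-known cause codes from write(). */
	#include <errno.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static int log_write(int fd, const void *buf, size_t len)
	{
		if (write(fd, buf, len) >= 0)
			return 0;

		switch (errno) {
		case ENOSPC:
			/* Common, recoverable: clean up old data and retry later. */
			fprintf(stderr, "log device full: %s\n", strerror(errno));
			return -1;
		case EIO:
			/* Something is seriously wrong with the hardware: raise an alarm. */
			fprintf(stderr, "I/O error on log device: %s\n", strerror(errno));
			return -1;
		default:
			fprintf(stderr, "write failed: %s\n", strerror(errno));
			return -1;
		}
	}

The application reacts to a small set of well-known cause codes without knowing anything about filesystem or driver internals. I would like isolation breaking to be reportable in a similar way, with causes on the level of "page fault", "device interrupt" or "timer".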
The current patch doesn't do this yet, but the intention is to implement reliable isolation breaking by checking on return to userspace, plus make reporting of the causes, if any were found, visible to userspace in some convenient way. The part that determines the cause can be implemented separately from the isolation-breaking mechanism. Then isolation breaking on kernel entry (or potentially some other condition on kernel entry that requires logging the cause) enables reporting; the reporting mechanism, if it exists, fills in the blanks; and once either the cause is known or it's time to return to userspace, notification is done with whatever information is available. For in-depth analysis, if necessary for debugging the kernel, we can have tracing check whether we are in this "suspicious kernel entry" mode and log things that otherwise would not be logged.

> As for the source of the disturbances, if you really need that
> information, you can trace the workqueue and timer queue events and
> just filter those that target your isolated CPUs.

For the purpose of a human debugging the kernel or the application, more information is (usually) better, so the only concern here is that the user becomes responsible for the completeness of the things he is tracing. However, from the application's point of view, or for logging in a production environment, it's usually more important to get the general type of the events, so that it's possible to, say, confirm that nothing "really bad" happened, or to trigger an emergency response if it did.

Say, if the only causes of isolation breaking were an IPI within a few moments of application startup, or a signal from somewhere else when the application was restarted, there is no cause for concern. However, if hardware interrupts arrive at random points in time, something is clearly wrong. And if page faults happen, most likely the application forgot to page in and lock its address space. Again, in my opinion this is not unlike reporting ENOSPC vs. EIO while doing file I/O -- the former (usually) indicates a common problem that may require application-level cleanup, the latter (also usually) means that something is seriously wrong.

> > A process may exit isolation because some leftover delayed work, for
> > example, a timer or a workqueue, is still present on a CPU, or
> > because a page fault or some other exception, normally handled
> > silently, is caused by the task. It is also possible to direct an
> > interrupt to a CPU that is running an isolated task -- currently
> > it's perfectly valid to set interrupt smp affinity to a CPU running
> > isolated task, and then interrupt will cause breaking isolation.
> > While it's probably not the best way of handling interrupts, I would
> > rather not prohibit this explicitly.
>
> Sure, but you can trace all these events with the existing tracing
> interface we have.

Right. However, it would require someone to intentionally trace all of those events, all for the purpose of obtaining the type of what is, in effect, a runtime error. As an embedded systems developer who had to look for signs of unusual bugs on a large number of customers' systems, and had to distinguish them from reports of hardware malfunctions, I would prefer something clearly identifiable in the logs (of the kernel, the application, or anything else) even when no one is specifically investigating a problem.
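Just to illustrate what "intentionally tracing all of those events" involves, a rough sketch of the setup might look like this (the tracefs mount point, the cpumask and the event list are assumptions for the example, not anything taken from the patch; here CPUs 2-3, mask 0xc, are assumed to be the isolated ones):

	/* Sketch: enable existing trace events for the isolated CPUs via tracefs. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static void tracefs_write(const char *path, const char *val)
	{
		int fd = open(path, O_WRONLY);

		if (fd < 0) {
			perror(path);
			return;
		}
		if (write(fd, val, strlen(val)) < 0)
			perror(path);
		close(fd);
	}

	int main(void)
	{
		static const char *events[] = {
			"/sys/kernel/tracing/events/irq/enable",
			"/sys/kernel/tracing/events/timer/enable",
			"/sys/kernel/tracing/events/workqueue/enable",
			"/sys/kernel/tracing/events/raw_syscalls/enable",
			"/sys/kernel/tracing/events/sched/sched_switch/enable",
		};
		unsigned int i;

		/* Only collect events that occur on the isolated CPUs (2-3 here). */
		tracefs_write("/sys/kernel/tracing/tracing_cpumask", "c");

		for (i = 0; i < sizeof(events) / sizeof(events[0]); i++)
			tracefs_write(events[i], "1");

		tracefs_write("/sys/kernel/tracing/tracing_on", "1");

		/* Collected events can then be read from trace_pipe. */
		return 0;
	}

And this has to be configured in advance and left running on every deployed system, with someone deciding up front which events belong on the list -- which is exactly the completeness concern mentioned above.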
When anything suspicious happens, the system is often physically unreachable, and the problem may or may not happen again, so the first report from a running system may be the only thing available. When everything is going well, the same systems more often have hardware failures than report valid software bugs (or, ideally, all reports are from hardware failures), so it's much better to know that if the software does something wrong, it will be possible to identify the problem from the first report, rather than guess.

Sometimes equipment gets firmware updates many years after production, when there are already reports of all kinds of failures due to mechanical or thermal damage, faulty parts, bad repair work, deteriorating flash, etc. Among those there might be something that indicates new bugs made by a new generation of developers (occasionally literally), regressions, etc. In those situations, getting useful information from the error message in the first report can make the difference between quickly identifying the problem and going on a wild goose chase.

--
Alex