On Mon, Jul 15, 2019 at 3:01 PM Paul E. McKenney <paulmck@xxxxxxxxxxxxx> wrote:
>
> On Sun, Jul 14, 2019 at 08:10:27PM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 14, 2019 at 12:29:51PM -0700, Paul E. McKenney wrote:
> > > On Sun, Jul 14, 2019 at 03:05:22PM -0400, Theodore Ts'o wrote:
> > > > On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> > > > > But short term I don't see any other solution than to stop testing
> > > > > sched_setattr because it does not check arguments enough to prevent
> > > > > system misbehavior. Which is a pity because syzkaller has found some
> > > > > bad misconfigurations that were oversights on the checking side.
> > > > > Any other suggestions?
> > > >
> > > > Or maybe syzkaller can put its own limitations on what parameters are
> > > > sent to sched_setattr? In practice, there are any number of ways a
> > > > root user can shoot themselves in the foot when using sched_setattr or
> > > > sched_setaffinity, for that matter. I imagine there must be some such
> > > > constraints already --- or else syzkaller might have set a kernel
> > > > thread to run with priority SCHED_BATCH, with similar catastrophic
> > > > effects --- or do similar configurations to make system threads
> > > > completely unschedulable.
> > > >
> > > > Real-time administrators who know what they are doing --- and who know
> > > > that their real-time threads are well behaved --- will always want to
> > > > be able to do things that will be catastrophic if the real-time thread
> > > > is *not* well behaved. I don't think it is possible to add safety
> > > > checks which would allow the kernel to automatically detect and reject
> > > > unsafe configurations.
> > > >
> > > > An apt analogy might be civilian versus military aircraft. Most
> > > > airplanes are designed to be "inherently stable"; that way, modulo
> > > > buggy/insane control systems like on the 737 Max, the airplane will
> > > > automatically return to straight and level flight. On the other hand,
> > > > some military planes (for example, the F-16, F-22, F-35, the
> > > > Eurofighter, etc.) are sometimes designed to be unstable, since that
> > > > way they can be more maneuverable.
> > > >
> > > > There are use cases for real-time Linux where this flexibility/power
> > > > vs. stability tradeoff is going to argue for giving root the
> > > > flexibility to crash the system. Some of these systems might
> > > > literally involve using real-time Linux in military applications,
> > > > something for which Paul and I have had some experience. :-)
> > > >
> > > > Speaking of sched_setaffinity, one thing which we can do is have
> > > > syzkaller move all of the system threads so they run on the "system
> > > > CPUs", and then move the syzkaller processes which are testing the
> > > > kernel onto the "system under test CPUs". Then regardless of
> > > > what priority the syzkaller test programs try to run themselves at,
> > > > they can't crash the system.
> > > >
> > > > Some real-time systems do actually run this way, and it's a
> > > > recommended configuration which is much safer than letting the
> > > > real-time threads take over the whole system:
> > > >
> > > > http://linuxrealtime.org/index.php/Improving_the_Real-Time_Properties#Isolating_the_Application
> > >
> > > Good point! We might still have issues with some per-CPU kthreads,
> > > but perhaps use of nohz_full would help at least reduce these sorts
> > > of problems. (There could still be issues on CPUs with more than
> > > one runnable thread.)
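
For concreteness, here is a minimal user-space sketch of the
partitioning Ted describes, assuming a machine where CPUs 0-1 are
reserved as the "system" CPUs and CPUs 2-3 are the CPUs under test;
the CPU numbers and the helper name are illustrative, not from this
thread:

/* Sketch only: pin the calling (test) process to the "CPUs under
 * test", leaving CPUs 0-1 for system threads.  CPU numbers are
 * illustrative; a real setup would also push kthreads and IRQs off
 * the test CPUs (isolcpus=, cpusets, nohz_full) where possible. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_self_to_test_cpus(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(2, &set);	/* CPUs under test: 2 and 3 */
	CPU_SET(3, &set);

	if (sched_setaffinity(0, sizeof(set), &set)) {	/* 0 == self */
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}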
> >
> > I looked at testing limitations in a bit more detail from an RCU
> > viewpoint, and came up with the following rough rule of thumb (which of
> > course might or might not survive actual testing experience, but should at
> > least be a good place to start). I believe that the sched_setattr()
> > testing rule should be that the SCHED_DEADLINE cycle be no more than
> > two-thirds of the RCU CPU stall warning timeout, which defaults to 21
> > seconds in mainline and 60 seconds in many distro kernels.
> >
> > That is, the SCHED_DEADLINE cycle should never exceed 14 seconds when
> > testing mainline on the one hand or 40 seconds when testing enterprise
> > distros on the other.
> >
> > This assumes quite a bit, though:
> >
> > o	The system has ample memory to spare, and isn't running a
> >	callback-hungry workload. For example, if you "only" have 100MB
> >	of spare memory and you are also repeatedly and concurrently
> >	expanding (say) large source trees from tarballs and then deleting
> >	those source trees, the system might OOM. The reason OOM might
> >	happen is that each close() of a file generates an RCU callback,
> >	and 40 seconds worth of waiting-for-a-grace-period structures
> >	takes up a surprisingly large amount of memory.
> >
> >	So please be careful when combining tests. ;-)
> >
> > o	There are no aggressive real-time workloads on the system.
> >	The reason for this is that RCU is going to start sending IPIs
> >	halfway to the RCU CPU stall timeout, and, in certain situations
> >	on CONFIG_NO_HZ_FULL kernels, much earlier. (These situations
> >	constitute abuse of CONFIG_NO_HZ_FULL, but then again carefully
> >	calibrated abuse is what stress testing is all about.)
> >
> > o	The various RCU kthreads will get a chance to run at least once
> >	during the SCHED_DEADLINE cycle. If, in real life, they only
> >	get a chance to run once per two SCHED_DEADLINE cycles, then of
> >	course the 14 seconds becomes 7 and the 40 seconds becomes 20.
>
> And there are configurations and workloads that might require division
> by three, so that (assuming one chance to run per cycle) the 14 seconds
> becomes about 5 and the 40 seconds becomes about 15.
>
> > o	The current RCU CPU stall warning defaults remain in
> >	place. These are set by the CONFIG_RCU_CPU_STALL_TIMEOUT
> >	Kconfig parameter, which may in turn be overridden by the
> >	rcupdate.rcu_cpu_stall_timeout kernel boot parameter.
> >
> > o	The current SCHED_DEADLINE default for providing spare cycles
> >	for other uses remains in place.
> >
> > o	Other kthreads might have other constraints, but given that you
> >	were seeing RCU CPU stall warnings instead of other failures,
> >	the needs of RCU's kthreads seem to be a good place to start.
> >
> > Again, the candidate rough rule of thumb is that the SCHED_DEADLINE
> > cycle be no more than 14 seconds when testing mainline kernels on the one
> > hand and 40 seconds when testing enterprise distro kernels on the other.
> >
> > Dmitry, does that help?
>
> I checked with the people running the Linux Plumbers Conference Scheduler
> Microconference, and they said that they would welcome a proposal on
> this topic, which I have submitted (please see below). Would anyone
> like to join as co-conspirator?
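
To make the rule of thumb concrete, here is a sketch of a
SCHED_DEADLINE request that stays inside the 14-second mainline
bound. It assumes <sys/syscall.h> defines SYS_sched_setattr (glibc
provides no wrapper) and hand-declares struct sched_attr per the
sched_setattr(2) man page; the specific numbers are only an example:

/* Sketch only: ask for SCHED_DEADLINE with a 10s period/deadline and
 * 500ms runtime -- comfortably inside the 14s bound suggested above
 * for mainline kernels, and leaving plenty of spare cycles.  Needs
 * CAP_SYS_NICE (or root).  Numbers are illustrative. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6		/* include/uapi/linux/sched.h */
#endif

struct sched_attr {			/* layout per sched_setattr(2) */
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* all three in nanoseconds */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_runtime	= 500ULL * 1000 * 1000,		/* 500ms */
		.sched_deadline	= 10ULL * 1000 * 1000 * 1000,	/* 10s */
		.sched_period	= 10ULL * 1000 * 1000 * 1000,	/* 10s */
	};

	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {	/* 0 == self */
		perror("sched_setattr");
		return 1;
	}
	return 0;
}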
>
>							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> Title: Making SCHED_DEADLINE safe for kernel kthreads
>
> Abstract:
>
> Dmitry Vyukov's testing work identified some (ab)uses of sched_setattr()
> that can result in SCHED_DEADLINE tasks starving RCU's kthreads for
> extended time periods, not milliseconds, not seconds, not minutes, not
> even hours, but days. Given that RCU CPU stall warnings are issued
> whenever an RCU grace period fails to complete within a few tens of
> seconds, the system did not suffer silently. Although one could argue
> that people should avoid abusing sched_setattr(), people are human and
> humans make mistakes. Responding to simple mistakes with RCU CPU stall
> warnings is all well and good, but a more severe case could OOM the
> system, which is a particularly unhelpful error message.
>
> It would be better if the system were capable of operating reasonably
> despite such abuse. Several approaches have been suggested.
>
> First, sched_setattr() could recognize parameter settings that put
> kthreads at risk and refuse to honor those settings. This approach
> of course requires that we identify precisely what combinations of
> sched_setattr() parameter settings are risky, especially given that there
> are likely to be parameter settings that are both risky and highly useful.
>
> Second, in theory, RCU could detect this situation and take the "dueling
> banjos" approach of increasing its priority as needed to get the CPU time
> that its kthreads need to operate correctly. However, the required amount
> of CPU time can vary greatly depending on the workload. Furthermore,
> non-RCU kthreads also need some amount of CPU time, and replicating
> "dueling banjos" across all such Linux-kernel subsystems seems both
> wasteful and error-prone. Finally, experience has shown that setting
> RCU's kthreads to real-time priorities significantly harms performance
> by increasing context-switch rates.
>
> Third, stress testing could be limited to non-risky regimes, such that
> kthreads get CPU time every 5-40 seconds, depending on configuration
> and experience. People needing risky parameter settings could then test
> the settings that they actually need, and also take responsibility for
> ensuring that kthreads get the CPU time that they need. (This of course
> includes per-CPU kthreads!)
>
> Fourth, bandwidth throttling could treat tasks in other scheduling classes
> as an aggregate group having a reasonable aggregate deadline and CPU
> budget. This has the advantage of allowing "abusive" testing to proceed,
> which allows people requiring risky parameter settings to rely on this
> testing. Additionally, it avoids complex progress checking and priority
> setting on the part of many kthreads throughout the system. However,
> if this were an easy choice, the SCHED_DEADLINE developers would likely
> have selected it. For example, it is necessary to determine what might
> be a "reasonable" aggregate deadline and CPU budget. Reserving 5%
> seems quite generous, and RCU's grace-period kthread would optimally
> like a deadline in the milliseconds, but would do reasonably well with
> many tens of milliseconds, and absolutely needs a few seconds. However,
> for CONFIG_RCU_NOCB_CPU=y, RCU's callback-offload kthreads might
> well need a full CPU each! (This happens when the CPU being offloaded
> generates a high rate of callbacks.)
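
As a rough illustration of the budgeting involved (this is only the
flavor of an admission check, not the kernel's actual code): total
SCHED_DEADLINE utilization is capped, with 5% left over by default
for everything else, which shows why a callback-offload kthread that
wants a full CPU cannot fit in the leftover:

/* Hypothetical illustration only -- not kernel code.  SCHED_DEADLINE
 * admission control caps total deadline-task utilization; with the
 * default 95% cap, 5% of CPU time is left for everything else,
 * including RCU's kthreads. */
#include <stdbool.h>
#include <stdint.h>

#define DL_CAP_PCT 95	/* default cap: 95% for deadline tasks */

static bool dl_admission_ok(uint64_t runtime_ns, uint64_t period_ns,
			    unsigned int current_util_pct)
{
	unsigned int new_util_pct = runtime_ns * 100 / period_ns;

	return current_util_pct + new_util_pct <= DL_CAP_PCT;
}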
>
> The goal of this proposal is therefore to generate face-to-face
> discussion, hopefully resulting in a good and sufficient solution to
> this problem.

I would be happy to attend if this won't conflict with important
things in the testing and fuzzing MC.

If we restrict the sched_setattr arguments, what would be the criteria
for 100% safe arguments? Moving the check from the kernel to user
space does not relieve us from explicitly stating the condition in a
black-and-white way. Should all of
sched_runtime/sched_deadline/sched_period be no larger than 1 second?

The problem is that syzkaller does not allow 100% reliable enforcement
for indirect arguments in memory. E.g. input arguments can overlap,
input and output can overlap, weird races can affect what's actually
passed to the kernel, the memory can be mapped from a weird device,
etc. That behavior is also useful, as it can discover TOCTOU bugs,
deadlocks, etc. We could try to wrap sched_setattr and apply some
additional restrictions, giving up on TOCTOU, device-mapped memory,
etc.

I am also thinking about dropping CAP_SYS_NICE; that should still
allow some configurations, but no inherently unsafe ones.
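
One possible shape for such a wrapper, as a sketch only: it reuses the
struct sched_attr layout from the sketch earlier in this thread, and
the 1-second cap is just the illustrative criterion mentioned above,
not a vetted "safe" bound. Copying the attr into private memory before
checking is exactly what gives up the TOCTOU and overlap coverage:

/* Sketch of a fuzzer-side wrapper: copy the attr into private memory
 * (deliberately giving up TOCTOU/overlap coverage), clamp the fields
 * that can starve kthreads, then issue the real syscall. */
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define CAP_NS	(1ULL * 1000 * 1000 * 1000)	/* illustrative 1s cap */

static long checked_sched_setattr(pid_t pid,
				  const struct sched_attr *uattr,
				  unsigned int flags)
{
	struct sched_attr attr;

	memcpy(&attr, uattr, sizeof(attr));	/* private copy: no TOCTOU */
	if (attr.sched_runtime > CAP_NS ||
	    attr.sched_deadline > CAP_NS ||
	    attr.sched_period > CAP_NS) {
		errno = EINVAL;
		return -1;
	}
	return syscall(SYS_sched_setattr, pid, &attr, flags);
}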