Re: [patch 00/13] preempt: Make preempt count unconditional

Daniel Vetter <daniel@xxxxxxxx> · Wed, 16 Sep 2020 23:43:02 +0200

On Wed, Sep 16, 2020 at 10:58 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:

On Wed, Sep 16, 2020 at 10:29:06PM +0200, Daniel Vetter wrote:
On Wed, Sep 16, 2020 at 5:29 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:

On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote:
On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:

OTOH, having a working 'preemptible()' or maybe better named
'can_schedule()' check makes tons of sense to make decisions about
allocation modes or other things.

No. I think that those kinds of decisions about actual behavior are
always simply fundamentally wrong.

Note that this is very different from having warnings about invalid
use. THAT is correct. It may not warn in all configurations, but that
doesn't matter: what matters is that it warns in common enough
configurations that developers will catch it.

So having a warning in "might_sleep()" that doesn't always trigger,
because you have a limited configuration that can't even detect the
situation, that's fine and dandy and intentional.

But having code like

       if (can_schedule())
           .. do something different ..

is fundamentally complete and utter garbage.

It's one thing if you test for "am I in hardware interrupt context".
Those tests aren't great either, but at least they make sense.

But a driver - or some library routine - making a difference based on
some nebulous "can I schedule" is fundamentally and basically WRONG.

If some code changes behavior, it needs to be explicit to the *caller*
of that code.

So this is why GFP_ATOMIC is fine, but "if (!can_schedule())
do_something_atomic()" is pure shite.

And I am not IN THE LEAST interested in trying to help people doing
pure shite. We need to fix them. Like the crypto code is getting
fixed.

Just figured I'll throw my +1 in from reading too many (gpu) drivers.
Code that tries to cleverly adjust its behaviour depending upon the
context it's running in is harder to understand and blows up in more
interesting ways. We still have drm_can_sleep() and it's mostly just
used for debug code, and I've largely ended up just deleting
everything that used it because when you're driver is blowing up the
last thing you want is to realize your debug code and output can't be
relied upon. Or worse, that the only Oops you have is the one in the
debug code, because the real one scrolled away - the original idea
behind drm_can_sleep was to make all the modeset code work
automagically both in normal ioctl/kworker context and in the panic
handlers or kgdb callbacks. Wishful thinking at best.

Also at least for me that extends to everything, e.g. I much prefer
explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for
locks shared with interrupt handlers, since the former two gives me
clear information from which contexts such function can be called.
Other end is the memalloc_no*_save/restore functions, where I recently
made a real big fool of myself because I didn't realize how much that
impacts everything that's run within - suddenly "GFP_KERNEL for small
stuff never fails" is wrong everywhere.

It's all great for debugging and sanity checks (and we run with all
that stuff enabled in our CI), but really semantic changes depending
upon magic context checks freak my out :-)

All fair, but some of us need to write code that must handle being
invoked from a wide variety of contexts.  Now perhaps you like the idea of
call_rcu() for schedulable contexts, call_rcu_nosched() when preemption
is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled,
call_rcu_raw_atomic() from contexts where (for example) raw spinlocks
are held, and so on.  However, from what I can see, most people instead
consistently prefer that the RCU API instead be consolidated.

Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu()
needs to be able to allocate memory occasionally.  It can do that when
invoked from some contexts, but not when invoked from others.  Right now,
in !PREEMPT kernels, it cannot tell, and must either do things to the
memory allocators that some of the MM hate or must unnecessarily invoke
workqueues.  Thomas's patches would allow the code to just allocate in
the common case when these primitives are invoked from contexts where
allocation is permitted.

If we want to restrict access to the can_schedule() or whatever primitive,
fine and good.  We can add a check to checkpatch.pl, for example.  Maybe
we can go back to the old brlock approach of requiring certain people's
review for each addition to the kernel.

But there really are use cases that it would greatly help.

We can deadlock in random fun places if random stuff we're calling
suddenly starts allocating. Sometimes. Maybe once in a blue moon, to
make it extra fun to reproduce. Maybe most driver subsystems are less
brittle, but gpu drivers definitely need to know about the details for
exactly this example. And yes gpu drivers use rcu for freeing
dma_fence structures, and that tends to happen in code that we only
recently figured out should really not allocate memory.

I think minimally you need to throw in an unconditional
fs_reclaim_acquire();fs_reclaim_release(); so that everyone who runs
with full debugging knows what might happen. It's kinda like
might_sleep, but a lot more specific. might_sleep() alone is not
enough, because in the specific code paths I'm thinking of (and
created special lockdep annotations for just recently) sleeping is
allowed, but any memory allocations with GFP_RECLAIM set are no-go.

Completely agreed!  Any allocation on any free path must be handled
-extremely- carefully.  To that end...

First, there is always a fallback in case the allocation fails.  Which
might have performance or corner-case robustness issues, but which will
at least allow forward progress.  Second, we consulted with a number of
MM experts to arrive at appropriate GFP_* flags (and their patience is
greatly appreciated).  Third, the paths that can allocate will do so about
one time of 500, so any issues should be spotted sooner rather than later.

So you are quite right to be concerned, but I believe we will be doing the
right things.  And based on his previous track record, I am also quite
certain that Mr. Murphy will be on hand to provide me any additional
education that I might require.

Finally, I have noted down your point about fs_reclaim_acquire() and
fs_reclaim_release().  Whether or not they prove to be needed, I do
appreciate your calling them to my attention.

I just realized that since these dma_fence structs are refcounted and
userspace can hold references (directly, it can pass them around
behind file descriptors) we might never hit such a path until slightly
unusual or evil userspace does something interesting. Do you have
links to those patches? Some googling didn't turn up anything. I can
then figure out whether it's better to risk not spotting issues with
call_rcu vs slapping a memalloc_noio_save/restore around all these
critical section which force-degrades any allocation to GFP_ATOMIC at
most, but has the risk that we run into code that assumes "GFP_KERNEL
never fails for small stuff" and has a decidedly less tested fallback
path than rcu code.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch