On Wed, Jul 15, 2020 at 9:41 AM Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>
> On Wed, 15 Jul 2020 07:48:06 PDT (-0700), Will Deacon wrote:
> > On Wed, Jul 15, 2020 at 07:03:49AM -0700, Palmer Dabbelt wrote:
> >> On Wed, 15 Jul 2020 03:42:46 PDT (-0700), Will Deacon wrote:
> >> > Hmm. Although I _think_ something like the diff below ought to work,
> >> > are you sure you want to be doing MMIO writes in preemptible context?
> >> > Setting '.disable_locking = true' in 'sifive_gpio_regmap_config'
> >> > implies to me that you should be handling the locking within the
> >> > driver itself, and all the other regmap writes are protected by
> >> > '&gc->bgpio_lock'.
> >>
> >> I guess my goal here was to avoid fixing the drivers: it's one thing if
> >> it's just broken SiFive drivers, as they're all a bit crusty, but this
> >> is blowing up for me in the 8250 driver on QEMU as well. At that point
> >> I figured there'd be an endless stream of bugs around this and I'd
> >> rather just fix it in one place.
> >
> > Right, and my patch should solve that.
> >
> >> > Given that riscv is one of the few architectures needing an
> >> > implementation of mmiowb(), doing MMIO in a preemptible section seems
> >> > especially dangerous as you have no way to ensure completion of the
> >> > writes without adding an mmiowb() to the CPU migration path (i.e.
> >> > context switch).
> >>
> >> I was going to just stick one in our context switching code
> >> unconditionally. While we could go track cumulative writes outside the
> >> locks, the mmiowb is essentially free for us, because the one RISC-V
> >> implementation treats all fences the same way, so the subsequent
> >> store_release would hold all this up anyway.
> >>
> >> I think the right thing to do is to add some sort of arch hook right
> >> about here:
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index cfd71d61aa3c..14b4f8b7433f 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -3212,6 +3212,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> >>  	prev_state = prev->state;
> >>  	vtime_task_switch(prev);
> >>  	perf_event_task_sched_in(prev, current);
> >> +	finish_arch_pre_release(prev);
> >>  	finish_task(prev);
> >>  	finish_lock_switch(rq);
> >>  	finish_arch_post_lock_switch();
> >>
> >> but I was just going to stick it in switch_to() for now... :) I guess
> >> we could also roll the fence up into yet another one-off primitive for
> >> the scheduler, something like
> >
> > What does the above get you over switch_to()?
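[Editor's note: for readers without the tree handy, the generic accounting
being argued over here lives in include/asm-generic/mmiowb.h. Below is a
paraphrased sketch of the relevant pieces from a kernel of that era; treat
it as approximate context rather than an exact quote of the file:

	/* From include/asm-generic/mmiowb_types.h (paraphrased). */
	struct mmiowb_state {
		u16	nesting_count;	/* spinlock nesting depth on this CPU */
		u16	mmiowb_pending;	/* non-zero if MMIO writes are pending inside a lock */
	};

	/* Called by writeX() to record that an MMIO write happened. */
	static inline void mmiowb_set_pending(void)
	{
		struct mmiowb_state *ms = __mmiowb_state();
		ms->mmiowb_pending = ms->nesting_count;
	}

	/* Called on spin_lock(): track how deeply locks are nested. */
	static inline void mmiowb_spin_lock(void)
	{
		struct mmiowb_state *ms = __mmiowb_state();
		ms->nesting_count++;
	}

	/* Called on spin_unlock(): flush any pending MMIO with an mmiowb(). */
	static inline void mmiowb_spin_unlock(void)
	{
		struct mmiowb_state *ms = __mmiowb_state();

		if (unlikely(ms->mmiowb_pending)) {
			ms->mmiowb_pending = 0;
			mmiowb();
		}

		ms->nesting_count--;
	}

This is the state that the diff quoted next changes: with the
this_cpu_ptr() -> raw_cpu_ptr() switch and the nesting_count check, a
preemptible writeX() no longer dirties mmiowb_pending on whatever CPU it
happens to land on.]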
> >> > diff --git a/include/asm-generic/mmiowb.h b/include/asm-generic/mmiowb.h
> >> > index 9439ff037b2d..5698fca3bf56 100644
> >> > --- a/include/asm-generic/mmiowb.h
> >> > +++ b/include/asm-generic/mmiowb.h
> >> > @@ -27,7 +27,7 @@
> >> >  #include <asm/smp.h>
> >> >
> >> >  DECLARE_PER_CPU(struct mmiowb_state, __mmiowb_state);
> >> > -#define __mmiowb_state()	this_cpu_ptr(&__mmiowb_state)
> >> > +#define __mmiowb_state()	raw_cpu_ptr(&__mmiowb_state)
> >> >  #else
> >> >  #define __mmiowb_state()	arch_mmiowb_state()
> >> >  #endif	/* arch_mmiowb_state */
> >> > @@ -35,7 +35,9 @@ DECLARE_PER_CPU(struct mmiowb_state, __mmiowb_state);
> >> >  static inline void mmiowb_set_pending(void)
> >> >  {
> >> >  	struct mmiowb_state *ms = __mmiowb_state();
> >> > -	ms->mmiowb_pending = ms->nesting_count;
> >> > +
> >> > +	if (likely(ms->nesting_count))
> >> > +		ms->mmiowb_pending = ms->nesting_count;
> >>
> >> Ya, that's one of the earlier ideas I had, but I decided it doesn't
> >> actually do anything: if we're schedulable, then we know that pending
> >> and count are zero, so the check isn't necessary. It made sense late
> >> last night and still does this morning, but I haven't had my coffee yet.
> >
> > What it does is prevent preemptible writeX() from trashing the state on
> > another CPU, so I think it's a valid fix. I agree that it doesn't help
> > you if you need mmiowb(), but then that _really_ should only be needed
> > if you're holding a spinlock. If you're doing concurrent lockless MMIO,
> > you deserve all the pain you get.
> >
> > I don't get why you think the patch does nothing, as it will operate as
> > expected if writeX() is called with preemption disabled, which is the
> > common case.
>
> Aside from PREEMPT_RT, I don't understand how you can be scheduled onto a
> CPU that has a non-zero nesting_count. Doesn't that mean the CPU you're
> being scheduled onto is itself holding a spinlock, and therefore can't
> have anything scheduled onto it?
>
> Sure, some interrupt could come in the middle, but it's still going to
> see the non-zero nesting_count left over from the spinlock being held,
> and will therefore avoid trashing the accumulated mmiowb state. As far as
> I can tell, everything then proceeds acceptably: when the interrupt
> unlocks, it'll do an mmiowb() (whether it did an IO or not), which is
> sufficient to ensure that the IO from the interrupted code is completed
> before the unlock from that code.
>
> I must be missing something here?

Will and I talked for a bit; this patch is correct. He's going to send it,
and I'm promoting smp_mb__after_spinlock() to include IO ordering so we
don't break code when scheduling.

Thanks!
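[Editor's note: on RISC-V, "promoting smp_mb__after_spinlock() to include
IO ordering" amounts to widening that barrier's fence to cover I/O as well
as regular memory accesses. A sketch of what that looks like, assuming the
RISCV_FENCE() helper from arch/riscv/include/asm/barrier.h; this is an
illustration of the idea, not a quote of the eventual patch:

	#define RISCV_FENCE(p, s) \
		__asm__ __volatile__ ("fence " #p "," #s : : : "memory")

	/*
	 * Order prior I/O and regular memory accesses against everything
	 * after the lock acquire, so a task migrating between CPUs cannot
	 * observe its MMIO writes out of order across the context switch.
	 */
	#define smp_mb__after_spinlock()	RISCV_FENCE(iorw, iorw)
]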
> >> I'm kind of tempted to just declare "mmiowb() is fast on RISC-V, so
> >> let's do it unconditionally everywhere it's necessary". IIRC that's
> >> essentially true on the existing implementation, as it'll get rolled up
> >> into any upcoming fence anyway. It seems like building any real machine
> >> that relies on the orderings provided by mmiowb() is going to open an
> >> infinite rabbit hole of bugs anyway, so we'd just rely on the hardware
> >> to elide the now-unnecessary fences: we'd be throwing static code size
> >> at this wacky memory model and then forgetting about it.
> >
> > If you can do that, that's obviously the best approach.
> >
> >> I'm going to send out a patch set that does all the work I think is
> >> necessary to avoid fixing up the various drivers, with the accounting
> >> code to avoid mmiowb()s all over our port. I'm not sure I'm going to
> >> like it, but I guess we can argue as to exactly how ugly it is :)
> >
> > Ok.
> >
> > Will
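[Editor's note: the "accounting code to avoid mmiowbs" Palmer plans to send
is not shown in this thread. A hypothetical sketch of the idea, reusing the
generic mmiowb state sketched earlier -- the helper name and its placement
in the context-switch path are invented here for illustration:

	/*
	 * Hypothetical: called from the scheduler's switch-out path.
	 * Only pay for the I/O fence when MMIO writes are actually
	 * pending, instead of fencing unconditionally in switch_to().
	 */
	static inline void mmiowb_flush_on_switch(void)
	{
		struct mmiowb_state *ms = __mmiowb_state();

		if (ms->mmiowb_pending) {
			ms->mmiowb_pending = 0;
			mmiowb();	/* "fence o,w" on RISC-V */
		}
	}
]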