Re: Memory barrier needed with wake_up_process()?

Felipe Balbi <felipe.balbi@xxxxxxxxxxxxxxx> · Tue, 06 Sep 2016 14:43:39 +0300

Hi,

Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
> On Mon, Sep 05, 2016 at 11:29:26AM -0400, Alan Stern wrote:
>> On Mon, 5 Sep 2016, Peter Zijlstra wrote:
>> 
>> > > Actually, seeing it written out like this, one realizes that it really 
>> > > ought to be:
>> > > 
>> > > 	CPU 0				CPU 1
>> > >         -----				-----
>> > >         Start DMA			Handle DMA-complete irq
>> > >         Sleep until bh->state		smp_mb()
>> > > 					set bh->state
>> > > 					Wake up CPU 0
>> > > 	smp_mb()
>> > > 	Compute rc based on contents of the DMA buffer
>> > > 
>> > > (Bear in mind also that on some platforms, the I/O operation is carried 
>> > > out by PIO rather than DMA.)
>
>> > Also, smp_mb() doesn't necessarily interact with MMIO/DMA at all IIRC.
>> > Its only defined to do CPU/CPU interactions.
>> 
>> Suppose the DMA master finishes filling up an input buffer and issues a
>> completion irq to CPU1.  Presumably the data would then be visible to
>> CPU1 if the interrupt handler looked at it.  So if CPU1 executes smp_mb
>> before setting bh->state and waking up CPU0, and if CPU0 executes
>> smp_mb after testing bh->state and before reading the data buffer,
>> wouldn't CPU0 then see the correct data in the buffer?  Even if CPU0 
>> never did go to sleep?
>
> Couple of notes here; I would expect the DMA master to make its stores
> _globally_ visible on 'completion'. Because I'm not sure our smp_mb()
> would help make it globally visible, since its only defined on CPU/CPU
> interactions, not external.
>
> Note that for example ARM has the distinction where smp_mb() uses
> DMB-ISH barrier, dma_[rw]mb() uses DMB-OSH{LD,ST} and mb() uses DSB-SY.
>
> The ISH domain is the Inner-SHarable (IIRC) and only includes CPUs. The
> OSH is Outer-SHarable and adds external agents like DMA (also includes
> CPUs). The DSB-SY thing is even heavier and syncs world or something; I
> always forget these details.
>
>> Or would something more be needed?
>
> The thing is, this is x86 (TSO). Most everything is globally
> ordered/visible and full barriers.
>
> The only reorder allowed by TSO is a later read can happen before a
> prior store. That is, only:
>
> 	X = 1;
> 	smp_mb();
> 	r = Y;
>
> actually needs a full barrier. All the other variants are no-ops.
>
> Note that x86's dma_[rw]mb() are no-ops (like the regular smp_[rw]mb()).
> Which seems to imply DMA is coherent and globally visible.
>
>> > I would very much expect the device IO stuff to order things for us in
>> > this case. "Start DMA" should very much include sufficient fences to
>> > ensure the data under DMA is visible to the DMA engine, this would very
>> > much include things like flushing store buffers and maybe even writeback
>> > caches, depending on platform needs.
>> > 
>> > At the same time, I would expect "Handle DMA-complete irq", even if it
>> > were done with a PIO polling loop, to guarantee ordering against later
>> > operations such that 'complete' really means that.
>> 
>> That's what I would expect too.
>> 
>> Back in the original email thread where the problem was first reported, 
>> Felipe says that the problem appears to be something else.  Here's what 
>> it looks like now, in schematic form:
>> 
>> 	CPU0
>> 	----
>> 	get_next_command():
>> 	  while (bh->state != BUF_STATE_EMPTY)
>> 		sleep_thread();
>> 	  start an input request for bh
>> 	  while (bh->state != BUF_STATE_FULL)
>> 		sleep_thread();
>> 
>> As mentioned above, the input involves DMA and is terminated by an irq.
>> The request's completion handler is bulk_out_complete():
>> 
>> 	CPU1
>> 	----
>> 	bulk_out_complete():
>> 	  bh->state = BUF_STATE_FULL;
>> 	  wakeup_thread();
>> 
>> According to Felipe, when CPU0 wakes up and checks bh->state, it sees a 
>> value different from BUF_STATE_FULL.  So it goes back to sleep again 
>> and doesn't make any forward progress.
>> 
>> It's possible that something else is changing bh->state when it 
>> shouldn't.  But if this were the explanation, why would Felipe see that 
>> the problem goes away when he changes the memory barriers in 
>> sleep_thread() and wakeup_thread() to smp_mb()?
>
> Good question, let me stare at the code more.
>
> Could you confirm that bulk_{in,out}_complete() work on different
> usb_request structures, and they can not, at any time, get called on the
> _same_ request?

usb_requests are allocated for a specific endpoint and USB Device
Controller (UDC) drivers refuse to queue requests allocated for epX to
epY, so this really can never happen.

My fear now, however, is that changing smp_[rw]mb() to smp_mb() just
adds extra overhead which makes the problem much, much less likely to
happen. Does that sound plausible to you?

-- 
balbi
Attachment:
signature.asc

Description: PGP signature