Re: issues with commit c1da59dad0eb ("cdc-wdm: Clear read pipeline in case of error")

Robert Foss <robert.foss@xxxxxxxxxxxxx> · Thu, 20 Apr 2017 17:18:35 -0400




Hi Björn,

Thanks for the thorough and explicit feedback, it was rather helpful.


On 2017-04-20 04:32 AM, Bjørn Mork wrote:
Hello Robert,

Sorry for being much too late here, but during recent attemts to debug
issues caused by my commit 833415a3e781 ("cdc-wdm: fix "out-of-sync" due
to missing notifications") I believe I found a couple of issues with
commit c1da59dad0eb. At least one of them is serious (potentional
GPF_KERNEL allocation in interrupt context).

But I don't know if I understand the problem the commit set out to
solves, which is why I have no proposed patch here.


The description of the situation that I received was then one
from the commit message:

"In case of a read error, userspace may not actually read the data, since the
driver returns an error through wdm_poll. After this, the underlying device may
attempt to send us more data, but the queue is not processed. While userspace is
also blocked, because the read error is never cleared."

So the way I interpret it is that after an error condition the device can get 
prevented from sending data, which in turn can cause userspace to not issue a 
read (which would have cleared the error).

I could certainly be wrong in this interpretation or be misunderstanding something.

Prathmesh, Guenter: Do you have any comments?



@@ -189,7 +192,13 @@ static void wdm_in_callback(struct urb *urb)
 		}
 	}

-	desc->rerr = status;
+	/*
+	 * only set a new error if there is no previous error.
+	 * Errors are only cleared during read/open
+	 */
+	if (desc->rerr  == 0)
+		desc->rerr = status;
+
 	if (length + desc->length > desc->wMaxCommand) {
 		/* The buffer would overflow */
 		set_bit(WDM_OVERFLOW, &desc->flags);



This is minor, but in my opinion this change is flawed.  It changes the
meaning of "desc->rerr" from the original "last status code" to the new
"first error since last userspace read".

wdm_read will only return and clear such errors if desc->length == 0 on
entry.  This is up to the userspace application to decide. I claim that
any error should either be reported immediately or not at all.  Caching
errors forever like this, and then reporting them to userspace at some
arbitrary later event, is pointless.

The existing code did the correct thing in the absence of proper event
queing.  The only real alternative would be to completely stop
processing device events until userspace has seen the error.  But that
would just make things fail completely for no good reason.

We should probably consider making a queue/list of the read buffers and
keep each status code with the associated list element.  Then we could
report back the error when userspace reads the element that caused it.

I see your point about changing the semantics of desc->rerr, and agree that it 
probably isn't desirable.

So an event queue is starting to look like the way forward.


@@ -221,9 +230,19 @@ static void wdm_in_callback(struct urb *urb)
 	}

 skip_error:
+	set_bit(WDM_READ, &desc->flags);
 	wake_up(&desc->wait);

-	set_bit(WDM_READ, &desc->flags);
+	if (desc->rerr) {
+		/*
+		 * Since there was an error, userspace may decide to not read
+		 * any data after poll'ing.
+		 * We should respond to further attempts from the device to send
+		 * data, so that we can get unstuck.
+		 */
+		service_outstanding_interrupt(desc);
+	}
+
 unlock:
 	spin_unlock(&desc->iuspin);
  }

We cannot do this.  This is an URB callback, and service_outstanding_interrupt()
calls  usb_submit_urb(...,GFP_KERNEL).

This issue could of course be avoided by simply changing the allocation
flags. But looking at the comment here, I am pretty sure that this is
the wrong solution to the unanticipated userspace behaviour.  If
userspace decides not to read() if poll() sets POLLERR, then it *should
not* expect the error condition to be cleared. Or am I missing something?

The allocation flags aside, calling poll() and it returning POLLERR should
not clear the error, so the behavior of the


In any case: This will not unstick anything, given the previous problem
with desc->rerr not being updated unless userspace reads.  You will just
flush the device queue, but the error condition is still there. So
poll() will still set POLLERR.  If the comment is correct, this is a
permanent deadlock.

userspace can avoid tge issue by reading the descriptor on POLLERR. If
that is not the way this is expected to work (I am by no means a POSIX
expert), then I believe the correct fix must be to change wdm_poll(),
either to clear the error condition or to not set POLLERR in this case.

So, in lieu of an actual event queue, maybe this patch should be reverted.
Thoughts?


Rob.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html