Re: [RFC 0/5] fix races in CDC-WDM

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Fri, 18 Sep 2020 01:17:39 +0900

On 2020/09/17 23:17, Oliver Neukum wrote:
> The API and its semantics are clear. Write schedules a write:
> 
>        A  successful  return  from  write() does not make any guarantee that data has been committed to disk.  On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data.  In this case, some errors might be
>        delayed until a future write(2), fsync(2), or even close(2).  The only way to be sure is to call fsync(2) after you are done writing all your data.

But I think that this leaves room for allowing write() to imply fsync()
(i.e. write() is allowed to wait for data to be committed to disk).

> 
> Fsync flushes data:
> 
>        fsync()  transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if
>        the system crashes or is rebooted.  This includes writing through or flushing a disk cache if present.  The call blocks until the device reports that the transfer has completed.
> 
> If user space does not call fsync(), the error is supposed to be reported
> by the next write() and if there is no next write(), close() shall report it.

Where does "the next" (and not "the next after the next") write() come from?

>> You did not answer to
>>
>>   How do we guarantee that N'th write() request already set desc->werr before
>>   (N+1)'th next write() request is issued? If (N+1)'th write() request reached
>>   memdup_user() before desc->werr is set by callback of N'th write() request,
>>   (N+1)'th write() request will fail to report the error from N'th write() request.
>>   Yes, that error would be reported by (N+2)'th write() request, but the userspace
>>   process might have already discarded data needed for taking some actions (e.g.
>>   print error messages, retry the write() request with same argument).
> 
> We don't, nor do we have to. You are right, error reporting can be
> improved. I implemented fsync() to do so.

Are you saying that if user space does not call fsync(), the error is allowed
to be reported by the write() after the next one (in other words, by the
(N+2)'th write())?
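
For reference, the user space pattern that would get the error reported
promptly, per the man page text quoted above, looks roughly like the sketch
below. write_and_sync() is a hypothetical helper name, and whether fsync() on
a cdc-wdm file descriptor returns the error of the preceding write() depends
on the fsync() implementation in this series, which I am only assuming here.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: issue one write() and then fsync() so that a
 * delayed error is reported before the caller discards its data.
 * Returns 0 on success or a negative errno value on failure.
 */
static int write_and_sync(int fd, const void *buf, size_t len)
{
	ssize_t n = write(fd, buf, len);

	if (n < 0)
		return -errno;
	if ((size_t)n != len)
		return -EIO;	/* treat a short write as an error in this sketch */
	if (fsync(fd) < 0)
		return -errno;
	return 0;
}
```

A caller that skips the fsync() step is the case my question is about.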

> 
>> . At least I think that
>>
>>         spin_lock_irq(&desc->iuspin);
>>         we = desc->werr;
>>         desc->werr = 0;
>>         spin_unlock_irq(&desc->iuspin);
>>         if (we < 0)
>>                 return usb_translate_errors(we);
>>
>> in wdm_write() should be moved to after !test_bit(WDM_IN_USE, &desc->flags).
> 
> Why?

Otherwise, we can't make sure that the (N+1)'th write() will report the error
from the N'th write().

I don't know the characteristics of the data passed via wdm_write(), but I
guess that it is stateful control commands rather than a meaningless byte
stream. If so, I guess that the (N+1)'th wdm_write() attempt should be made
only after confirming that the N'th wdm_write() attempt has received its
wdm_callback() response. Since user space must still hold the state / data
used by the N'th wdm_write() attempt in order to act on its error, reporting
that error from a later write() attempt would be too late to be useful.
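
To illustrate the ordering problem, here is a single-threaded user space
model (not the driver code). It assumes the worst-case interleaving: the
callback of the in-flight request fires while the next writer is waiting for
the in-use flag to clear. struct model and all function names are made up for
this sketch; only the order of "check error" and "wait" mirrors wdm_write().

```c
#include <stdbool.h>

struct model {
	bool in_use;	/* models the WDM_IN_USE bit */
	int werr;	/* models desc->werr */
	int pending;	/* status the in-flight request will complete with */
};

/* Models sleeping until !in_use; the completion callback runs meanwhile. */
static void wait_not_in_use(struct model *m)
{
	if (m->in_use) {
		m->werr = m->pending;
		m->in_use = false;
	}
}

/* Models the "read and clear desc->werr" step. */
static int take_error(struct model *m)
{
	int we = m->werr;

	m->werr = 0;
	return we;
}

/* Models submitting a request that will later complete with 'status'. */
static void submit(struct model *m, int status)
{
	m->in_use = true;
	m->pending = status;
}

/* Current order: check the error first, then wait. */
static int write_check_first(struct model *m, int status)
{
	int we = take_error(m);

	if (we < 0)
		return we;
	wait_not_in_use(m);
	submit(m, status);
	return 0;
}

/* Proposed order: wait first, then check the error. */
static int write_wait_first(struct model *m, int status)
{
	int we;

	wait_not_in_use(m);
	we = take_error(m);
	if (we < 0)
		return we;
	submit(m, status);
	return 0;
}
```

In this model, with the N'th request failing, write_check_first() returns
success for the (N+1)'th call and the error only for the (N+2)'th call, while
write_wait_first() returns the error for the (N+1)'th call.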

>> In addition, is
>>
>>         /* using write lock to protect desc->count */
>>         mutex_lock(&desc->wlock);
>>
>> required? Isn't wdm_mutex that is actually protecting desc->count from modification?
>> If it is desc->wlock that is actually protecting desc->count, the !desc->count check
>> in wdm_release() and the desc->count == 1 check in wdm_open() have to be done with
>> desc->wlock held.
> 
> Correct. So should wdm_mutex be dropped earlier?

If recover_from_urb_loss() can tolerate a stale desc->count value, wdm_mutex already
protects desc->count. I don't know how this module works. I don't know whether
wdm_mutex and/or desc->wlock is held when recover_from_urb_loss() is called from
wdm_resume(). It seems that desc->wlock is held but wdm_mutex is not held when
recover_from_urb_loss() is called from wdm_post_reset().

By the way, after the fixes, we could replace

  spin_lock_irq(&desc->iuspin);
  rv = desc->werr;
  desc->werr = 0;
  spin_unlock_irq(&desc->iuspin);

with

  rv = xchg(&desc->werr, 0);

and avoid spin_lock_irq()/spin_unlock_irq(), because there are many
locations which need to check and clear the error...
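
As a user space analogue of that idea (a sketch only, using C11
atomic_exchange() in place of the kernel's xchg()), the "fetch and clear"
step becomes a single atomic operation. struct fake_desc and the fake_*
function names are hypothetical, and this assumes that no other field has to
be updated together with werr under desc->iuspin.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the device structure; only werr is modelled. */
struct fake_desc {
	atomic_int werr;	/* 0 = no pending error, negative = error code */
};

/* Completion side: record the status of the finished request. */
static void fake_record_error(struct fake_desc *desc, int err)
{
	atomic_store(&desc->werr, err);
}

/* Reader side: fetch and clear in one step, like xchg(&desc->werr, 0). */
static int fake_take_error(struct fake_desc *desc)
{
	return atomic_exchange(&desc->werr, 0);
}
```

Because the exchange both reads the old value and stores 0, two callers
racing on fake_take_error() cannot both observe the same error.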