Re: [PATCH 4/6] usbip: fix stub_dev usbip_sockfd_store() races leading to gpf

Shuah Khan <skhan@xxxxxxxxxxxxxxxxxxx> · Tue, 9 Mar 2021 08:22:08 -0700

On 3/8/21 9:27 AM, Shuah Khan wrote:
On 3/8/21 3:10 AM, Tetsuo Handa wrote:
On 2021/03/08 16:35, Tetsuo Handa wrote:
On 2021/03/08 12:53, Shuah Khan wrote:
Fix the above problems:
- Stop using kthread_get_run() macro to create/start threads.
- Create threads and get task struct reference.
- Add kthread_create() failure handling and bail out.
- Hold usbip_device lock to update local and shared states after
   creating rx and tx threads.
- Update usbip_device status to SDEV_ST_USED.
- Update usbip_device tcp_socket, sockfd, tcp_rx, and tcp_tx
- Start threads after usbip_device (tcp_socket, sockfd, tcp_rx, tcp_tx,
   and status) is complete.

No, the whole usbip_sockfd_store() etc. should be serialized using a 
mutex,
for two different threads can open same file and write the same 
content at
the same moment. This results in seeing SDEV_ST_AVAILABLE and 
creating two
threads and overwiting global variables and setting SDEV_ST_USED and 
starting
two threads by each of two thread, which will later fail to call 
kthread_stop()
on one of two thread because global variables are overwritten.

kthread_crate() (which involves GFP_KERNEL allocation) can take long 
time
enough to hit

   usbip_sockfd_store() must perform

       if (sdev->ud.status != SDEV_ST_AVAILABLE) {

Oops. This is

    if (sdev->ud.status == SDEV_ST_AVAILABLE) {

of course.

         /* misc assignments for attach operation */
         sdev->ud.status = SDEV_ST_USED;
       }

   under a lock, or multiple ud->tcp_{tx,rx} are created (which will 
later
   cause a crash like [1]) and refcount on ud->tcp_socket is leaked when
   usbip_sockfd_store() is concurrently called.

problem. That's why my patch introduced usbip_event_mutex lock.

And I think that same serialization is required between 
"rh_port_connect() from attach_store()" and
"rh_port_disconnect() from vhci_shutdown_connection() via 
usbip_event_add(&vdev->ud, VDEV_EVENT_DOWN)
  from vhci_port_disconnect() from detach_store()", for both 
vhci_rx_pdu() from vhci_rx_loop() and
vhci_port_disconnect() from detach_store() can queue VDEV_EVENT_DOWN 
event which can be processed
without waiting for attach_store() to complete.

Yes. We might need synchronization between events, threads, and shutdown
in usbip_host side and in connection polling and threads in vhci.

I am also looking at the shutdown sequences closely as well since the
local state is referenced without usbip_device lock in these paths.

I am approaching these problems as peeling the onion an expression so
we can limit the changes and take a spot fix approach. We have the
goal to address these crashes and not introduce regressions.

I don't seem to be able to reproduce these problems consistently in my
env. with the reproducer. I couldn't reproduce them in normal case at
all. Hence, the this cautious approach that reduces the chance of
regressions and if we see regressions, they can fixed easily.

https://syzkaller.appspot.com/text?tag=ReproC&x=14801034d00000

If this patch series fixes the problems you are seeing, I would like
get these fixes in and address the other two potential race conditions
in another round of patches. I also want to soak these in the next
for a few weeks.

Please let me know if these patches fix the problems you are seeing in 
your env.

Can you verify these patches in your environment and see if you are
seeing any problems? I want to first see where we are with these
fixes.

thanks,
-- Shuah