> From: Sunil Muthuswamy <sunilmut@xxxxxxxxxxxxx> > Sent: Thursday, May 16, 2019 11:11 AM > > Hi Sunil, > > To make it clear, your patch itself is good, and I was just talking about > > the next change we're going to make. Once we make the next change, > > IMO we need a further patch to schedule hvs_close_timeout() to the new > > single-threaded workqueue rather than the global "system_wq". > > > Thanks for your review. Can you add a 'signed-off' from your side to the patch. I have provided my Reviewed-by. I guess this should be enough. Of course, David makes the final call. It would be great if the maintaners of the Hyper-V drivers listed in the "To:" could provide their Signed-off-by. > > > Next, we're going to remove the "channel->rescind" check in > > > vmbus_hvsock_device_unregister() -- when doing that, IMO we need to > > > fix a potential race revealed by the schedule_delayed_work() in this > > > patch: > > > > > > When hvs_close_timeout() finishes, the "sk" struct has been freed, but > > > vmbus_onoffer_rescind() -> channel->chn_rescind_callback(), i.e. > > > hvs_close_connection(), may be still running and referencing the "chan" > > > and "sk" structs (), which should no longer be referenced when > > > hvs_close_timeout() finishes, i.e. "get_per_channel_state(chan)" is no > > > longer safe. The problem is: currently there is no sync mechanism > > > between vmbus_onoffer_rescind() and hvs_close_timeout(). > > > > > > The race is a real issue only after we remove the "channel->rescind" > > > in vmbus_hvsock_device_unregister(). > > > > A correction: IMO the race is real even for the current code, i.e. without > > your patch: in vmbus_onoffer_rescind(), between we set channel->rescind > > and we call channel->chn_rescind_callback(), the channel may have been > > freed by vmbus_hvsock_device_unregister(). > > > > This race window is small and I guess that's why we never noticed it. > > > > > I guess we need to introduce a new single-threaded workqueue in the > > > vmbus driver, and offload both vmbus_onoffer_rescind() and > > > hvs_close_timeout() onto the new workqueue. > > > Something is a miss if the guest has to wait for the host to close the channel > prior to cleaning it up from it's side. That's waste of resources, doesn't matter I agree. > if you do it in a system thread, dedicated pool or anyway else. I think the right > way to deal with this is to unregister the rescind callback routine, wait for any > running rescind callback routine to finish and then drop the last reference to > the socket, which should lead to all the cleanup. I understand that some of the > facility of unregistering the rescind callback might not exist today. Considering the concurrency, I'm not sure if it's easy or possible to safely unregister the chn_rescind_callback. My hunch is: doing that may be more difficult than adding a new single-thread workqueue. Thanks, -- Dexuan