On 06-May-19 21:31, Jason Gunthorpe wrote: > On Mon, May 06, 2019 at 04:51:00PM +0300, Gal Pressman wrote: >>>>>>>> +static void efa_com_admin_flush(struct efa_com_dev *edev) >>>>>>>> +{ >>>>>>>> + struct efa_com_admin_queue *aq = &edev->aq; >>>>>>>> + >>>>>>>> + clear_bit(EFA_AQ_STATE_RUNNING_BIT, &aq->state); >>>>>>> >>>>>>> This scheme looks use after free racey to me.. >>>>>> >>>>>> The running bit stops new admin commands from being submitted, clearly the exact >>>>>> moment in which the bit is cleared is "racy" to submission of admin commands but >>>>>> that is taken care of. >>>>>> >>>>>> The submission of an admin command is atomic as it is guarded by an admin queue >>>>>> lock. >>>>>> The same lock is acquired by this flow as well when flushing the admin queue. >>>>>> After all admin commands have been aborted and we know for sure that >>>>>> no new >>>>> >>>>> The problem is the 'abort' does nothing to ensure parallel threads are >>>>> no longer touching this memory, >>>> >>>> Which memory? The user threads touch the user allocated buffers which are not >>>> being freed on admin queue destroy. >>> >>> The memory the other thread is touching is freed a few lines below in >>> a devm_kfree. The apparent purpose of this code is to make the other >>> thread stop but does it wrong. >> >> Are we talking about the completion context/completion context pool? >> The user thread does use this memory, but this is done while the avail_cmds >> semaphore is down which means the wait_for_abort_completion function is still >> waiting for this thread to finish. > > We are talking about > > CPU 0 CPU 1 > efa_com_submit_admin_cmd() > spin_lock(&aq->sq.lock); > > efa_remove_device() > efa_com_admin_destroy() > efa_com_admin_flush() > [..] > kfree(aq) > > As long as efa_com_submit_admin_cmd() is running the semaphore is still "down" which means the wait function will be blocked. > > So, either there is no possible concurrency with the 'aq' users and > device removal, in which case all the convoluted locking in > efa_com_admin_flush() and related is unneeded > > Or there is concurrency and it isn't being torn down properly, so we > get the above race. > > My read is that all the 'admin commands' are done off of verbs > callbacks and ib_unregister_device is called before we get to > efa_remove_device (guaranteeing there are no running verbs callbacks), > so there is no possible concurrency and all this efa_com_admin_flush() > and related is pointless obfuscation. Delete it. You're right, the "abort" flow was overcautious as there shouldn't be any pending threads after ib_unregister_device. I will remove this flow.