On 2022/09/03 6:31, Luiz Augusto von Dentz wrote: > Hi Tetsuo, > > On Fri, Sep 2, 2022 at 11:45 AM Luiz Augusto von Dentz <luiz.dentz@xxxxxxxxx> wrote: >> Didn't we introduce HCI_CMD_DRAIN_WORKQUEUE exactly to avoid queuing >> after the cancel pattern? HCI_CMD_DRAIN_WORKQUEUE does not help for this case. What extid=243b7d89777f90f7613b is reporting is hci_cmd_timeout() { hci_dev_do_reset() { starts sleeping due to e.g. preemption hci_dev_set_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE); // Sets HCI_CMD_DRAIN_WORKQUEUE flag cancel_delayed_work(&hdev->cmd_timer); // does nothing because hci_cmd_timeout() is already running cancel_delayed_work(&hdev->ncmd_timer); drain_workqueue(hdev->workqueue) { sets __WQ_DRAINING flag on hdev->workqueue starts waiting for completion of all works on hdev->workqueue finishes sleeping due to e.g. preemption queue_work(hdev->workqueue, &hdev->cmd_work) // <= complains attempt to queue work from system_wq into __WQ_DRAINING hdev->workqueue } finishes waiting for completion of all works on hdev->workqueue clears __WQ_DRAINING flag } } race condition. Notice that cancel_delayed_work() does not wait for completion of already started hci_cmd_timeout() callback. If you need to wait for completion of already started callback, you need to use _sync version (e.g. cancel_delayed_work_sync()). And watch out for locking dependency when using _sync version. >> I wonder if wouldn't be better to introduce >> some function that disables/enables the workqueue so we don't have to >> do extra tracking in the driver/subsystem? >> > > I was thinking on doing something like the following: > > https://gist.github.com/Vudentz/a2288015fedbed366fcdb612264a9d16 That patch does not close race, for @@ -4037,6 +4038,10 @@ static void hci_cmd_work(struct work_struct *work) BT_DBG("%s cmd_cnt %d cmd queued %d", hdev->name, atomic_read(&hdev->cmd_cnt), skb_queue_len(&hdev->cmd_q)); + /* Don't queue while draining */ + if (hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE)) + return; /* * BUG: WE ARE FREE TO SLEEP FOR ARBITRARY DURATION IMMEDIATELY AFTER CHECKING THE FLAG. * ANY "TEST AND DO SOMETHING" NEEDS TO BE PROTECTED BY A LOCK MECHANISM. */ + /* Send queued commands */ if (atomic_read(&hdev->cmd_cnt)) { skb = skb_dequeue(&hdev->cmd_q); . In other words, HCI_CMD_DRAIN_WORKQUEUE does not fix what extid=63bed493aebbf6872647 is reporting. If "TEST AND DO SOMETHING" does not sleep, RCU is a handy lock mechanism. > > Since there is no reason to queue any command if we are draining and > are gonna reset at the end it is pretty useless to queue commands at > that point. Then, you can add that check.