Re: [PATCH] Bluetooth: use hdev->workqueue when queuing hdev->{cmd,ncmd}_timer works

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Sat, 3 Sep 2022 15:49:26 +0900




On 2022/09/03 6:31, Luiz Augusto von Dentz wrote:
> Hi Tetsuo,
> 
> On Fri, Sep 2, 2022 at 11:45 AM Luiz Augusto von Dentz <luiz.dentz@xxxxxxxxx> wrote:
>> Didn't we introduce HCI_CMD_DRAIN_WORKQUEUE exactly to avoid queuing
>> after the cancel pattern?

HCI_CMD_DRAIN_WORKQUEUE does not help for this case.

What extid=243b7d89777f90f7613b is reporting is

  hci_cmd_timeout() {                             hci_dev_do_reset() {
    starts sleeping due to e.g. preemption
                                                    hci_dev_set_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE); // Sets HCI_CMD_DRAIN_WORKQUEUE flag
                                                    cancel_delayed_work(&hdev->cmd_timer); // does nothing because hci_cmd_timeout() is already running
                                                    cancel_delayed_work(&hdev->ncmd_timer);
                                                    drain_workqueue(hdev->workqueue) {
                                                      sets __WQ_DRAINING flag on hdev->workqueue
                                                      starts waiting for completion of all works on hdev->workqueue
    finishes sleeping due to e.g. preemption
    queue_work(hdev->workqueue,  &hdev->cmd_work) // <= complains attempt to queue work from system_wq into __WQ_DRAINING hdev->workqueue
  }
                                                      finishes waiting for completion of all works on hdev->workqueue
                                                      clears __WQ_DRAINING flag
                                                    }
                                                  }

race condition. Notice that cancel_delayed_work() does not wait for
completion of already started hci_cmd_timeout() callback.

If you need to wait for completion of already started callback,
you need to use _sync version (e.g. cancel_delayed_work_sync()).
And watch out for locking dependency when using _sync version.

>>                           I wonder if wouldn't be better to introduce
>> some function that disables/enables the workqueue so we don't have to
>> do extra tracking in the driver/subsystem?
>>
> 
> I was thinking on doing something like the following:
> 
> https://gist.github.com/Vudentz/a2288015fedbed366fcdb612264a9d16

That patch does not close race, for

@@ -4037,6 +4038,10 @@ static void hci_cmd_work(struct work_struct *work)
        BT_DBG("%s cmd_cnt %d cmd queued %d", hdev->name,
               atomic_read(&hdev->cmd_cnt), skb_queue_len(&hdev->cmd_q));
 
+       /* Don't queue while draining */
+       if (hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE))
+               return;
        /*
         * BUG: WE ARE FREE TO SLEEP FOR ARBITRARY DURATION IMMEDIATELY AFTER CHECKING THE FLAG.
         * ANY "TEST AND DO SOMETHING" NEEDS TO BE PROTECTED BY A LOCK MECHANISM.
         */
+
        /* Send queued commands */
        if (atomic_read(&hdev->cmd_cnt)) {
                skb = skb_dequeue(&hdev->cmd_q);

. In other words, HCI_CMD_DRAIN_WORKQUEUE does not fix what extid=63bed493aebbf6872647 is reporting.

If "TEST AND DO SOMETHING" does not sleep, RCU is a handy lock mechanism.

> 
> Since there is no reason to queue any command if we are draining and
> are gonna reset at the end it is pretty useless to queue commands at
> that point.

Then, you can add that check.