Hi Bjorn, >-----Original Message----- >From: Bjorn Helgaas <helgaas@xxxxxxxxxx> >Sent: Monday, November 11, 2024 11:59 PM >To: K, Kiran <kiran.k@xxxxxxxxx> >Cc: linux-bluetooth@xxxxxxxxxxxxxxx; Srivatsa, Ravishankar ><ravishankar.srivatsa@xxxxxxxxx>; Tumkur Narayan, Chethan ><chethan.tumkur.narayan@xxxxxxxxx>; Devegowda, Chandrashekar ><chandrashekar.devegowda@xxxxxxxxx> >Subject: Re: [PATCH v1 1/2] Bluetooth: btintel_pcie: Add handshake between >driver and firmware > >On Tue, Oct 01, 2024 at 04:14:50PM +0530, Kiran K wrote: >> The following handshake mechanism needs be followed after firmware >> download is completed to bring the firmware to running state. >> >> After firmware fragments of Operational image are downloaded and >> secure sends result of the image succeeds, >> >> 1. Driver sends HCI Intel reset with boot option #1 to switch FW image. >> 2. FW sends Alive GP[0] MSIx >> 3. Driver enables data path (doorbell 0x460 for RBDs, etc...) 4. >> Driver gets Bootup event from firmware 5. Driver performs D0 entry to >> device (WRITE to IPC_Sleep_Control =0x0) 6. FW sends Alive GP[0] MSIx >> 7. Device host interface is fully set for BT protocol stack operation. >> 8. Driver may optionally get debug event with ID 0x97 which can be >> dropped >> >> For Intermediate loadger image, all the above steps are applicable >> expcept #5 and #6. > >I see this is already applied to bluetooth-next and is probably immutable, but >if not... > >s/loadger/loader/ >s/expcept/except/ > >But more importantly, the race below is still in linux-next as of next-20241111. >I mentioned this before at >https://lore.kernel.org/all/20241023191352.GA924610@bhelgaas/#t, but it >probably got lost in the suspend/resume conversation. > >> @@ -1053,8 +1272,11 @@ static int btintel_pcie_send_frame(struct hci_dev >*hdev, >> struct sk_buff *skb) >> { >> struct btintel_pcie_data *data = hci_get_drvdata(hdev); >> + struct hci_command_hdr *cmd; >> + __u16 opcode = ~0; >> int ret; >> u32 type; >> + u32 old_ctxt; >> >> /* Due to the fw limitation, the type header of the packet should be >> * 4 bytes unlike 1 byte for UART. In UART, the firmware can read @@ >> -1073,6 +1295,8 @@ static int btintel_pcie_send_frame(struct hci_dev *hdev, >> switch (hci_skb_pkt_type(skb)) { >> case HCI_COMMAND_PKT: >> type = BTINTEL_PCIE_HCI_CMD_PKT; >> + cmd = (void *)skb->data; >> + opcode = le16_to_cpu(cmd->opcode); >> if (btintel_test_flag(hdev, INTEL_BOOTLOADER)) { >> struct hci_command_hdr *cmd = (void *)skb->data; >> __u16 opcode = le16_to_cpu(cmd->opcode); @@ - >1111,6 +1335,30 @@ >> static int btintel_pcie_send_frame(struct hci_dev *hdev, >> bt_dev_err(hdev, "Failed to send frame (%d)", ret); >> goto exit_error; >> } >> + >> + if (type == BTINTEL_PCIE_HCI_CMD_PKT && >> + (opcode == HCI_OP_RESET || opcode == 0xfc01)) { >> + old_ctxt = data->alive_intr_ctxt; >> + data->alive_intr_ctxt = >> + (opcode == 0xfc01 ? BTINTEL_PCIE_INTEL_HCI_RESET1 >: >> + BTINTEL_PCIE_HCI_RESET); >> + bt_dev_dbg(data->hdev, "sent cmd: 0x%4.4x alive context >changed: %s -> %s", >> + opcode, btintel_pcie_alivectxt_state2str(old_ctxt), >> + btintel_pcie_alivectxt_state2str(data- >>alive_intr_ctxt)); >> + if (opcode == HCI_OP_RESET) { >> + data->gp0_received = false; >> + ret = wait_event_timeout(data->gp0_wait_q, >> + data->gp0_received, >> + >msecs_to_jiffies(BTINTEL_DEFAULT_INTR_TIMEOUT_MS)); > >This looks like a race. You're apparently started something previously that will >eventually result in data->gp0_received being set. The ordering you expect is >this: > > 1) Tell device to do something and interrupt on completion > 2) Set data->gp0_received = false here > 3) Call wait_event_timeout() here > 4) Completion interrupt causes btintel_pcie_msix_gp0_handler() to be > called > 5) btintel_pcie_msix_gp0_handler() sets data->gp0_received = true > 6) wait_event_timeout() completes > >But this ordering can also happen: > > 1) Tell device to do something and interrupt on completion > 2) Completion interrupt causes > btintel_pcie_msix_gp0_handler() to be called > 3) btintel_pcie_msix_gp0_handler() sets data->gp0_received = true > 4) Set data->gp0_received = false here, overwriting the "true" > 5) Call wait_event_timeout() here > 6) wait_event_timeout() times out because completion interrupt has > already happened > 7) We complain "No alive interrupt received" here > Possible. Unfortunately, this issue did not show up in our test. I will reproduce this issue and publish the same. >You should be able to force this failure by adding a sleep before setting data- >>gp0_received = false in this patch. Ack. > >> + if (!ret) { >> + hdev->stat.err_tx++; >> + bt_dev_err(hdev, "No alive interrupt received >for %s", >> + btintel_pcie_alivectxt_state2str(data- >>alive_intr_ctxt)); >> + ret = -ETIME; >> + goto exit_error; >> + } >> + } >> + } >> hdev->stat.byte_tx += skb->len; >> kfree_skb(skb); Thanks, Kiran