Re: [PATCH v2 ath-next 2/2] wifi: ath11k: fix HTC rx insufficient length

Miaoqing Pan <quic_miaoqing@xxxxxxxxxxx> · Tue, 11 Mar 2025 16:29:42 +0800

On 3/10/2025 6:09 PM, Johan Hovold wrote:
On Mon, Mar 10, 2025 at 09:02:17AM +0800, Miaoqing Pan wrote:
A relatively unusual race condition occurs between host software
and hardware, where the host sees the updated destination ring head
pointer before the hardware updates the corresponding descriptor.
When this situation occurs, the length of the descriptor returns 0.

I still think this description is too vague and it doesn't explain how
this race is even possible. It sounds like there's a bug somewhere in
the driver or firmware, but if this really is an indication the hardware
is broken as your reply here seems to suggest:

	https://lore.kernel.org/lkml/bc187777-588c-4fa0-ba8c-847e91c78d43@xxxxxxxxxxx/

then that too should be highlighted in the commit message (e.g. by
describing this as "working around broken hardware").

Ok, will do.

The current error handling method is to increment descriptor tail
pointer by 1, but 'sw_index' is not updated, causing descriptor and
skb to not correspond one-to-one, resulting in the following error:

ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492
ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484

To address this problem, temporarily skip processing the current
descriptor and handle it again next time. However, to prevent this
descriptor from continuously returning 0, use skb cb to set a flag.
If the length returns 0 again, this descriptor will be discarded.

The ath12k ring-buffer handling looks very similar. Do you need a
corresponding workaround in ath12k_ce_completed_recv_next()? Or are you
sure that this (hardware) bug only affects ath11k devices?

Yes, will submit a similar patch.

  	*nbytes = ath11k_hal_ce_dst_status_get_length(desc);
-	if (*nbytes == 0) {
-		ret = -EIO;
-		goto err;
+	if (unlikely(*nbytes == 0)) {
+		struct ath11k_skb_rxcb *rxcb =
+			ATH11K_SKB_RXCB(pipe->dest_ring->skb[sw_index]);
+
+		/* A relatively unusual race condition occurs between host
+		 * software and hardware, where the host sees the updated
+		 * destination ring head pointer before the hardware updates
+		 * the corresponding descriptor.
+		 *
+		 * Temporarily skip processing the current descriptor and handle
+		 * it again next time. However, to prevent this descriptor from
+		 * continuously returning 0, set 'is_desc_len0' flag. If the
+		 * length returns 0 again, this descriptor will be discarded.
+		 */
+		if (!rxcb->is_desc_len0) {
+			rxcb->is_desc_len0 = true;
+			ret = -EIO;
+			goto err;
+		}
  	}

I'm still waiting for feedback from one user that can reproduce the
ring-buffer corruption very easily, but another user mentioned seeing
multiple zero-length descriptor warnings over the weekend when running
with this patch:

	ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)

Are there ever any valid reasons for seeing a zero-length descriptor
(i.e. unrelated to the race at hand)? IIUC the warning would only be
printed when processing such descriptors a second time (i.e. when
is_desc_len0 is set).

That's exactly the logic, only can see the warning in a second time. For 
the first time, ath12k_ce_completed_recv_next() returns '-EIO'.