Hey Suraj, Your email client didn't seem to quote my response in your latest reply, so its difficult to figure out what you're writing vs me below. It also seems to have messed with the line breaks so I'm manually redoing those. Please see if you can figure out how to make that happen for further replies! More comments below... On Tue, Sep 10, 2024 at 12:47:08PM GMT, Suraj Jaiswal wrote: > > > -----Original Message----- > From: Andrew Halaney <ahalaney@xxxxxxxxxx> > Sent: Wednesday, September 4, 2024 3:47 AM > To: Suraj Jaiswal (QUIC) <quic_jsuraj@xxxxxxxxxxx> > Cc: Vinod Koul <vkoul@xxxxxxxxxx>; bhupesh.sharma@xxxxxxxxxx; Andy Gross <agross@xxxxxxxxxx>; Bjorn Andersson <andersson@xxxxxxxxxx>; Konrad Dybcio <konrad.dybcio@xxxxxxxxxx>; David S. Miller <davem@xxxxxxxxxxxxx>; Eric Dumazet <edumazet@xxxxxxxxxx>; Jakub Kicinski <kuba@xxxxxxxxxx>; Rob Herring <robh+dt@xxxxxxxxxx>; Krzysztof Kozlowski <krzysztof.kozlowski+dt@xxxxxxxxxx>; Conor Dooley <conor+dt@xxxxxxxxxx>; Alexandre Torgue <alexandre.torgue@xxxxxxxxxxx>; Jose Abreu <joabreu@xxxxxxxxxxxx>; Maxime Coquelin <mcoquelin.stm32@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx; linux-arm-msm@xxxxxxxxxxxxxxx; devicetree@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-stm32@xxxxxxxxxxxxxxxxxxxxxxxxxxxx; Prasad Sodagudi <psodagud@xxxxxxxxxxx>; Rob Herring <robh@xxxxxxxxxx>; kernel <kernel@xxxxxxxxxxx> > Subject: Re: [PATCH net] net: stmmac: Stop using a single dma_map() for multiple descriptors > > WARNING: This email originated from outside of Qualcomm. Please be wary of any links or attachments, and do not enable macros. > > On Mon, Sep 02, 2024 at 03:24:36PM GMT, Suraj Jaiswal wrote: > > Currently same page address is shared > > between multiple buffer addresses and causing smmu fault for other > > descriptor if address hold by one descriptor got cleaned. > > Allocate separate buffer address for each descriptor for TSO path so > > that if one descriptor cleared it should not clean other descriptor > > address. snip... > > > > static void stmmac_flush_tx_descriptors(struct stmmac_priv *priv, int > > queue) @@ -4351,25 +4380,17 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev) > > pay_len = 0; > > } > > > > - stmmac_tso_allocator(priv, des, tmp_pay_len, (nfrags == 0), queue); > > + if (stmmac_tso_allocator(priv, (skb->data + proto_hdr_len), > > + tmp_pay_len, nfrags == 0, queue, false)) > > + goto dma_map_err; > > Changing the second argument here is subtly changing the dma_cap.addr64 <= 32 > case right before this. Is that intentional? > > i.e., prior, pretend des = 0 (side note but des is a very confusing variable > name for "dma address" when there's also mentions of desc meaning "descriptor" > in the DMA ring). In the <= 32 case, we'd call stmmac_tso_allocator(priv, 0) > and in the else case we'd call stmmac_tso_allocator(priv, 0 + proto_hdr_len). > > With this change in both cases its called with the (not-yet-dma-mapped) > skb->data + proto_hdr_len always (i.e. like the else case). > > Honestly, the <= 32 case reads weird to me without this patch. It seems some > of the buffer is filled but des is not properly incremented? > > I don't know how this hardware is supposed to be programmed (no databook > access) but that seems fishy (and like a separate bug, which would be nice to > squash if so in its own patch). Would you be able to explain the logic there > to me if it does make sense to you? > > <Suraj> des can not be 0 . des 0 means dma_map_single() failed and it will return. > If we see if des calculation (first->des1 = cpu_to_le32(des + proto_hdr_len);) > and else case des calculator ( des += proto_hdr_len;) it is adding proto_hdr_len > to the memory that we after mapping skb->data using dma_map_single. > Same way we added proto_hdr_len in second argument . 0 was just an example, and a confusing one, sorry. Let me paste the original fishy code that I think you've modified the behavior for. Here's the original: if (priv->dma_cap.addr64 <= 32) { first->des0 = cpu_to_le32(des); /* Fill start of payload in buff2 of first descriptor */ if (pay_len) first->des1 = cpu_to_le32(des + proto_hdr_len); /* If needed take extra descriptors to fill the remaining payload */ tmp_pay_len = pay_len - TSO_MAX_BUFF_SIZE; } else { stmmac_set_desc_addr(priv, first, des); tmp_pay_len = pay_len; des += proto_hdr_len; pay_len = 0; } stmmac_tso_allocator(priv, des, tmp_pay_len, (nfrags == 0), queue); Imagine the <= 32 case. Let's say des is address 0 (just for simplicity sake, let's assume that's valid). That means: first->des0 = des; first->des1 = des + proto_hdr_len; stmmac_tso_allocator(priv, des, tmp_pay_len, (nfrags == 0), queue) if des is 0, proto_hdr_len is 64, then that means first->des0 = 0 first->des1 = 64 stmmac_tso_allocator(priv, 0, tmp_pay_len, (nfrags == 0), queue) That seems fishy to me. We setup up the first descriptor with the beginning of des, and then the code goes and sets up more descriptors (stmmac_tso_allocator()) starting with the same des again? Should we be adding the payload length (TSO_MAX_BUFF_SIZE I suppose based on the tmp_pay_len = pay_len - TSO_MAX_BUFFSIZE above)? It seems that <= 32 results in duplicate data in both the "first" descriptor programmed above, and in the "second" descriptor programmed in stmmac_tso_allocator(). Also, since tmp_pay_len is decremented, but des isn't, it seems that stmmac_tso_allocator() would not put all of the buffer in the descriptors and would leave the last TSO_MAX_BUFF_SIZE bytes out? I highlight all of this because with your change here we get the following now in the <= 32 case: first->des0 = des first->des1 = des + proto_hdr_len stmmac_tso_allocator(priv, des + proto_hdr_len, ...) which is a subtle change in the call to stmmac_tso_allocator, meaning a subtle change in the descriptor programming. Both seem wrong for the <= 32 case, but I'm "reading between the lines" with how these descriptors are programmed (I don't have the docs to back this up, I'm inferring from the code). It seems to me that in the <= 32 case we should have: first->des0 = des first->des1 = des + proto_hdr_len stmmac_tso_allocator(priv, des + TSO_MAX_BUF_SIZE, ...) or similar depending on if that really makes sense with how des0/des1 is used (the handling is different in stmmac_tso_allocator() for <= 32, only des0 is used so I'm having a tough time figuring out how much of the des is actually programmed in des0 + des1 above without knowing the hardware better). Does that make sense? The prior code seems fishy to me, your change seems to unintentionally change that fhsy part, but it still seems fishy to me. I don't think you should be changing that code's behavior in that patch, if you think it's right then we should continue with the current behavior prior to your patch, and if you think its wrong we should probably fix that *prior* to this patch in your series. Thanks, Andrew