On 9/24/2024 10:06 PM, Andrew Halaney wrote: > On Tue, Sep 24, 2024 at 04:36:59PM GMT, Sarosh Hasan wrote: >> >> >> On 9/10/2024 7:34 PM, Andrew Halaney wrote: >>> Hey Suraj, >>> >>> Your email client didn't seem to quote my response in your latest reply, >>> so its difficult to figure out what you're writing vs me below. It also >>> seems to have messed with the line breaks so I'm manually redoing those. >>> >>> Please see if you can figure out how to make that happen for further >>> replies! >>> >>> More comments below... >>> >>> On Tue, Sep 10, 2024 at 12:47:08PM GMT, Suraj Jaiswal wrote: >>>> >>>> >>>> -----Original Message----- >>>> From: Andrew Halaney <ahalaney@xxxxxxxxxx> >>>> Sent: Wednesday, September 4, 2024 3:47 AM >>>> To: Suraj Jaiswal (QUIC) <quic_jsuraj@xxxxxxxxxxx> >>>> Cc: Vinod Koul <vkoul@xxxxxxxxxx>; bhupesh.sharma@xxxxxxxxxx; Andy Gross <agross@xxxxxxxxxx>; Bjorn Andersson <andersson@xxxxxxxxxx>; Konrad Dybcio <konrad.dybcio@xxxxxxxxxx>; David S. Miller <davem@xxxxxxxxxxxxx>; Eric Dumazet <edumazet@xxxxxxxxxx>; Jakub Kicinski <kuba@xxxxxxxxxx>; Rob Herring <robh+dt@xxxxxxxxxx>; Krzysztof Kozlowski <krzysztof.kozlowski+dt@xxxxxxxxxx>; Conor Dooley <conor+dt@xxxxxxxxxx>; Alexandre Torgue <alexandre.torgue@xxxxxxxxxxx>; Jose Abreu <joabreu@xxxxxxxxxxxx>; Maxime Coquelin <mcoquelin.stm32@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx; linux-arm-msm@xxxxxxxxxxxxxxx; devicetree@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-stm32@xxxxxxxxxxxxxxxxxxxxxxxxxxxx; Prasad Sodagudi <psodagud@xxxxxxxxxxx>; Rob Herring <robh@xxxxxxxxxx>; kernel <kernel@xxxxxxxxxxx> >>>> Subject: Re: [PATCH net] net: stmmac: Stop using a single dma_map() for multiple descriptors >>>> >>>> WARNING: This email originated from outside of Qualcomm. Please be wary of any links or attachments, and do not enable macros. >>>> >>>> On Mon, Sep 02, 2024 at 03:24:36PM GMT, Suraj Jaiswal wrote: >>>>> Currently same page address is shared >>>>> between multiple buffer addresses and causing smmu fault for other >>>>> descriptor if address hold by one descriptor got cleaned. >>>>> Allocate separate buffer address for each descriptor for TSO path so >>>>> that if one descriptor cleared it should not clean other descriptor >>>>> address. >>> >>> snip... >>> >>>>> >>>>> static void stmmac_flush_tx_descriptors(struct stmmac_priv *priv, int >>>>> queue) @@ -4351,25 +4380,17 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev) >>>>> pay_len = 0; >>>>> } >>>>> >>>>> - stmmac_tso_allocator(priv, des, tmp_pay_len, (nfrags == 0), queue); >>>>> + if (stmmac_tso_allocator(priv, (skb->data + proto_hdr_len), >>>>> + tmp_pay_len, nfrags == 0, queue, false)) >>>>> + goto dma_map_err; >>>> >>>> Changing the second argument here is subtly changing the dma_cap.addr64 <= 32 >>>> case right before this. Is that intentional? >>>> >>>> i.e., prior, pretend des = 0 (side note but des is a very confusing variable >>>> name for "dma address" when there's also mentions of desc meaning "descriptor" >>>> in the DMA ring). In the <= 32 case, we'd call stmmac_tso_allocator(priv, 0) >>>> and in the else case we'd call stmmac_tso_allocator(priv, 0 + proto_hdr_len). >>>> >>>> With this change in both cases its called with the (not-yet-dma-mapped) >>>> skb->data + proto_hdr_len always (i.e. like the else case). >>>> >>>> Honestly, the <= 32 case reads weird to me without this patch. It seems some >>>> of the buffer is filled but des is not properly incremented? >>>> >>>> I don't know how this hardware is supposed to be programmed (no databook >>>> access) but that seems fishy (and like a separate bug, which would be nice to >>>> squash if so in its own patch). Would you be able to explain the logic there >>>> to me if it does make sense to you? >>>> >>> >>>> <Suraj> des can not be 0 . des 0 means dma_map_single() failed and it will return. >>>> If we see if des calculation (first->des1 = cpu_to_le32(des + proto_hdr_len);) >>>> and else case des calculator ( des += proto_hdr_len;) it is adding proto_hdr_len >>>> to the memory that we after mapping skb->data using dma_map_single. >>>> Same way we added proto_hdr_len in second argument . >>> >>> >>> 0 was just an example, and a confusing one, sorry. Let me paste the original >>> fishy code that I think you've modified the behavior for. Here's the >>> original: >>> >>> if (priv->dma_cap.addr64 <= 32) { >>> first->des0 = cpu_to_le32(des); >>> >>> /* Fill start of payload in buff2 of first descriptor */ >>> if (pay_len) >>> first->des1 = cpu_to_le32(des + proto_hdr_len); >>> >>> /* If needed take extra descriptors to fill the remaining payload */ >>> tmp_pay_len = pay_len - TSO_MAX_BUFF_SIZE; >>> } else { >>> stmmac_set_desc_addr(priv, first, des); >>> tmp_pay_len = pay_len; >>> des += proto_hdr_len; >>> pay_len = 0; >>> } >>> >>> stmmac_tso_allocator(priv, des, tmp_pay_len, (nfrags == 0), queue); >>> >>> Imagine the <= 32 case. Let's say des is address 0 (just for simplicity >>> sake, let's assume that's valid). That means: >>> >>> first->des0 = des; >>> first->des1 = des + proto_hdr_len; >>> stmmac_tso_allocator(priv, des, tmp_pay_len, (nfrags == 0), queue) >>> >>> if des is 0, proto_hdr_len is 64, then that means >>> >>> first->des0 = 0 >>> first->des1 = 64 >>> stmmac_tso_allocator(priv, 0, tmp_pay_len, (nfrags == 0), queue) >>> >>> That seems fishy to me. We setup up the first descriptor with the >>> beginning of des, and then the code goes and sets up more descriptors >>> (stmmac_tso_allocator()) starting with the same des again? >> tso_alloc is checking if more descriptor needed for packet . it is adding offset to get next >> descriptor (curr_addr = des + (total_len - tmp_len)) and storing in des of next descriptor. > > Yes, so in stmmac_tso_allocator() we currently have: > > static void stmmac_tso_allocator(struct stmmac_priv *priv, dma_addr_t des, > int total_len, bool last_segment, u32 queue) > { > struct stmmac_tx_queue *tx_q = &priv->dma_conf.tx_queue[queue]; > struct dma_desc *desc; > u32 buff_size; > int tmp_len; > > tmp_len = total_len; > > while (tmp_len > 0) { > dma_addr_t curr_addr; > > tx_q->cur_tx = STMMAC_GET_ENTRY(tx_q->cur_tx, > priv->dma_conf.dma_tx_size); > ... > curr_addr = des + (total_len - tmp_len); > if (priv->dma_cap.addr64 <= 32) > desc->des0 = cpu_to_le32(curr_addr); > > so on the first loop you've got: > tmp_len = total_len > ... > curr_addr = des + total_len - temp_len > i.e. > curr_addr = des > meaning with the "first" handling I've highlighted we've got > first->des0 = des > "next"->des0 = des > > where "next" is the cur_tx descriptor in the first loop of > stmmac_tso_allocator() (essentially the second descriptor). > That seems broken to me, and was that way prior to this patch. > able to verify for DMA mask < 32 and > 32 . We will update final patch by taking cares of others commnet > You've modified the behavior in this patch unintentionally. I think it > needs modifying, but it should be done so explicitly in its own patch > prior to this one. I also think the current modification in this patch > isn't a fix. See prior reply below where I highlighted the programming as I > understand it with this patch applied, which would result in something > like. > > first->des0 = des > first->des1 = des + proto_hdr_len > "next"->des0 = des + proto_hdr_len > > Which again seems wrong, two descriptors pointing to the same address > isn't making sense to me. > > Sorry to sound like a broken record, but I want to make sure we're on > the same page! Sounds like you're looking into it based on the below > comment, but some of these comments here made me think I didn't explain > the situation well enough. > >>> >>> Should we be adding the payload length (TSO_MAX_BUFF_SIZE I suppose >>> based on the tmp_pay_len = pay_len - TSO_MAX_BUFFSIZE above)? It seems >>> that <= 32 results in duplicate data in both the "first" descriptor >>> programmed above, and in the "second" descriptor programmed in >>> stmmac_tso_allocator(). >> curr_addr = des + (total_len - tmp_len) is used in while loop in tso_alloc to get address of all required descriptor . >> descriptor address will be updated finally in tso_alloc by below call . >> >> if (priv->dma_cap.addr64 <= 32) >> desc->des0 = cpu_to_le32(curr_addr); >> else >> stmmac_set_desc_addr(priv, desc, curr_addr); >> >> Also, since tmp_pay_len is decremented, but des >>> isn't, it seems that stmmac_tso_allocator() would not put all of the >>> buffer in the descriptors and would leave the last TSO_MAX_BUFF_SIZE >>> bytes out? >>> >>> I highlight all of this because with your change here we get the >>> following now in the <= 32 case: >>> >>> first->des0 = des >>> first->des1 = des + proto_hdr_len >>> stmmac_tso_allocator(priv, des + proto_hdr_len, ...) >>> >>> which is a subtle change in the call to stmmac_tso_allocator, meaning >>> a subtle change in the descriptor programming. >>> >>> Both seem wrong for the <= 32 case, but I'm "reading between the lines" >>> with how these descriptors are programmed (I don't have the docs to back >>> this up, I'm inferring from the code). It seems to me that in the <= 32 >>> case we should have: >>> >>> first->des0 = des >>> first->des1 = des + proto_hdr_len >>> stmmac_tso_allocator(priv, des + TSO_MAX_BUF_SIZE, ...) >> >> let me check <=32 case only on setup and get back. >>> >>> or similar depending on if that really makes sense with how des0/des1 is >>> used (the handling is different in stmmac_tso_allocator() for <= 32, >>> only des0 is used so I'm having a tough time figuring out how much of >>> the des is actually programmed in des0 + des1 above without knowing the >>> hardware better). >>> >>> Does that make sense? The prior code seems fishy to me, your change >>> seems to unintentionally change that fhsy part, but it still seems fishy >>> to me. I don't think you should be changing that code's behavior in that >>> patch, if you think it's right then we should continue with the current >>> behavior prior to your patch, and if you think its wrong we should >>> probably fix that *prior* to this patch in your series. >>> >>> Thanks, >>> Andrew >>> >> >