> -----Original Message-----
> From: Arnd Bergmann <arnd@xxxxxxxx>
> Sent: Wednesday, October 6, 2021 3:33 PM
> To: Pkshih <pkshih@xxxxxxxxxxx>
> Cc: Arnd Bergmann <arnd@xxxxxxxx>; Kalle Valo <kvalo@xxxxxxxxxxxxxx>;
> linux-wireless@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH v6 03/24] rtw89: add core and trx files
>
> On Wed, Oct 6, 2021 at 3:35 AM Pkshih <pkshih@xxxxxxxxxxx> wrote:
> >
> > > > Comparing the object code side-by-side, it is almost the same
> > > > except for some instructions. I think this is because the inline
> > > > function I apply __always_inline to contains only a simple
> > > > statement.
> > >
> > > Ok. Did you check the output for the configuration that showed the
> > > problem as well, after adding __always_inline? There are certain
> > > compile-time options that could cause the code to become unoptimized,
> > > e.g. KASAN, in addition to the OPTIMIZE_FOR_SIZE.
> >
> > To summarize the object code sizes of the combinations:
> >
> >   ccflag          default   -Os
> >   =============   =======   =====
> >   inline          0x1AF     X
> >   always_inline   0x1AA     0x1A4
> >
> > With the default ccflag, the difference between inline and
> > always_inline is a je/jne instruction for
> > 'if (!desc_info->en_wd_info)'. The always_inline doesn't affect the
> > part that uses RTW89_SET_TXWD().
> >
> > Comparing within the always_inline row, the default ccflag case uses
> > movzbl (4 bytes), but the -Os case uses mov (3 bytes).
> >
> > By these results, -Os affects the object code size. always_inline
> > doesn't affect the code itself, but it does affect the nearby
> > instruction (je/jne).
>
> Those are the known-good cases, yes.
>
> > I use an Ubuntu kernel that doesn't enable KASAN.
> >   # CONFIG_KASAN is not set
>
> Ah, so you test using the driver backports package on a distro
> kernel? While this may be a good option for your development
> needs, I think it is generally a good idea to also be able to test
> your patches against the latest mainline or linux-next kernel
> directly, if only to ensure that there are no obvious regressions.

No, I don't use backports. I use the Ubuntu kernel PPA [1] to upgrade my
kernel regularly, so it is almost the latest version.

> > > > > +#define RTW89_SET_TXWD_BODY_WP_OFFSET(txdesc, val) \
> > > > > +        RTW89_SET_TXWD(txdesc, val, 0x00, GENMASK(31, 24))
> > > > > +#define RTW89_SET_TXWD_BODY_MORE_DATA(txdesc, val) \
> > > > > +        RTW89_SET_TXWD(txdesc, val, 0x00, BIT(23))
> > > > > +#define RTW89_SET_TXWD_BODY_WD_INFO_EN(txdesc, val) \
> > > > > +        RTW89_SET_TXWD(txdesc, val, 0x00, BIT(22))
> > > > > +#define RTW89_SET_TXWD_BODY_FW_DL(txdesc, val) \
> > > > > +        RTW89_SET_TXWD(txdesc, val, 0x00, BIT(20))
> > > > >
> > > > > I would personally write this without the wrappers, defining the
> > > > > bitmask macros as plain masks and open-coding the
> > > > > le32p_replace_bits() calls, which I would find more
> > > > > intuitive while it avoids the problem with the bitmasks.
> > > >
> > > > Using these macros, we can address the offset and bit fields
> > > > quickly. How about I use a macro instead of an inline function?
> > > > Like,
> > > >
> > > > #define RTW89_SET_TXWD(txdesc, val, offset, mask) \
> > > > do { \
> > > >         u32 *txd32 = (u32 *)(txdesc); \
> > > >         le32p_replace_bits((__le32 *)(txd32 + (offset)), val, mask); \
> > > > } while (0)
> > >
> > > That would obviously address the immediate bug, but I think
> > > using le32p_replace_bits() directly here would actually be
> > > more readable, after you define the descriptor layout using
> > > a structure with named __le32 members to replace the offset.
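Concretely, the direct le32p_replace_bits() form would look something
like this untested sketch; the struct and field names here are made up
for illustration and are not the real rtw89 layout:

#include <linux/bitfield.h>
#include <linux/bits.h>
#include <linux/types.h>

/* illustrative descriptor layout, one named member per word */
struct rtw89_txwd_body {
        __le32 dword0;
        __le32 dword1;
};

#define RTW89_TXWD_BODY0_WP_OFFSET      GENMASK(31, 24)
#define RTW89_TXWD_BODY0_MORE_DATA      BIT(23)

static void rtw89_fill_txwd(struct rtw89_txwd_body *txwd, u8 wp_offset,
                            bool more_data)
{
        /* the named member replaces the 0x00 word offset of the macros */
        le32p_replace_bits(&txwd->dword0, wp_offset,
                           RTW89_TXWD_BODY0_WP_OFFSET);
        le32p_replace_bits(&txwd->dword0, more_data,
                           RTW89_TXWD_BODY0_MORE_DATA);
}

That already reads better to me than the offset-based wrapper.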
> > I will remove the wrapper and use le32p_replace_bits() directly.
> >
> > I don't plan to use a structure, because the data contains bit-fields.
> > Then, I would need to maintain little-/big-endian formats, like
> >
> > struct foo {
> > #if BIG_ENDIAN
> >         __le32 msb:1;
> >         __le32 rsvd:30;
> >         __le32 lsb:1;
> > #else
> >         __le32 lsb:1;
> >         __le32 rsvd:30;
> >         __le32 msb:1;
> > #endif
> > };
>
> Right, bitfields would not work well here, as they are generally not
> portable. Using an "#ifdef __BIG_ENDIAN_BITFIELD" check can
> work, but as you say this is really ugly.
>
> What I was trying to suggest instead is a structure like
>
> struct descriptor {
>         __le32 word0;
>         __le32 word1;
>         __le32 word2;
>         __le32 word3;
> };
>
> And then build the descriptor like (with proper naming of the fields
> of course)
>
> void fill_descriptor(struct my_device *dev, struct sk_buff *skb,
>                      volatile struct descriptor *d)
> {
>         d->word0 = build_desc_word0(fieldA, fieldB, fieldC, fieldD);
>         d->word1 = build_desc_word1(fieldE, fieldF);
>         ...
> }
>
> where the build_desc_word0() functions are the ones that encode the
> actual layout, e.g. using the linux/bitfield.h helpers like
>
> static inline __le32 build_desc_word0(u32 fieldA, u32 fieldB,
>                                       u32 fieldC, u32 fieldD)
> {
>         u32 word = FIELD_PREP(REG_FIELD_A, fieldA) |
>                    FIELD_PREP(REG_FIELD_B, fieldB) |
>                    FIELD_PREP(REG_FIELD_C, fieldC) |
>                    FIELD_PREP(REG_FIELD_D, fieldD);
>
>         return cpu_to_le32(word);
> }
>
> Doing it this way has the advantage of keeping the assignments
> separate, which makes sure you don't accidentally introduce
> a read-modify-write cycle on the descriptor. This should work
> well on all architectures using dma_alloc_coherent() buffers.

Got it.

> > > > > Going back one more step, I see that rtw89_core_fill_txdesc()
> > > > > manipulates the descriptor fields in-memory, which also seems
> > > > > like a bad idea: The descriptor is mapped as cache-coherent,
> > > > > so on machines with no coherent DMA (i.e. most ARM or MIPS
> > > > > machines), that is uncached memory, and writing the descriptor
> > > > > using a series of read-modify-write cycles on uncached memory
> > > > > will be awfully slow. Maybe the answer is to just completely
> > > > > replace the descriptor access.
> > > >
> > > > I'll consider whether we can use cached memory with
> > > > dma_map_single()/dma_unmap_single() for the descriptor. That
> > > > would improve the performance.
> > >
> > > Using dma_unmap_single() with its cache flush may not work
> > > correctly if the descriptor fields have to be written in a particular
> > > order. Usually the last field in a descriptor contains a 'valid'
> > > bit that must not be observed by the hardware before the rest
> > > is visible. The cache flush however would not guarantee the
> > > order of the update.
> >
> > Is it possible to flush the cache twice? Write the fields other than
> > the 'valid' bit, then do wmb() and the first flush. Then set the
> > 'valid' bit and do the second flush.
>
> This could work, but it would be really expensive, since the
> dma-mapping API is based on ownership state transitions, so
> you'd have to go through dma_sync_single_for_device(),
> dma_sync_single_for_cpu(), and another
> dma_sync_single_for_device(). On machines using swiotlb,
> those would in turn translate into copy operations.
>
> > > It would also likely be slower than dma_alloc_coherent() on
> > > machines that have cache-coherent PCI, such as most x86.
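Spelling your point out for myself, the double-flush idea would need a
sequence like this under the streaming API (an untested sketch; the
'valid' bit value is made up, and struct descriptor is the one from
your example above):

#include <linux/dma-mapping.h>

static void push_descriptor(struct device *dev, struct descriptor *desc,
                            dma_addr_t desc_dma, __le32 w0, __le32 w1,
                            __le32 w2)
{
        /* write everything except the word carrying the 'valid' bit,
         * then transfer ownership so the device can observe it */
        desc->word0 = w0;
        desc->word1 = w1;
        desc->word2 = w2;
        dma_sync_single_for_device(dev, desc_dma, sizeof(*desc),
                                   DMA_TO_DEVICE);

        /* take ownership back just to set a single bit ... */
        dma_sync_single_for_cpu(dev, desc_dma, sizeof(*desc),
                                DMA_TO_DEVICE);
        desc->word3 |= cpu_to_le32(BIT(31));    /* made-up 'valid' bit */

        /* ... and hand the descriptor to the device once more */
        dma_sync_single_for_device(dev, desc_dma, sizeof(*desc),
                                   DMA_TO_DEVICE);
}

Three sync calls per descriptor does look expensive, indeed.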
> > > The best way is usually to construct the descriptor one word
> > > at a time in registers, and write that word using WRITE_ONCE(),
> > > with an explicit dma_wmb() before the final write that makes
> > > the descriptor valid.
> >
> > Thanks for the guidance.
> >
> > Fortunately, the descriptors of this hardware use a circular ring
> > buffer with read/write indexes instead of a 'valid' bit. To issue a
> > packet to the hardware, we fill the descriptor, including the address
> > of the skb, and then update the write index (a register) to trigger
> > the hardware to start DMA for this packet. So, I think it is possible
> > to use dma_map_single().
> >
> > Anyway, I will try both methods later.
>
> If you end up with the streaming mapping, I would suggest using a
> single dma_alloc_noncoherent(), followed by dma_sync_single_*
> later on, rather than multiple map/unmap calls that would need to
> reprogram the IOMMU. The coherent API as I explained above
> should be more efficient though, unless you need to do a lot of
> reads from the descriptors.

OK. I will try dma_alloc_noncoherent() and measure the performance, but
it seems like you have already told me the answer. Thanks again for
your detailed guidance.

[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/

--
Ping-Ke
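P.S. Noting down the one-word-at-a-time pattern before I forget it. An
untested sketch only: the write-index register offset and all names are
made up rather than actual rtw89 code, and struct descriptor is again
the layout from your example:

#include <linux/io.h>
#include <linux/dma-mapping.h>
#include <linux/kernel.h>

#define R_AX_TX_WIDX    0x01a0  /* made-up write-index register */

static void submit_tx(void __iomem *mmio, struct descriptor *d,
                      dma_addr_t skb_dma, u16 new_widx)
{
        /* build each word in a register and store it exactly once,
         * avoiding read-modify-write cycles on the coherent buffer */
        WRITE_ONCE(d->word0, cpu_to_le32(lower_32_bits(skb_dma)));
        WRITE_ONCE(d->word1, cpu_to_le32(upper_32_bits(skb_dma)));

        /* order the descriptor stores before the doorbell write */
        dma_wmb();

        /* advancing the write index takes the place of a 'valid' bit:
         * only now may the hardware look at this descriptor */
        writel(new_widx, mmio + R_AX_TX_WIDX);
}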