On Mon, Mar 17, 2025 at 01:52:15PM +0800, Miaoqing Pan wrote: > On 3/14/2025 4:06 PM, Johan Hovold wrote: > > On Fri, Mar 14, 2025 at 09:01:36AM +0800, Miaoqing Pan wrote: > >> I think the hardware has already ensured synchronization between > >> descriptor and head pointer, which isn't difficult to achieve. The issue > >> is likely caused by something else and requires further debugging. > > > > Yeah, but you still need memory barriers on the kernel side. > > > > It could be that we are looking at two different causes for those > > zero-length descriptors. > > > > The error handling for that obviously needs to be fixed either way, but > > I haven't heard anyone hitting the corruption with the memory barriers > > in place on the X13s yet (even if we'd need some more time to test > > this). > After multiple and prolonged verifications, adding dma_rmb() did not > improve the issue at all. I think this Status Descriptor is updated by > hardware (Copy Engine) controlled by another system, not involving DMA > or out-of-order CPU access within the same system, so memory barriers do > not take effect. Then it seems we are looking at two separate root causes for the corruption as the memory barrier appears to be all that is needed to fix the X13s issue. A user who hit the corruption after 2 h without the fix has been running over the weekend with the memory barrier without any problems. I'll ask further users to test, but it certainly looks like it is working as intended. And the memory barrier is de-facto missing as the head pointer and descriptor are accessed through (two separate) coherent mappings so there are no ordering guarantees without explicit barriers. Now obviously there are further issues in your system, which we should make sure we understand before adding workarounds to the driver. Do you have a pointer to the downstream kernel sources you are testing with? Or even better, can you reproduce the issue with mainline after adding the PCIe patches that were posted to the lists for these platforms? Apparently the descriptors can also be passed in non-coherent memory for some devices (.alloc_cacheable_memory / HAL_SRNG_FLAGS_CACHED). That implementation looks suspicious and could possibly result in similar problems. Are you using .alloc_cacheable_memory in your setup? Does it make any difference if you use a full rmb() barrier? And after modifying ath11k_hal_ce_dst_status_get_length() so that it does not clear the length, how many times you need to retry? Does it always work on the second try? Johan