On 7.2.2020 19.17, Tejas Joglekar wrote:
> Hi,
> Thanks for the review comments.
> On 1/3/2020 10:14 PM, Mathias Nyman wrote:
>> On 20.12.2019 15.39, Tejas Joglekar wrote:
>>> The Synopsys xHC has an internal TRB cache of size TRB_CACHE_SIZE for
>>> each endpoint. The default value for TRB_CACHE_SIZE is 16 for SS and 8
>>> for HS. The controller loads and updates the TRB cache from the
>>> transfer ring in system memory whenever the driver issues a start
>>> transfer or update transfer command.
>>>
>>> For chained TRBs, the Synopsys xHC requires that the total amount of
>>> bytes for all TRBs loaded in the TRB cache be greater than or equal to
>>> 1 MPS, or that the chain ends within the TRB cache (with a last TRB).
>>>
>>> If this requirement is not met, the controller will not be able to
>>> send or receive a packet and it will hang, causing a driver timeout
>>> and error.
>>>
>>> This can be a problem if a class driver queues SG requests with many
>>> small-buffer entries. The XHCI driver will create a chained TRB for
>>> each entry, which may trigger this issue.
>>>
>>> This patch adds logic to the XHCI driver to detect and prevent this
>>> from happening.
>>>
>>> For every (TRB_CACHE_SIZE - 2) TRBs, we check the total buffer size of
>>> the TRBs, and if the chain continues and we don't make up at least
>>> 1 MPS, we create a bounce buffer to consolidate up to the next
>>> (4 * MPS) TRBs into the last TRB.
>>>
>>> We check at (TRB_CACHE_SIZE - 2) because it is possible that a link
>>> and/or event data TRB could take up 2 of the cache entries. And we
>>> consolidate the next (4 * MPS) to improve performance.
>>>
>>> We discovered this issue with devices on other platforms but have not
>>> yet come across any device that triggers this on Linux. But it could
>>> be a real problem now or in the future. All it takes is N number of
>>> small chained TRBs. And other instances of the Synopsys IP may have
>>> smaller values for TRB_CACHE_SIZE, which would exacerbate the problem.
>>>
>>> We verified this patch using our internal driver and testing
>>> framework.
>>
>>
> I understand that at first look it seems like a complex solution, but
> you can suggest the modifications/changes that would be required to
> make the patch more readable. I have tried to keep the implementation
> similar to the existing bounce buffer implementation, only with the
> addition of a bounce buffer list. For the spinlock case, you can decide
> whether it is required or not.

In your case you know the need for a bounce buffer much earlier than in
the existing TD fragment case.

Have you looked into the struct hc_driver map_urb_for_dma() and
unmap_urb_for_dma() hooks? In those you could detect the need for a
bounce buffer, allocate it, and bluntly copy the entire scatter-gather
buffer to the bounce buffer. It should be fairly small anyway.

-Mathias
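
For illustration, a rough sketch of the suggested map_urb_for_dma()
approach follows. The function and helper names here
(xhci_map_urb_for_dma_bounce(), urb_sg_needs_bounce()) and the
"any SG entry smaller than one max packet" trigger are assumptions for
the sketch, not existing xhci code; the matching unmap hook, which
would copy IN data back to the scatterlist, free the temporary buffer,
and restore the original URB fields, is omitted.

#include <linux/usb.h>
#include <linux/usb/hcd.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/dma-mapping.h>

/*
 * Hypothetical trigger: bounce whenever the URB uses scatter-gather and
 * at least one entry is smaller than one max packet.
 */
static bool urb_sg_needs_bounce(struct urb *urb)
{
	unsigned int maxp = usb_endpoint_maxp(&urb->ep->desc);
	struct scatterlist *sg;
	int i;

	if (!urb->num_sgs)
		return false;

	for_each_sg(urb->sg, sg, urb->num_sgs, i)
		if (sg->length < maxp)
			return true;

	return false;
}

static int xhci_map_urb_for_dma_bounce(struct usb_hcd *hcd, struct urb *urb,
				       gfp_t mem_flags)
{
	unsigned int len = urb->transfer_buffer_length;
	void *temp;

	/* Common case: fall back to the core's default mapping. */
	if (!urb_sg_needs_bounce(urb))
		return usb_hcd_map_urb_for_dma(hcd, urb, mem_flags);

	temp = kzalloc(len, mem_flags);
	if (!temp)
		return -ENOMEM;

	/*
	 * For OUT transfers, linearize the whole scatter-gather buffer
	 * into the bounce buffer now; IN data would be copied back in
	 * the unmap hook (not shown). A full implementation would also
	 * save the original transfer_buffer/sg pointers for restore.
	 */
	if (usb_urb_dir_out(urb))
		sg_copy_to_buffer(urb->sg, urb->num_sgs, temp, len);

	urb->transfer_buffer = temp;
	urb->transfer_dma = dma_map_single(hcd->self.sysdev, temp, len,
					   usb_urb_dir_in(urb) ?
					   DMA_FROM_DEVICE : DMA_TO_DEVICE);
	if (dma_mapping_error(hcd->self.sysdev, urb->transfer_dma)) {
		kfree(temp);
		return -EAGAIN;
	}

	/* Let the core unmap this as a single contiguous mapping. */
	urb->transfer_flags |= URB_DMA_MAP_SINGLE;
	return 0;
}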