Hi,

On 3/3/2020 8:17 PM, Mathias Nyman wrote:
> On 3.3.2020 12.24, Tejas Joglekar wrote:
>> Hi,
>> On 2/14/2020 2:06 PM, Tejas Joglekar wrote:
>>> Hi,
>>> On 2/12/2020 8:34 PM, Mathias Nyman wrote:
>>>> On 7.2.2020 19.17, Tejas Joglekar wrote:
>>>>> Hi,
>>>>> Thanks for the review comments.
>>>>> On 1/3/2020 10:14 PM, Mathias Nyman wrote:
>>>>>> On 20.12.2019 15.39, Tejas Joglekar wrote:
>>>>>>> The Synopsys xHC has an internal TRB cache of size TRB_CACHE_SIZE
>>>>>>> for each endpoint. The default value of TRB_CACHE_SIZE is 16 for SS
>>>>>>> and 8 for HS. The controller loads and updates the TRB cache from
>>>>>>> the transfer ring in system memory whenever the driver issues a
>>>>>>> start transfer or update transfer command.
>>>>>>>
>>>>>>> For chained TRBs, the Synopsys xHC requires that the total amount
>>>>>>> of bytes for all TRBs loaded in the TRB cache be greater than or
>>>>>>> equal to 1 MPS, or that the chain ends within the TRB cache (with
>>>>>>> a last TRB).
>>>>>>>
>>>>>>> If this requirement is not met, the controller will not be able to
>>>>>>> send or receive a packet and it will hang, causing a driver timeout
>>>>>>> and error.
>>>>>>>
>>>>>>> This can be a problem if a class driver queues SG requests with
>>>>>>> many small-buffer entries. The xHCI driver will create a chained
>>>>>>> TRB for each entry, which may trigger this issue.
>>>>>>>
>>>>>>> This patch adds logic to the xHCI driver to detect and prevent this
>>>>>>> from happening.
>>>>>>>
>>>>>>> For every (TRB_CACHE_SIZE - 2) TRBs, we check the total buffer size
>>>>>>> of the TRBs, and if the chain continues and we don't make up at
>>>>>>> least 1 MPS, we create a bounce buffer to consolidate up to the
>>>>>>> next (4 * MPS) TRBs into the last TRB.
>>>>>>>
>>>>>>> We check at (TRB_CACHE_SIZE - 2) because it is possible that a link
>>>>>>> and/or event data TRB takes up to 2 of the cache entries. And we
>>>>>>> consolidate the next (4 * MPS) to improve performance.
>>>>>>>
>>>>>>> We discovered this issue with devices on other platforms but have
>>>>>>> not yet come across any device that triggers this on Linux. But it
>>>>>>> could be a real problem now or in the future. All it takes is N
>>>>>>> small chained TRBs. And other instances of the Synopsys IP may have
>>>>>>> smaller values for TRB_CACHE_SIZE, which would exacerbate the
>>>>>>> problem.
>>>>>>>
>>>>>>> We verified this patch using our internal driver and testing
>>>>>>> framework.
>>>>>>
>>>>>>
>>>>> I understand that at first glance it looks like a complex solution,
>>>>> but you can suggest the modifications/changes that would be required
>>>>> to make the patch more readable. I have tried to keep the
>>>>> implementation similar to the existing bounce buffer implementation,
>>>>> only with the addition of a bounce buffer list. For the spinlock case,
>>>>> you can take a call on whether it is required or not.
>>>>
>>>> In your case you know the need for a bounce buffer much earlier than
>>>> in the existing TD fragment case.
>>>>
>>>> Have you looked into the struct hc_driver map_urb_for_dma() and
>>>> unmap_urb_for_dma() hooks? In those you could detect the need for a
>>>> bounce buffer, allocate it, and bluntly copy the entire scatter-gather
>>>> buffer to the bounce buffer. It should be fairly small anyway.
>>>>
>>> I will look into it, and get back to you. Thanks for the suggestion.
>>>
>> I looked into it and I have a question related to the approach you
>> suggested.
>> I can detect the need for the bounce buffer in map_urb_for_dma() and can
>> allocate a bounce buffer and bluntly copy into it, but when the SG list
>> carries several MBs of data, I think my bounce buffer allocation might
>> start to fail over time. Do you think that is possible?
>
> Doesn't sound very likely; the sg would need to have more than 16 entries
> of which the total length of 16 consecutive entries is less than 1024
> bytes, and then the following entries would have to be large enough to
> make the memory allocation fail.
> (for HS, the total length < 512 for 8 entries, and the rest huge)
>
> And no real-world device has yet even triggered the first condition of
> having 16 (8) sg entries with total length less than max packet size.
>
>>
>> So to avoid that, I thought of having a list of such bounce buffers held
>> with the URB; then I can break the SG list into bounce buffers of some
>> fixed data length (e.g. 16 KB or 32 KB) and copy the SG into that bounce
>> buffer list.
>>
>> Another option is to create a bounce SG, based on detection of the
>> bounce buffer requirement, where the small SG entries are combined to
>> create a new SG list which can satisfy the TRB cache requirement of the
>> SNPS controller.
>>
>> What do you suggest? Which is the best way to go about solving the
>> problem?
>
> As this is extremely rare (never reported in a real use case), I'd keep
> this as simple as possible. Allocate one bounce buffer when needed, and
> copy all. If memory allocation fails (unlikely), print a warning; then we
> immediately know what to fix.
>
> We are talking about sizes where 16 sg entries have in total 1024 bytes
> of data.
>
Ok, I agree.

> ehci-tegra.c does something related to what you want. It replaces
> urb->transfer_buffer with a dma aligned bounce buffer.
>
I tried this (in hcd.c) and it seems to be working in my tests. I want to
override unmap_urb_for_dma() and map_urb_for_dma() and keep everything
else the same as the xhci driver. I thought of using xhci_driver_override
for doing this instead of creating a whole new glue driver. Would that be
ok?

> -Mathias
>
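
For concreteness, a minimal sketch of the detection step described in the
patch text above: slide a window of (TRB_CACHE_SIZE - 2) consecutive SG
entries over the list and flag the URB if any window totals less than one
max packet while the chain continues. The xhci_snps_* names and the
constants are illustrative, not taken from the posted patch:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/usb.h>
#include <linux/usb/hcd.h>

#define SNPS_TRB_CACHE_SIZE_SS	16	/* per the thread: 16 TRBs for SS */
#define SNPS_TRB_CACHE_SIZE_HS	8	/* 8 TRBs for HS */

/*
 * Return true if any run of (trb_cache_size - 2) consecutive SG entries
 * totals less than one max packet while the chain continues; that is the
 * condition that can stall the Synopsys controller.
 */
static bool xhci_snps_needs_temp_buffer(struct urb *urb)
{
	unsigned int window, max_pkt, len = 0;
	struct scatterlist *sg, *head;
	int i;

	if (!urb->num_sgs)
		return false;

	max_pkt = usb_endpoint_maxp(&urb->ep->desc);
	window = (urb->dev->speed >= USB_SPEED_SUPER ?
		  SNPS_TRB_CACHE_SIZE_SS : SNPS_TRB_CACHE_SIZE_HS) - 2;

	head = urb->sg;
	for_each_sg(urb->sg, sg, urb->num_sgs, i) {
		len += sg->length;
		if (i < window - 1)
			continue;
		/* 'window' entries ending at 'sg' */
		if (len < max_pkt && !sg_is_last(sg))
			return true;
		len -= head->length;	/* slide the window forward */
		head = sg_next(head);
	}
	return false;
}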
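
And a sketch of the hc_driver override itself, in the spirit of the
ehci-tegra.c temp-buffer pattern Mathias mentions: allocate one linear
buffer, copy the whole scatter-gather payload, map it with
dma_map_single(), and copy back/restore on unmap. Error paths and partial
transfers are simplified; this is an assumption-laden illustration of the
approach under discussion, not the merged fix:

static int xhci_snps_map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
				     gfp_t mem_flags)
{
	enum dma_data_direction dir;
	void *temp;

	if (!xhci_snps_needs_temp_buffer(urb))
		return usb_hcd_map_urb_for_dma(hcd, urb, mem_flags);

	temp = kzalloc(urb->transfer_buffer_length, mem_flags);
	if (!temp) {
		/* unlikely; warn so we immediately know what to fix */
		dev_warn(hcd->self.controller,
			 "snps temp buffer alloc failed (%u bytes)\n",
			 urb->transfer_buffer_length);
		return -ENOMEM;
	}

	/* for OUT transfers, consolidate the whole sg payload up front */
	if (usb_urb_dir_out(urb))
		sg_copy_to_buffer(urb->sg, urb->num_sgs, temp,
				  urb->transfer_buffer_length);

	dir = usb_urb_dir_in(urb) ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
	urb->transfer_buffer = temp;
	urb->transfer_dma = dma_map_single(hcd->self.sysdev, temp,
					   urb->transfer_buffer_length, dir);
	if (dma_mapping_error(hcd->self.sysdev, urb->transfer_dma)) {
		kfree(temp);
		urb->transfer_buffer = NULL;
		return -EAGAIN;
	}
	/* tell the core this urb was mapped as one linear buffer */
	urb->transfer_flags |= URB_DMA_MAP_SINGLE;
	return 0;
}

static void xhci_snps_unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb)
{
	if (urb->num_sgs && (urb->transfer_flags & URB_DMA_MAP_SINGLE)) {
		dma_unmap_single(hcd->self.sysdev, urb->transfer_dma,
				 urb->transfer_buffer_length,
				 usb_urb_dir_in(urb) ? DMA_FROM_DEVICE :
						       DMA_TO_DEVICE);
		urb->transfer_flags &= ~URB_DMA_MAP_SINGLE;
		/* for IN transfers, hand the received bytes back to the sg */
		if (usb_urb_dir_in(urb))
			sg_copy_from_buffer(urb->sg, urb->num_sgs,
					    urb->transfer_buffer,
					    urb->actual_length);
		kfree(urb->transfer_buffer);
		urb->transfer_buffer = NULL;
		return;
	}
	usb_hcd_unmap_urb_for_dma(hcd, urb);
}

Wiring these in is the open question Tejas raises above: struct
xhci_driver_overrides at the time of this thread only carries
extra_priv_size, reset and start, so it would presumably need extending
before the platform glue could install these two hooks without a whole
new glue driver.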