Hi,

On 3/3/2020 8:17 PM, Mathias Nyman wrote:
> On 3.3.2020 12.24, Tejas Joglekar wrote:
>> Hi,
>> On 2/14/2020 2:06 PM, Tejas Joglekar wrote:
>>> Hi,
>>> On 2/12/2020 8:34 PM, Mathias Nyman wrote:
>>>> On 7.2.2020 19.17, Tejas Joglekar wrote:
>>>>> Hi,
>>>>> Thanks for the review comments.
>>>>> On 1/3/2020 10:14 PM, Mathias Nyman wrote:
>>>>>> On 20.12.2019 15.39, Tejas Joglekar wrote:
>>>>>>> The Synopsys xHC has an internal TRB cache of size TRB_CACHE_SIZE
>>>>>>> for each endpoint. The default value of TRB_CACHE_SIZE is 16 for SS
>>>>>>> and 8 for HS. The controller loads and updates the TRB cache from
>>>>>>> the transfer ring in system memory whenever the driver issues a
>>>>>>> start transfer or update transfer command.
>>>>>>>
>>>>>>> For chained TRBs, the Synopsys xHC requires that the total amount
>>>>>>> of bytes for all TRBs loaded in the TRB cache be greater than or
>>>>>>> equal to 1 MPS, or that the chain ends within the TRB cache (with
>>>>>>> a last TRB).
>>>>>>>
>>>>>>> If this requirement is not met, the controller will not be able to
>>>>>>> send or receive a packet and it will hang, causing a driver timeout
>>>>>>> and error.
>>>>>>>
>>>>>>> This can be a problem if a class driver queues SG requests with
>>>>>>> many small-buffer entries. The xHCI driver will create a chained
>>>>>>> TRB for each entry, which may trigger this issue.
>>>>>>>
>>>>>>> This patch adds logic to the xHCI driver to detect and prevent this
>>>>>>> from happening.
>>>>>>>
>>>>>>> For every (TRB_CACHE_SIZE - 2) TRBs, we check the total buffer size
>>>>>>> of the TRBs, and if the chain continues and we don't make up at
>>>>>>> least 1 MPS, we create a bounce buffer to consolidate up to the
>>>>>>> next (4 * MPS) TRBs into the last TRB.
>>>>>>>
>>>>>>> We check at (TRB_CACHE_SIZE - 2) because it is possible that a link
>>>>>>> and/or event data TRB takes up to 2 of the cache entries. And we
>>>>>>> consolidate the next (4 * MPS) to improve performance.
>>>>>>>
>>>>>>> We discovered this issue with devices on other platforms but have
>>>>>>> not yet come across any device that triggers this on Linux. But it
>>>>>>> could be a real problem now or in the future. All it takes is N
>>>>>>> small chained TRBs. And other instances of the Synopsys IP may have
>>>>>>> smaller values for TRB_CACHE_SIZE, which would exacerbate the
>>>>>>> problem.
>>>>>>>
>>>>>>> We verified this patch using our internal driver and testing
>>>>>>> framework.
>>>>>>
>>>>>>
>>>>> I understand that at first glance it looks like a complex solution,
>>>>> but you can suggest the modifications/changes that would be required
>>>>> to make the patch more readable. I have tried to keep the
>>>>> implementation similar to the existing bounce buffer implementation,
>>>>> only with the addition of a bounce buffer list. For the spinlock case,
>>>>> you can take a call on whether it is required or not.
>>>>
>>>> In your case you know the need for a bounce buffer much earlier than
>>>> in the existing TD fragment case.
>>>>
>>>> Have you looked into the struct hc_driver map_urb_for_dma() and
>>>> unmap_urb_for_dma() hooks? In those you could detect the need for a
>>>> bounce buffer, allocate it, and bluntly copy the entire scatter-gather
>>>> buffer to the bounce buffer. It should be fairly small anyway.
>>>>
>>> I will look into it, and get back to you. Thanks for the suggestion.
>>>
>> I looked into it and I have a question related to the approach you
>> suggested.
>> I can detect the need for the bounce buffer in map_urb_for_dma() and can
>> allocate a bounce buffer and bluntly copy into it, but when the SG list
>> carries several MBs of data, I think my bounce buffer allocation might
>> start to fail over time. Do you think that is possible?
>
> Doesn't sound very likely; the sg would need to have more than 16 entries
> of which the total length of 16 consecutive entries is less than 1024
> bytes, and then the following entries would have to be large enough to
> make the memory allocation fail.
> (for HS, the total length < 512 for 8 entries, and the rest huge)
>
> And no real-world device has yet even triggered the first condition of
> having 16 (8) sg entries with total length less than max packet size.
>
>>
>> So to avoid that, I thought of having a list of such bounce buffers held
>> with the URB; then I can break the SG list into bounce buffers of some
>> fixed data length (e.g. 16 KB or 32 KB) and copy the SG into that bounce
>> buffer list.
>>
>> Another option is to create a bounce SG, based on detection of the
>> bounce buffer requirement, where the small SG entries are combined to
>> create a new SG list which can satisfy the TRB cache requirement of the
>> SNPS controller.
>>
>> What do you suggest? Which is the best way to go about solving the
>> problem?
>
> As this is extremely rare (never reported in a real use case), I'd keep
> this as simple as possible. Allocate one bounce buffer when needed, and
> copy all. If memory allocation fails (unlikely), print a warning; then we
> immediately know what to fix.
>
> We are talking about sizes where 16 sg entries have in total 1024 bytes
> of data.
>
Ok, I agree.

> ehci-tegra.c does something related to what you want. It replaces
> urb->transfer_buffer with a dma aligned bounce buffer.
>
I tried this (in hcd.c) and it seems to be working in my tests. I want to
override unmap_urb_for_dma() and map_urb_for_dma() and keep everything
else the same as the xhci driver. I thought of using xhci_driver_override
for doing this instead of creating a whole new glue driver. Would that be
ok?

> -Mathias
>
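
For concreteness, a minimal sketch of the detection step described in the
patch text above: slide a window of (TRB_CACHE_SIZE - 2) consecutive SG
entries over the list and flag the URB if any window totals less than one
max packet while the chain continues. The xhci_snps_* names and the
constants are illustrative, not taken from the posted patch:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/usb.h>
#include <linux/usb/hcd.h>

#define SNPS_TRB_CACHE_SIZE_SS	16	/* per the thread: 16 TRBs for SS */
#define SNPS_TRB_CACHE_SIZE_HS	8	/* 8 TRBs for HS */

/*
 * Return true if any run of (trb_cache_size - 2) consecutive SG entries
 * totals less than one max packet while the chain continues; that is the
 * condition that can stall the Synopsys controller.
 */
static bool xhci_snps_needs_temp_buffer(struct urb *urb)
{
	unsigned int window, max_pkt, len = 0;
	struct scatterlist *sg, *head;
	int i;

	if (!urb->num_sgs)
		return false;

	max_pkt = usb_endpoint_maxp(&urb->ep->desc);
	window = (urb->dev->speed >= USB_SPEED_SUPER ?
		  SNPS_TRB_CACHE_SIZE_SS : SNPS_TRB_CACHE_SIZE_HS) - 2;

	head = urb->sg;
	for_each_sg(urb->sg, sg, urb->num_sgs, i) {
		len += sg->length;
		if (i < window - 1)
			continue;
		/* 'window' entries ending at 'sg' */
		if (len < max_pkt && !sg_is_last(sg))
			return true;
		len -= head->length;	/* slide the window forward */
		head = sg_next(head);
	}
	return false;
}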
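
And a sketch of the hc_driver override itself, in the spirit of the
ehci-tegra.c temp-buffer pattern Mathias mentions: allocate one linear
buffer, copy the whole scatter-gather payload, map it with
dma_map_single(), and copy back/restore on unmap. Error paths and partial
transfers are simplified; this is an assumption-laden illustration of the
approach under discussion, not the merged fix:

static int xhci_snps_map_urb_for_dma(struct usb_hcd *hcd, struct urb *urb,
				     gfp_t mem_flags)
{
	enum dma_data_direction dir;
	void *temp;

	if (!xhci_snps_needs_temp_buffer(urb))
		return usb_hcd_map_urb_for_dma(hcd, urb, mem_flags);

	temp = kzalloc(urb->transfer_buffer_length, mem_flags);
	if (!temp) {
		/* unlikely; warn so we immediately know what to fix */
		dev_warn(hcd->self.controller,
			 "snps temp buffer alloc failed (%u bytes)\n",
			 urb->transfer_buffer_length);
		return -ENOMEM;
	}

	/* for OUT transfers, consolidate the whole sg payload up front */
	if (usb_urb_dir_out(urb))
		sg_copy_to_buffer(urb->sg, urb->num_sgs, temp,
				  urb->transfer_buffer_length);

	dir = usb_urb_dir_in(urb) ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
	urb->transfer_buffer = temp;
	urb->transfer_dma = dma_map_single(hcd->self.sysdev, temp,
					   urb->transfer_buffer_length, dir);
	if (dma_mapping_error(hcd->self.sysdev, urb->transfer_dma)) {
		kfree(temp);
		urb->transfer_buffer = NULL;
		return -EAGAIN;
	}
	/* tell the core this urb was mapped as one linear buffer */
	urb->transfer_flags |= URB_DMA_MAP_SINGLE;
	return 0;
}

static void xhci_snps_unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb)
{
	if (urb->num_sgs && (urb->transfer_flags & URB_DMA_MAP_SINGLE)) {
		dma_unmap_single(hcd->self.sysdev, urb->transfer_dma,
				 urb->transfer_buffer_length,
				 usb_urb_dir_in(urb) ? DMA_FROM_DEVICE :
						       DMA_TO_DEVICE);
		urb->transfer_flags &= ~URB_DMA_MAP_SINGLE;
		/* for IN transfers, hand the received bytes back to the sg */
		if (usb_urb_dir_in(urb))
			sg_copy_from_buffer(urb->sg, urb->num_sgs,
					    urb->transfer_buffer,
					    urb->actual_length);
		kfree(urb->transfer_buffer);
		urb->transfer_buffer = NULL;
		return;
	}
	usb_hcd_unmap_urb_for_dma(hcd, urb);
}

Wiring these in is the open question Tejas raises above: struct
xhci_driver_overrides at the time of this thread only carries
extra_priv_size, reset and start, so it would presumably need extending
before the platform glue could install these two hooks without a whole
new glue driver.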