Hi Georgi,

Please find my question inline, marked with [Sanjay].

-----Original Message-----
From: Georgi Djakov <georgi.djakov@xxxxxxxxxx>
Sent: Monday, November 19, 2018 7:38 AM
To: Krishna Sitaraman <ksitaraman@xxxxxxxxxx>; Thierry Reding <thierry.reding@xxxxxxxxx>
Cc: Jonathan Hunter <jonathanh@xxxxxxxxxx>; Peter De Schrijver <pdeschrijver@xxxxxxxxxx>; Dmitry Osipenko <digetx@xxxxxxxxx>; linux-tegra <linux-tegra@xxxxxxxxxxxxxxx>; Sanjay Chandrashekara <sanjayc@xxxxxxxxxx>; Linux PM list <linux-pm@xxxxxxxxxxxxxxx>
Subject: Re: interconnects on Tegra

Hi Thierry and Krishna,

On 11/17/18 01:16, Krishna Sitaraman wrote:
> My responses are inline. (I am using Outlook and currently my responses are marked with [Krishna S]. In the near future I will switch to a more Linux-friendly indentation, so please bear with me until then.)
>
> -----Original Message-----
> From: Thierry Reding <thierry.reding@xxxxxxxxx>
> Sent: Friday, November 16, 2018 3:54 AM
> To: Georgi Djakov <georgi.djakov@xxxxxxxxxx>
> Cc: Krishna Sitaraman <ksitaraman@xxxxxxxxxx>; Jonathan Hunter <jonathanh@xxxxxxxxxx>; Peter De Schrijver <pdeschrijver@xxxxxxxxxx>; Dmitry Osipenko <digetx@xxxxxxxxx>; linux-tegra <linux-tegra@xxxxxxxxxxxxxxx>; Sanjay Chandrashekara <sanjayc@xxxxxxxxxx>; Linux PM list <linux-pm@xxxxxxxxxxxxxxx>
> Subject: Re: interconnects on Tegra
>
> On Fri, Nov 16, 2018 at 01:06:46PM +0200, Georgi Djakov wrote:
>> Hi,
>>
>> Adding linux-pm list.
>>
>> On 11/15/18 18:47, Thierry Reding wrote:
>>> On Thu, Nov 15, 2018 at 01:17:58AM +0000, Krishna Sitaraman wrote:
>>>> Thierry, thanks for looping us in. Reading this thread, I believe Peter has given a brief overview of how we manage the memory bus clock along with latency and priority information to the memory subsystem for isochronous clients. I want to add to this:
>>>> - The total amount of isochronous bandwidth achievable is limited, and we have to arbitrate the available ISO bandwidth at runtime among the clients. So our Tegra framework provides a mechanism for clients to check and then lock a particular ISO bandwidth before attempting to switch to the desired mode which uses it.
>>>> - The Tegra framework also provides a mechanism for the isochronous clients to set latency and/or priority information for the memory arbitration hardware.
>>>>
>>>> The interconnect framework seems to be a good fit and we might be able to make use of it. However, there is certain additional functionality we want to request or suggest that can help in using the interconnects.
>>
>> Thanks for the comments!
>>
>>>> Listing them out here:
>>>>
>>>> 1. The isochronous bandwidth manager needs to provide feedback to the clients (consumers) so they know whether a particular ISO bandwidth request is possible before they make a definite switch, for example display wanting to know if a mode is possible before switching to the new configuration. The interconnect framework needs a method for the provider to give feedback to the requesting consumer: a check or "is_possible" request before the actual set request.
>>>
>>> How would we handle races with other bandwidth requests? Let's say the display driver requests a check for a certain bandwidth, then gets back a positive reply, after which it would go make the actual request, but at that point somebody else might have already requested additional bandwidth, in turn making the actual request fail.
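[Editorial sketch, to make the race concrete: a minimal two-step check-then-set sequence, assuming the icc_set(path, avg_bw, peak_bw) signature and <linux/interconnect.h> header from the patchset under review. The icc_check_bw() helper is hypothetical and does not exist in the framework; it only illustrates the window between the check and the actual request.]

  /*
   * Hypothetical illustration only: icc_check_bw() is NOT part of the
   * proposed framework; it is shown just to make the race visible.
   */
  static int display_try_mode(struct icc_path *path, u32 avg_kbps, u32 peak_kbps)
  {
          int ret;

          ret = icc_check_bw(path, avg_kbps, peak_kbps);  /* "yes, it fits" */
          if (ret)
                  return ret;

          /*
           * Race window: another consumer may request bandwidth between the
           * check above and the set below, so icc_set() can still fail even
           * though the check succeeded.
           */
          return icc_set(path, avg_kbps, peak_kbps);
  }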
>>> Would it perhaps be a good idea to just make the actual request fail if there's not enough bandwidth and allocate the requested bandwidth if it is available? That way, after the consumer gets back a positive reply to the request, it knows that it can use it. On the other hand, if the reply was negative, it can downgrade to some other mode and retry.
>>>
>>> Any solution that involves two separate steps would probably require some way of locking the provider until the actual request has been satisfied, right?
>>
>> Agree with Thierry.
>>
>> So icc_set() propagates any non-zero return value back to the caller. The interconnect platform driver could return -EAGAIN or -EBUSY or any other value (remaining bandwidth?) to indicate that the bandwidth is not available. Then this could be handled by the consumer driver (e.g. the display driver will retry with a lower resolution). Does this sound like an option for you?
>>
>
> [Krishna S] We do handle the race condition. Internally, in Tegra, we call these functions the reserve() and realize() calls. During a reserve() call, the client requests a certain ISO bandwidth and, if it's available, the ISO bandwidth manager reserves the requested bandwidth for this client and responds with a positive reply. The reserved bandwidth is treated as committed and is not allocated to other clients. When the client makes the realize() call, that's when the actual memory parameters are modified according to the previously reserved value. The critical part of the reserve() function is protected by locks in order to avoid the race condition. If icc_set() can propagate a non-zero return value, that is a good option to consider.

Ok, good. Thanks for confirming!

[Sanjay]: Yes, using the value returned by icc_set() looks like a good solution. However, when icc_set() fails, we should undo what aggregate_requests(node) had done earlier. Otherwise the icc_node's (avg_bw, peak_bw) and the interconnect summary debugfs node would not be accurate. Please let me know your thoughts on this.

>>> Also, if we do move towards something like this, perhaps a better name for the function to file these requests would be icc_request() or something like that. This would make it somewhat more obvious that the function is actually merely a request, rather than a hard "apply" operation.
>>
>> If we decide to extend the API with icc_request(), I imagine that the first step would be to call icc_request() on a path, which would try to reserve the requested bandwidth (without an actual request) until the subsequent icc_set() is called. Something like this is doable, but will require some more thought.
>
> I think I may have been unclear. My suggestion was to rename icc_set() to icc_request() given that it is a request that can fail. Admittedly this is splitting hairs a little, and I don't feel strongly about it.

Thanks for clarifying, Thierry. Whether a request can fail or not is also platform specific. I prefer icc_set() as it is shorter and seems more natural to have in addition to icc_get() and icc_put().
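[Editorial sketch of how a consumer could use this single-call model: one way a display driver might fold the reserve()/realize() flow into icc_set() and downgrade on failure. It assumes the icc_get()/icc_set()/icc_put() signatures from the patchset under discussion; the mode list and function name are made up for illustration.]

  /* Sketch only; assumes <linux/interconnect.h> from the patchset under review. */
  static int display_set_mode(struct device *dev, struct icc_path *path,
                              const u32 *mode_kbps, unsigned int num_modes)
  {
          unsigned int i;
          int ret = -EINVAL;

          /*
           * Try the preferred (highest-bandwidth) mode first. A failing
           * icc_set() replaces the separate reserve() step: on success the
           * bandwidth is already committed to this consumer, so there is no
           * window between "check" and "use" for another client to race in.
           */
          for (i = 0; i < num_modes; i++) {
                  /* hard ISO requirement: pass the same value as avg and peak */
                  ret = icc_set(path, mode_kbps[i], mode_kbps[i]);
                  if (!ret)
                          return 0;       /* bandwidth granted, program the mode */
                  dev_dbg(dev, "mode %u kbps rejected (%d), trying a lower mode\n",
                          mode_kbps[i], ret);
          }

          return ret;     /* no mode fits the currently available ISO bandwidth */
  }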
> Having a single call to request bandwidth/latency/priority seems like the easiest API. It ensures that when icc_set() returns 0 (success) the bandwidth will already have been allocated for the request, so the consumer can continue assuming that it will be able to get enough bandwidth. If for any reason it ends up not needing the bandwidth after all, or needing less, it can simply call icc_set() again with different parameters.
>
> [Krishna S] Having a single call to request bandwidth/latency/priority as suggested above seems like an ideal method. We are having internal discussions to see if our legacy drivers can handle an overhaul to the new method. Some of them do other preparations between the reserve() and realize() API calls, and this needs a deeper dive, but in concept I believe a single-call method sounds cleaner. We will get back on this next week.

Alright, this sounds good to me. How do you define the latency - is it a round-trip latency between the endpoints? Do you use just a single required latency value or a range of min/max? What units do you use?

>>> Would it also perhaps be useful to have the function return the maximum amount of bandwidth available on failure? That could provide useful hints to the consumers about which modes to downgrade to and which will fail straight away anyway. Then again, this would entail a two-step approach, so the actual request for the lower bandwidth might still fail if some other consumer requested additional bandwidth since the failure.
>> As mentioned above, maybe the value returned by icc_set() could be used as a hint?
>>
>
> [Krishna S] We like the idea of the icc_set() return value being used as a hint, but the client needs to be careful here. The available bandwidth is very dynamic and a hint may become stale before the client acts on it. However, the hint can be designed to be used in a constructive way by each platform.
>
>>>> 2. How is peak_bw actually defined and what is the intended usage? We need clarity on this.
>>>>    a. An existing implementation from Qualcomm seems to take the max of all peak_bw values in their code. Does this mean that all consumers would not be using their peak_bw at the same time? Why is it not a sum of all peak_bw? This is not clear to us.
>>
>> The sum of all peak_bw will tend to significantly over-estimate the actual required bandwidth for the interconnect and also the power requirement. There is an implicit assumption that either the peaks won't always line up, or if they do, the use case is tolerant enough to handle it with the help of QoS hardware.
>>
>> The aggregation is platform specific, so each platform can specify how the bandwidth is aggregated and adjust accordingly.
>
> That sounds reasonable to me. I would expect that clients which have a very hard requirement (such as display) would pass the same bandwidth for both peak and average values. I would also presume that an implementation would sum up all average requests to get at the total requested bandwidth needed and use peak values for perhaps additional margin or so.
>
> [Krishna S] The explanation above makes sense to me given the implicit assumptions. Can any part of this be added to the documentation?

Ok, sure. I expect that different SoCs can differ in how they aggregate requests. I will add some more information about that.
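[Editorial sketch of the aggregation policy being described (sum the averages, take the max of the peaks). The parameter layout only approximates the framework's provider aggregate callback, so treat it as an illustration rather than the exact prototype.]

  /*
   * Sketch of a platform aggregation policy: sum the average bandwidth
   * requests and take the maximum of the peak requests, so that a large
   * peak_bw request also behaves as a floor for the interconnect.
   */
  static int example_aggregate(struct icc_node *node, u32 avg_bw, u32 peak_bw,
                               u32 *agg_avg, u32 *agg_peak)
  {
          *agg_avg += avg_bw;                     /* total expected traffic */
          *agg_peak = max(*agg_peak, peak_bw);    /* worst single burst / floor */

          return 0;
  }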
>>>> 3. In addition to peak_bw and avg_bw, can interconnects support a floor request on a clock? We need a floor request for clients which are affected by latency and not so much by bandwidth. For example, the CPU is more latency-sensitive than bandwidth-sensitive in some cases, so CPU clients set an EMC floor based on the current CPU frequency to satisfy a minimum latency need.
>>>
>>> What would be the difference between a floor request and a request for peak bandwidth corresponding to that floor frequency? Couldn't the CPU just register a regular bandwidth request to achieve the same goal? I mean, registering a peak bandwidth that meets the minimum requirements for the needed latency would ensure that bandwidth never goes below that value, so it would basically be a floor request. Perhaps this could also be a special case where peak bandwidth and average bandwidth are actually the same?
>>
>> So this depends on the aggregation. In the case above, if we take the max of all the peak_bw values, the peak_bw also operates as a floor request.
>>
>
> [Krishna S] Understood that we can salvage peak_bw and use it for floor purposes. But the name is very misleading if we end up using it for a floor. Hence the request for an explicit floor value; the platforms can then use either peak_bw or floor_bw or both.

I think this depends on the point of view. I was not able to come up with something more generic than this, but I am open to suggestions.

>>>> 4. Request to have tracing as a debug option. On every icc_set() call, print the path and the aggregated average bandwidth value.
>>>
>>> This could presumably be done with ftrace and would be, in my opinion, a good addition to this framework.
>>
>> This is a good idea!
>>
>>>> a. We also want to know what the request from every client is at a given instant, so that we can add test cases to ensure the EMC calculation code is doing the right thing. Automated tests can make use of this.
>>>
>>> Would a debugfs interface be useful for this? I could imagine that something similar to the common clock framework (a per-client file with current requests and an additional bandwidth_summary file with a list of all requests and perhaps a total) would work very well for this.
>>
>> There is already an interconnect_summary file, which lists all the requests from drivers with requested/aggregated values.
>
> Looks exactly like what I had in mind. Krishna, Sanjay, does that file look like something that we could use for testing? Would we need additional information in it?
>
> [Krishna S] Yes, if this exists, that is exactly what we need. If the requested value includes all the fields (client id, avg_bw, peak_bw, etc.) and the final value calculated by the provider, that is all the information we need.

Ok, great!

Thanks,
Georgi
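[Editorial sketch, for reference: how a latency-sensitive CPU client could express an EMC floor through the existing peak_bw parameter, assuming the max-of-peaks aggregation discussed above. The frequency-to-bandwidth helper and its scaling factor are made up; a real driver would derive the value from its own latency requirement.]

  /*
   * Sketch: express a latency floor as a peak bandwidth request. The
   * cpu_freq_to_floor_kbps() helper and its scaling are hypothetical.
   */
  static u32 cpu_freq_to_floor_kbps(unsigned long cpu_khz)
  {
          return cpu_khz / 4;     /* placeholder scaling, platform specific */
  }

  static int cpu_request_emc_floor(struct icc_path *path, unsigned long cpu_khz)
  {
          u32 floor = cpu_freq_to_floor_kbps(cpu_khz);

          /*
           * avg_bw = 0: the CPU is not asking for sustained bandwidth.
           * peak_bw = floor: with max-based aggregation this keeps the
           * interconnect from dropping below the required rate.
           */
          return icc_set(path, 0, floor);
  }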