Hi Georgi,

Please find my question inline, marked with [Sanjay].

-----Original Message-----
From: Georgi Djakov <georgi.djakov@xxxxxxxxxx>
Sent: Monday, November 19, 2018 7:38 AM
To: Krishna Sitaraman <ksitaraman@xxxxxxxxxx>; Thierry Reding <thierry.reding@xxxxxxxxx>
Cc: Jonathan Hunter <jonathanh@xxxxxxxxxx>; Peter De Schrijver <pdeschrijver@xxxxxxxxxx>; Dmitry Osipenko <digetx@xxxxxxxxx>; linux-tegra <linux-tegra@xxxxxxxxxxxxxxx>; Sanjay Chandrashekara <sanjayc@xxxxxxxxxx>; Linux PM list <linux-pm@xxxxxxxxxxxxxxx>
Subject: Re: interconnects on Tegra

Hi Thierry and Krishna,

On 11/17/18 01:16, Krishna Sitaraman wrote:
> My responses are inline. (I am using Outlook and currently my responses are marked with [Krishna S]. In the near future I will switch to a more Linux-friendly indentation, so please bear with me until then.)
>
> -----Original Message-----
> From: Thierry Reding <thierry.reding@xxxxxxxxx>
> Sent: Friday, November 16, 2018 3:54 AM
> To: Georgi Djakov <georgi.djakov@xxxxxxxxxx>
> Cc: Krishna Sitaraman <ksitaraman@xxxxxxxxxx>; Jonathan Hunter <jonathanh@xxxxxxxxxx>; Peter De Schrijver <pdeschrijver@xxxxxxxxxx>; Dmitry Osipenko <digetx@xxxxxxxxx>; linux-tegra <linux-tegra@xxxxxxxxxxxxxxx>; Sanjay Chandrashekara <sanjayc@xxxxxxxxxx>; Linux PM list <linux-pm@xxxxxxxxxxxxxxx>
> Subject: Re: interconnects on Tegra
>
> On Fri, Nov 16, 2018 at 01:06:46PM +0200, Georgi Djakov wrote:
>> Hi,
>>
>> Adding linux-pm list.
>>
>> On 11/15/18 18:47, Thierry Reding wrote:
>>> On Thu, Nov 15, 2018 at 01:17:58AM +0000, Krishna Sitaraman wrote:
>>>> Thierry, thanks for looping us in. Reading this thread, I believe Peter has given a brief overview of how we manage the memory bus clock along with latency and priority information to the memory subsystem for isochronous clients. I want to add to this:
>>>> - The total amount of isochronous bandwidth achievable is limited, and we have to arbitrate the available ISO bandwidth at runtime among the clients. So our Tegra framework provides a mechanism for clients to check and then lock a particular ISO bandwidth before attempting to switch to the desired mode which uses it.
>>>> - The Tegra framework also provides a mechanism for the isochronous clients to set latency and/or priority information for the memory arbitration hardware.
>>>>
>>>> The interconnect framework seems to be a good fit and we might be able to make use of it. However, there is certain additional functionality we want to request or suggest that can help in using the interconnects.
>>
>> Thanks for the comments!
>>
>>>> Listing them out here:
>>>>
>>>> 1. The isochronous bandwidth manager needs to provide feedback to the clients (consumers) so they know whether a particular ISO bandwidth request is possible before they make a definite switch, for example display wanting to know if a mode is possible before switching to the new configuration. The interconnect framework needs a method for the provider to give feedback to the requesting consumer: a check or "is_possible" request before the actual set request.
>>>
>>> How would we handle races with other bandwidth requests? Let's say the display driver requests a check for a certain bandwidth, then gets back a positive reply, after which it would go make the actual request, but at that point somebody else might have already requested additional bandwidth, in turn making the actual request fail.
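[Editorial sketch, to make the race concrete: a minimal two-step check-then-set sequence, assuming the icc_set(path, avg_bw, peak_bw) signature and <linux/interconnect.h> header from the patchset under review. The icc_check_bw() helper is hypothetical and does not exist in the framework; it only illustrates the window between the check and the actual request.]

  /*
   * Hypothetical illustration only: icc_check_bw() is NOT part of the
   * proposed framework; it is shown just to make the race visible.
   */
  static int display_try_mode(struct icc_path *path, u32 avg_kbps, u32 peak_kbps)
  {
          int ret;

          ret = icc_check_bw(path, avg_kbps, peak_kbps);  /* "yes, it fits" */
          if (ret)
                  return ret;

          /*
           * Race window: another consumer may request bandwidth between the
           * check above and the set below, so icc_set() can still fail even
           * though the check succeeded.
           */
          return icc_set(path, avg_kbps, peak_kbps);
  }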
>>> Would it perhaps be a good idea to just make the actual request fail if there's not enough bandwidth and allocate the requested bandwidth if it is available? That way, after the consumer gets back a positive reply to the request, it knows that it can use it. On the other hand, if the reply was negative, it can downgrade to some other mode and retry.
>>>
>>> Any solution that involves two separate steps would probably require some way of locking the provider until the actual request has been satisfied, right?
>>
>> Agree with Thierry.
>>
>> So icc_set() propagates any non-zero return value back to the caller. The interconnect platform driver could return -EAGAIN or -EBUSY or any other value (remaining bandwidth?) to indicate that the bandwidth is not available. Then this could be handled by the consumer driver (e.g. the display driver will retry with a lower resolution). Does this sound like an option for you?
>>
>
> [Krishna S] We do handle the race condition. Internally, in Tegra, we call these functions the reserve() and realize() calls. During a reserve() call, the client requests a certain ISO bandwidth and, if it's available, the ISO bandwidth manager reserves the requested bandwidth for this client and responds with a positive reply. The reserved bandwidth is treated as committed and is not allocated to other clients. When the client makes the realize() call, that's when the actual memory parameters are modified according to the previously reserved value. The critical part of the reserve() function is protected by locks in order to avoid the race condition. If icc_set() can propagate a non-zero return value, that is a good option to consider.

Ok, good. Thanks for confirming!

[Sanjay]: Yes, using the value returned by icc_set() looks like a good solution. However, when icc_set() fails, we should undo what aggregate_requests(node) had done earlier. Otherwise the icc_node's (avg_bw, peak_bw) and the interconnect summary debugfs node would not be accurate. Please let me know your thoughts on this.

>>> Also, if we do move towards something like this, perhaps a better name for the function to file these requests would be icc_request() or something like that. This would make it somewhat more obvious that the function is actually merely a request, rather than a hard "apply" operation.
>>
>> If we decide to extend the API with icc_request(), I imagine that the first step would be to call icc_request() on a path, which would try to reserve the requested bandwidth (without an actual request) until the subsequent icc_set() is called. Something like this is doable, but will require some more thought.
>
> I think I may have been unclear. My suggestion was to rename icc_set() to icc_request() given that it is a request that can fail. Admittedly this is splitting hairs a little, and I don't feel strongly about it.

Thanks for clarifying, Thierry. Whether a request can fail or not is also platform specific. I prefer icc_set() as it is shorter and seems more natural to have in addition to icc_get() and icc_put().
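[Editorial sketch of how a consumer could use this single-call model: one way a display driver might fold the reserve()/realize() flow into icc_set() and downgrade on failure. It assumes the icc_get()/icc_set()/icc_put() signatures from the patchset under discussion; the mode list and function name are made up for illustration.]

  /* Sketch only; assumes <linux/interconnect.h> from the patchset under review. */
  static int display_set_mode(struct device *dev, struct icc_path *path,
                              const u32 *mode_kbps, unsigned int num_modes)
  {
          unsigned int i;
          int ret = -EINVAL;

          /*
           * Try the preferred (highest-bandwidth) mode first. A failing
           * icc_set() replaces the separate reserve() step: on success the
           * bandwidth is already committed to this consumer, so there is no
           * window between "check" and "use" for another client to race in.
           */
          for (i = 0; i < num_modes; i++) {
                  /* hard ISO requirement: pass the same value as avg and peak */
                  ret = icc_set(path, mode_kbps[i], mode_kbps[i]);
                  if (!ret)
                          return 0;       /* bandwidth granted, program the mode */
                  dev_dbg(dev, "mode %u kbps rejected (%d), trying a lower mode\n",
                          mode_kbps[i], ret);
          }

          return ret;     /* no mode fits the currently available ISO bandwidth */
  }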
> Having a single call to request bandwidth/latency/priority seems like the easiest API. It ensures that when icc_set() returns 0 (success) the bandwidth will already have been allocated for the request, so the consumer can continue assuming that it will be able to get enough bandwidth. If for any reason it ends up not needing the bandwidth after all, or needing less, it can simply call icc_set() again with different parameters.
>
> [Krishna S] Having a single call to request bandwidth/latency/priority as suggested above seems like an ideal method. We are having internal discussions to see if our legacy drivers can handle an overhaul to the new method. Some of them do other preparations between the reserve() and realize() API calls, and this needs a deeper dive, but in concept I believe a single-call method sounds cleaner. We will get back on this next week.

Alright, this sounds good to me. How do you define the latency - is it a round-trip latency between the endpoints? Do you use just a single required latency value or a range of min/max? What units do you use?

>>> Would it also perhaps be useful to have the function return the maximum amount of bandwidth available on failure? That could provide useful hints to the consumers about which modes to downgrade to and which will fail straight away anyway. Then again, this would entail a two-step approach, so the actual request for the lower bandwidth might still fail if some other consumer requested additional bandwidth since the failure.
>> As mentioned above, maybe the value returned by icc_set() could be used as a hint?
>>
>
> [Krishna S] We like the idea of the icc_set() return value being used as a hint, but the client needs to be careful here. The available bandwidth is very dynamic and a hint may become stale before the client acts on it. However, the hint can be designed to be used in a constructive way by each platform.
>
>>>> 2. How is peak_bw actually defined and what is the intended usage? We need clarity on this.
>>>>    a. An existing implementation from Qualcomm seems to take the max of all peak_bw values in their code. Does this mean that all consumers would not be using their peak_bw at the same time? Why is it not a sum of all peak_bw? This is not clear to us.
>>
>> The sum of all peak_bw will tend to significantly over-estimate the actual required bandwidth for the interconnect and also the power requirement. There is an implicit assumption that either the peaks won't always line up, or if they do, the use case is tolerant enough to handle it with the help of QoS hardware.
>>
>> The aggregation is platform specific, so each platform can specify how the bandwidth is aggregated and adjust accordingly.
>
> That sounds reasonable to me. I would expect that clients which have a very hard requirement (such as display) would pass the same bandwidth for both peak and average values. I would also presume that an implementation would sum up all average requests to get at the total requested bandwidth needed and use peak values for perhaps additional margin or so.
>
> [Krishna S] The explanation above makes sense to me given the implicit assumptions. Can any part of this be added to the documentation?

Ok, sure. I expect that different SoCs can differ in how they aggregate requests. I will add some more information about that.
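[Editorial sketch of the aggregation policy being described (sum the averages, take the max of the peaks). The parameter layout only approximates the framework's provider aggregate callback, so treat it as an illustration rather than the exact prototype.]

  /*
   * Sketch of a platform aggregation policy: sum the average bandwidth
   * requests and take the maximum of the peak requests, so that a large
   * peak_bw request also behaves as a floor for the interconnect.
   */
  static int example_aggregate(struct icc_node *node, u32 avg_bw, u32 peak_bw,
                               u32 *agg_avg, u32 *agg_peak)
  {
          *agg_avg += avg_bw;                     /* total expected traffic */
          *agg_peak = max(*agg_peak, peak_bw);    /* worst single burst / floor */

          return 0;
  }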
>>>> 3. In addition to peak_bw and avg_bw, can interconnects support a floor request on a clock? We need a floor request for clients which are affected by latency and not so much by bandwidth. For example, the CPU is more latency-sensitive than bandwidth-sensitive in some cases, so CPU clients set an EMC floor based on the current CPU frequency to satisfy a minimum latency need.
>>>
>>> What would be the difference between a floor request and a request for peak bandwidth corresponding to that floor frequency? Couldn't the CPU just register a regular bandwidth request to achieve the same goal? I mean, registering a peak bandwidth that meets the minimum requirements for the needed latency would ensure that bandwidth never goes below that value, so it would basically be a floor request. Perhaps this could also be a special case where peak bandwidth and average bandwidth are actually the same?
>>
>> So this depends on the aggregation. In the case above, if we take the max of all the peak_bw values, the peak_bw also operates as a floor request.
>>
>
> [Krishna S] Understood that we can salvage peak_bw and use it for floor purposes. But the name is very misleading if we end up using it for a floor. Hence the request for an explicit floor value; the platforms can then use either peak_bw or floor_bw or both.

I think this depends on the point of view. I was not able to come up with something more generic than this, but I am open to suggestions.

>>>> 4. Request to have tracing as a debug option. On every icc_set() call, print the path and the aggregated average bandwidth value.
>>>
>>> This could presumably be done with ftrace and would be, in my opinion, a good addition to this framework.
>>
>> This is a good idea!
>>
>>>> a. We also want to know what the request from every client is at a given instant, so that we can add test cases to ensure the EMC calculation code is doing the right thing. Automated tests can make use of this.
>>>
>>> Would a debugfs interface be useful for this? I could imagine that something similar to the common clock framework (a per-client file with current requests and an additional bandwidth_summary file with a list of all requests and perhaps a total) would work very well for this.
>>
>> There is already an interconnect_summary file, which lists all the requests from drivers with requested/aggregated values.
>
> Looks exactly like what I had in mind. Krishna, Sanjay, does that file look like something that we could use for testing? Would we need additional information in it?
>
> [Krishna S] Yes, if this exists, that is exactly what we need. If the requested value includes all the fields (client id, avg_bw, peak_bw, etc.) and the final value calculated by the provider, that is all the information we need.

Ok, great!

Thanks,
Georgi
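[Editorial sketch, for reference: how a latency-sensitive CPU client could express an EMC floor through the existing peak_bw parameter, assuming the max-of-peaks aggregation discussed above. The frequency-to-bandwidth helper and its scaling factor are made up; a real driver would derive the value from its own latency requirement.]

  /*
   * Sketch: express a latency floor as a peak bandwidth request. The
   * cpu_freq_to_floor_kbps() helper and its scaling are hypothetical.
   */
  static u32 cpu_freq_to_floor_kbps(unsigned long cpu_khz)
  {
          return cpu_khz / 4;     /* placeholder scaling, platform specific */
  }

  static int cpu_request_emc_floor(struct icc_path *path, unsigned long cpu_khz)
  {
          u32 floor = cpu_freq_to_floor_kbps(cpu_khz);

          /*
           * avg_bw = 0: the CPU is not asking for sustained bandwidth.
           * peak_bw = floor: with max-based aggregation this keeps the
           * interconnect from dropping below the required rate.
           */
          return icc_set(path, 0, floor);
  }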