On 1/29/20 9:15 PM, Mathieu Poirier wrote: > On Mon, Jan 27, 2020 at 10:46:05AM +0100, Arnaud POULIQUEN wrote: >> >> >> On 1/24/20 7:44 PM, Mathieu Poirier wrote: >>> On Thu, 23 Jan 2020 at 10:49, Arnaud POULIQUEN <arnaud.pouliquen@xxxxxx> wrote: >>>> >>>> Hi Bjorn, Mathieu >>>> >>>> On 1/23/20 6:15 PM, Bjorn Andersson wrote: >>>>> On Thu 23 Jan 09:01 PST 2020, Mathieu Poirier wrote: >>>>> >>>>>> On Wed, 22 Jan 2020 at 12:40, Bjorn Andersson >>>>>> <bjorn.andersson@xxxxxxxxxx> wrote: >>>>>>> >>>>>>> On Fri 10 Jan 13:28 PST 2020, Mathieu Poirier wrote: >>>>>>> >>>>>>>> On Thu, Dec 26, 2019 at 09:32:14PM -0800, Bjorn Andersson wrote: >>>>>>>>> Add a common panic handler that invokes a stop request and sleep enough >>>>>>>>> to let the remoteproc flush it's caches etc in order to aid post mortem >>>>>>>>> debugging. >>>>>>>>> >>>>>>>>> Signed-off-by: Bjorn Andersson <bjorn.andersson@xxxxxxxxxx> >>>>>>>>> --- >>>>>>>>> >>>>>>>>> Changes since v1: >>>>>>>>> - None >>>>>>>>> >>>>>>>>> drivers/remoteproc/qcom_q6v5.c | 19 +++++++++++++++++++ >>>>>>>>> drivers/remoteproc/qcom_q6v5.h | 1 + >>>>>>>>> 2 files changed, 20 insertions(+) >>>>>>>>> >>>>>>>>> diff --git a/drivers/remoteproc/qcom_q6v5.c b/drivers/remoteproc/qcom_q6v5.c >>>>>>>>> index cb0f4a0be032..17167c980e02 100644 >>>>>>>>> --- a/drivers/remoteproc/qcom_q6v5.c >>>>>>>>> +++ b/drivers/remoteproc/qcom_q6v5.c >>>>>>>>> @@ -6,6 +6,7 @@ >>>>>>>>> * Copyright (C) 2014 Sony Mobile Communications AB >>>>>>>>> * Copyright (c) 2012-2013, The Linux Foundation. All rights reserved. >>>>>>>>> */ >>>>>>>>> +#include <linux/delay.h> >>>>>>>>> #include <linux/kernel.h> >>>>>>>>> #include <linux/platform_device.h> >>>>>>>>> #include <linux/interrupt.h> >>>>>>>>> @@ -15,6 +16,8 @@ >>>>>>>>> #include <linux/remoteproc.h> >>>>>>>>> #include "qcom_q6v5.h" >>>>>>>>> >>>>>>>>> +#define Q6V5_PANIC_DELAY_MS 200 >>>>>>>>> + >>>>>>>>> /** >>>>>>>>> * qcom_q6v5_prepare() - reinitialize the qcom_q6v5 context before start >>>>>>>>> * @q6v5: reference to qcom_q6v5 context to be reinitialized >>>>>>>>> @@ -162,6 +165,22 @@ int qcom_q6v5_request_stop(struct qcom_q6v5 *q6v5) >>>>>>>>> } >>>>>>>>> EXPORT_SYMBOL_GPL(qcom_q6v5_request_stop); >>>>>>>>> >>>>>>>>> +/** >>>>>>>>> + * qcom_q6v5_panic() - panic handler to invoke a stop on the remote >>>>>>>>> + * @q6v5: reference to qcom_q6v5 context >>>>>>>>> + * >>>>>>>>> + * Set the stop bit and sleep in order to allow the remote processor to flush >>>>>>>>> + * its caches etc for post mortem debugging. >>>>>>>>> + */ >>>>>>>>> +void qcom_q6v5_panic(struct qcom_q6v5 *q6v5) >>>>>>>>> +{ >>>>>>>>> + qcom_smem_state_update_bits(q6v5->state, >>>>>>>>> + BIT(q6v5->stop_bit), BIT(q6v5->stop_bit)); >>>>>>>>> + >>>>>>>>> + mdelay(Q6V5_PANIC_DELAY_MS); >>>>>>>> >>>>>>>> I really wonder if the delay should be part of the remoteproc core and >>>>>>>> configurable via device tree. Wanting the remote processor to flush its caches >>>>>>>> is likely something other vendors will want when dealing with a kernel panic. >>>>>>>> It would be nice to see if other people have an opinion on this topic. If not >>>>>>>> then we can keep the delay here and move it to the core if need be. >>>>>>>> >>>>>>> >>>>>>> I gave this some more thought and what we're trying to achieve is to >>>>>>> signal the remote processors about the panic and then give them time to >>>>>>> react, but per the proposal (and Qualcomm downstream iirc) we will do >>>>>>> this for each remote processor, one by one. >>>>>>> >>>>>>> So in the typical case of a Qualcomm platform with 4-5 remoteprocs we'll >>>>>>> end up giving the first one a whole second to react and the last one >>>>>>> "only" 200ms. >>>>>>> >>>>>>> Moving the delay to the core by iterating over rproc_list calling >>>>>>> panic() and then delaying would be cleaner imo. >>>>>> >>>>>> I agree. >>>>>> >>>>>>> >>>>>>> It might be nice to make this configurable in DT, but I agree that it >>>>>>> would be nice to hear from others if this would be useful. >>>>>> >>>>>> I think the delay has to be configurable via DT if we move this to the >>>>>> core. The binding can be optional and default to 200ms if not >>>>>> present. >>>>>> >>>>> >>>>> How about I make the panic() return the required delay and then we let >>>>> the core sleep for MAX() of the returned durations? >>> >>> I like it. >>> >>>> That way the default >>>>> is still a property of the remoteproc drivers - and 200ms seems rather >>>>> arbitrary to put in the core, even as a default. >>>> >>>> I agree with Bjorn, the delay should be provided by the platform. >>>> But in this case i wonder if it is simpler to just let the platform take care it? >>> >>> If I understand you correctly, that is what Bjorn's original >>> implementation was doing and it had drawbacks. >> Yes, >> Please tell me if i missed something, the only drawback seems mentioned is the accumulative delay. > > Yes, that is correct. > >> Could you elaborate how to implement the delay in remote proc core for multi rproc instance. >> Here is my view: >> To optimize the delay it would probably be necessary to compute: >> - the delay based on an initial date, >> - the delay requested by each rproc instance, >> - the delay elapsed in each rproc panic ops. >> Feasible but not straight forward... >> So I suppose that you are thinking about a solution based on the store of the max delay that would be applied after last panic() return? > > Yes > >> anyway, how do you determine the last rproc instance? seems that a prerequisite would be that the panic ops is mandatory... > > Each ->panic() should return the amount of time to way or 0 if no delay is > required. If an rpoc doesn't implement ->panic() then it is treated as 0. > From there wait for the maximum time that was collected. > > It would be possible to do something more complicated like taking timestamps > everytime a ->panic() returns and optimize the time to wait for but that may be > for a future set. The first implementation could go with an simple heuristic as > detailed above. Seems reasonable. A last point. i don't know if the case is realistic so i prefer to mention it: we can imagine that a rproc platform driver already manages a panic (for instance a video decoder driver that uses a coprocessor). In this case there is a risk that the rproc is remove during the panic. Depending on the panic sequence ordering this could generate a side effect (no delay as last rproc panic ops could be never called). But seems not too tricky to take it into account in remoteproc core. > >> >> I'm not familiar with panic mechanism, but how panic ops are scheduled in SMP? Does panics ops would be treated in parallel (using msleep instead of mdelay)? >> In this case delays could not be cumulative... > > The processor that triggered the panic sequentially runs the notifier registered > with the panic_notifier_list. Other processors are instructed to take > themselves offline. As such there won't be multiple ->panic() running > concurrently. Thank you for that explanation! Regards, Arnaud > >> >>> >>>> For instance for stm32mp1 the stop corresponds to the reset on the remote processor core. To inform the coprocessor about an imminent shutdown we use a signal relying on a mailbox (cf. stm32_rproc_stop). >>>> In this case we would need a delay between the signal and the reset, but not after (no cache management). >>> >>> Here I believe you are referring to the upper limit of 500ms that is >>> needed for the mbox_send_message() in stm32_rproc_stop() to complete. >>> Since that is a blocking call I think it would fit with Bjorn's >>> proposal above if a value of '0' is returned by rproc->ops->panic(). >>> That would mean no further delays are needed (because the blocking >>> mbox_send_message() would have done the job already). Let me know if >>> I'm in the weeds. >> Yes you are :), this is what i thought, if delay implemented in core. > > Not sure I understand your last reply but I _think_ we are saying the same > thing. > >> >> Regards, >> Arnaud >> >>> >>>> >>>> Regards, >>>> Arnaud >>>>> >>>>> Regards, >>>>> Bjorn >>>>>