On Thu, Oct 13, 2022 at 05:02:15PM +0100, Cristian Marussi wrote:
> On Thu, Oct 13, 2022 at 03:05:43PM +0800, YaxiongTian wrote:
> > Hi Cristian
> >
> > There may be a problem with my qq email client; I don't see my mail in
> > the community, so I had to switch to my Outlook email. Forgive me if
> > you've received multiple emails.
>
> No worries.
>
> > > Problem is anyway, as you said, you'll have to pick this timeout from
> > > the related transport scmi_desc (even if as of now the max_rx_timeout
> > > for all existent shared mem transports is the same..) and this means
> > > anyway adding more complexity to the chain of calls just to print a
> > > warn of some kind in a rare error-situation from which you cannot
> > > recover anyway.
> >
> > Yes, monitoring this timeout does add more complexity. For system
> > stability, the safest thing to do is to abort the transmission, but
> > that costs performance due to the added complexity in such an unusual
> > situation.
> >
> > > Due to other unrelated discussions, I was starting to think about
> > > exposing some debug-only (Kconfig dependent) SCMI stats like timeouts,
> > > errors, unexpected/OoO/late_replies in order to ease the debug and
> > > monitoring of the health of a running SCMI stack: maybe this could be
> > > a place where to flag these FW issues without changing the spinloop
> > > above (or to add the kind of timeout you mentioned but only when some
> > > sort of CONFIG_SCMI_DEBUG is enabled...)...still to fully think it
> > > through, though.
> >
> > I think it should actively report a warn or err rather than have the
> > user query the information manually (i.e. the debugfs way), because
> > during system startup/S1/S3/S4 the user cannot query this flag in the
> > FW; they need to get the "stuck" message immediately.
>
> Looking more closely at this, I experimented a bit with an SCMI stack
> based on the mailbox transport in which I had forcefully set the
> spin_until_cond() to spin forever.
>
> Even though on a normal SCMI system, when the SCMI stack fails at boot,
> the system is supposed to boot anyway (maybe slower), this particular
> failure in the TX path led indeed to a system that does not boot at all
> and spits out an infinite sequence of:
>
> [ 2924.499486] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 2924.505596] rcu:     2-...0: (0 ticks this GP) idle=1be4/1/0x4000000000000000 softirq=50/50 fqs=364757
> [ 2924.514672]  (detected by 4, t=730678 jiffies, g=-1119, q=134 ncpus=6)
> [ 2924.521215] Task dump for CPU 2:
> [ 2924.524445] task:kworker/u12:0   state:R  running task     stack:    0 pid:    9 ppid:     2 flags:0x0000000a
> [ 2924.534391] Workqueue: events_unbound deferred_probe_work_func
> [ 2924.540244] Call trace:
> [ 2924.542691]  __switch_to+0xe4/0x1b8
> [ 2924.546189]  deferred_probe_work_func+0xa4/0xf8
> [ 2924.550731]  process_one_work+0x208/0x480
> [ 2924.554754]  worker_thread+0x230/0x428
> [ 2924.558514]  kthread+0x114/0x120
> [ 2924.561752]  ret_from_fork+0x10/0x20
>
> I imagine this is the annoying thing you want to avoid.
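
For reference, spin_until_cond() (as defined in <linux/processor.h>,
quoted here roughly, from memory) is just a cpu_relax()-style busy-loop
with no scheduling point and no timeout:

#define spin_until_cond(cond)					\
do {								\
	if (unlikely(!(cond))) {				\
		spin_begin();					\
		do {						\
			spin_cpu_relax();			\
		} while (!(cond));				\
		spin_end();					\
	}							\
} while (0)

So a channel that is never freed keeps the calling kworker spinning on
that CPU forever, which is exactly what the RCU stall above is reporting;
any bound on the wait has to be supplied by the caller, as in the
experiment quoted below.
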
>
> So experimenting a bit with a patch similar to yours (ignoring the
> timeout config issues and using the static cnt to temporarily get the
> SCMI transport stuck and then revive it):
>
> ------>8-----
> diff --git a/drivers/firmware/arm_scmi/shmem.c b/drivers/firmware/arm_scmi/shmem.c
> index 0e3eaea5d852..6dde669abd03 100644
> --- a/drivers/firmware/arm_scmi/shmem.c
> +++ b/drivers/firmware/arm_scmi/shmem.c
> @@ -8,6 +8,7 @@
>  #include <linux/io.h>
> +#include <linux/ktime.h>
>  #include <linux/processor.h>
>  #include <linux/types.h>
>
>  #include "common.h"
>
> @@ -29,17 +30,28 @@ struct scmi_shared_mem {
>  	u8 msg_payload[];
>  };
>
> +static int cnt = 50;
>  void shmem_tx_prepare(struct scmi_shared_mem __iomem *shmem,
>  		      struct scmi_xfer *xfer)
>  {
> +	ktime_t stop;
> +
>  	/*
>  	 * Ideally channel must be free by now unless OS timeout last
>  	 * request and platform continued to process the same, wait
>  	 * until it releases the shared memory, otherwise we may endup
>  	 * overwriting its response with new message payload or vice-versa
>  	 */
> -	spin_until_cond(ioread32(&shmem->channel_status) &
> -			SCMI_SHMEM_CHAN_STAT_CHANNEL_FREE);
> +	stop = ktime_add_ms(ktime_get(), 35);
> +	spin_until_cond(((--cnt > 0) && ioread32(&shmem->channel_status) &
> +			 SCMI_SHMEM_CHAN_STAT_CHANNEL_FREE) ||
> +			ktime_after(ktime_get(), stop));
> +	if (ktime_after(ktime_get(), stop)) {
> +		pr_warn_once("TX Timeout !\n");
> +		cnt = 10;
> +		return;
> +	}
> +
>  	/* Mark channel busy + clear error */
>  	iowrite32(0x0, &shmem->channel_status);
>  	iowrite32(xfer->hdr.poll_completion ? 0 : SCMI_SHMEM_FLAG_INTR_ENABLED,
> ----8<-------------
>
> With the above I did in fact get a system that could boot even with a
> failing/stuck SCMI transport, but, as expected, the SCMI stack
> functionality was totally compromised after the first timeout, with no
> possibility to recover.
>
> Moreover, I was thinking about what could happen if, later on after
> boot, the SCMI server ended up in some funny/hogged condition so that
> it was, only temporarily, a bit slower to answer and release the
> channel: with the current implementation the kernel agent will just
> spin a little bit longer waiting for the channel to be freed and then
> everything carries on without much hassle, while with this possible new
> timing-out solution we could end up dropping that transmission and
> compromising the whole transport functionality for all the subsequent
> transmissions.
>
> So, again, I'm not sure it is worth making such a change even for debug
> purposes, given that in the worst scenario above you end up with a
> system stuck at boot but for which the SCMI stack is anyway compromised
> and where the only real solution is fixing the server FW.
>
> I'll ask Sudeep his thoughts about the possible hang.
>

I am fine with the patch as it provides more info on what is going wrong
in the system. Please post the patch separately with all the
info/background.

--
Regards,
Sudeep