Hi!
Julia Cartwright <julia@xxxxxx> writes:
Hello all-
On Wed, Sep 19, 2018 at 12:43:03PM -0700, Guenter Roeck wrote:
On Wed, Sep 19, 2018 at 08:46:19AM +0200, Steffen Trumtrar wrote:
> On Tue, Sep 18, 2018 at 06:46:15AM -0700, Guenter Roeck wrote:
[..]
> The problem I observe is that the watchdog is triggered because it
> doesn't get pinged. The ksoftirqd seems to be blocked although it runs
> at a much higher priority than the blocking userspace task.
>
Are you sure about that? The other email seemed to suggest that the
userspace task is running at higher priority.
Also: ksoftirqd is irrelevant on RT for the kernel watchdog thread. The
relevant thread is ktimersoftd, which is the thread responsible for
invoking hrtimer expiry functions, like what's being used for watchdogd.
[..]
Overall, we have a number of possibilities to consider:
- The kernel watchdog timer thread is not triggered at all under some
  circumstances, meaning it is not set properly. So far we have no real
  indication that this is the case (since the code works fine unless some
  userspace task takes all available CPU time).
What do you mean by "not triggered"? Do you mean woken up/activated from
a scheduling perspective? In the case I identified in my other email,
the watchdogd thread wakeup doesn't even occur, even when the periodic
ping timer expires, because ktimersoftd has been starved.
I suspect that's what's going on for Steffen, but am not yet sure.
- The watchdog device is closed. The kernel watchdog timer thread is
  starved and does not get to run. The question is what to do in this
  situation. In a real time system, this is almost always a fatal
  condition. Should the system really be kept alive in this situation?
Sometimes it's the right decision, sometimes it's not. The only sensible
thing to do is to allow the user to make the decision that's right for
their application needs by allowing the relative prioritization of
watchdogd and their application threads.
...which they can do now, but it's not effective on RT because of the
timer deferral through ktimersoftd.
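For illustration only, the userspace side of that relative prioritization
could look roughly like the sketch below. It is not taken from the patch
under discussion. The helper names prio_below() and set_fifo_prio() are
hypothetical, and the caller is assumed to have looked up the watchdogd
priority on the target system (for example with chrt -p) rather than
hard-coding it. As noted above, on RT this alone does not help as long
as the ping timer is deferred through ktimersoftd.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper: pick a SCHED_FIFO priority one step below the
 * given watchdog priority, clamped to the range the system reports as
 * valid. If watchdog_prio is already at (or below) the minimum, the
 * minimum is returned, so the result is then not strictly lower.
 */
static int prio_below(int watchdog_prio)
{
	int lo = sched_get_priority_min(SCHED_FIFO);
	int hi = sched_get_priority_max(SCHED_FIFO);
	int prio = watchdog_prio - 1;

	if (prio > hi)
		prio = hi;
	if (prio < lo)
		prio = lo;
	return prio;
}

/*
 * Hypothetical helper: move the task identified by pid (0 means the
 * calling thread) to SCHED_FIFO at the given priority. Returns 0 on
 * success or a negative errno value; this typically needs CAP_SYS_NICE
 * or a suitable RLIMIT_RTPRIO.
 */
static int set_fifo_prio(pid_t pid, int prio)
{
	struct sched_param sp;

	memset(&sp, 0, sizeof(sp));
	sp.sched_priority = prio;

	if (sched_setscheduler(pid, SCHED_FIFO, &sp) == -1)
		return -errno;
	return 0;
}
```

An application that wants watchdogd to win would call something like
set_fifo_prio(0, prio_below(wd_prio)) early in its busy thread, where
wd_prio is the value observed for watchdogd; one that prefers a reset
when it is starved would pick a priority above it instead.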
The solution, in my mind, and like I mentioned in my other email, is to
opt out of the ktimersoftd deferral mechanism. This requires some
tweaking of the kthread_worker bits to ensure safety in hardirq
context, but that seems straightforward. See below.
I just tested your patch and it works for me \o/
Thanks,
Steffen
--
Pengutronix e.K.                           | Steffen Trumtrar            |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |