Hi Song,
On Thu, Jan 28, 2021 at 10:12 AM chensong_2000@xxxxxx
<chensong_2000@xxxxxx> wrote:
Dear experts,
When i was running cyclictest with ltp(kernel 4.19.90 rt, arch64),
cyclictest got a big latecy(more than 10ms), the ftrace log shows that
modprobe called preempt_disable for a long time, which stops cyclictest
being switched.
This works as designed. Running a latency sensitive application has
some inherent rules. You need to stick to that rules, otherwise
excessive latencies are to be expected. The best documentation (I
believe) you can find on
https://wiki.linuxfoundation.org/realtime/start (ex.
https://rt.wiki.kernel.org/).
In short, running modprobe (loading kernel modules in general) is not
allowed. For a production system, you are supposed to set up the
machine first (loading the modules during boot or before starting the
application). The application itself is bound to allowed operations
only, otherwise it has to expect blocking and unbounded latencies.
here is a piece of log:
modprobe-4551 0...10 712.754529: preempt_disable:
caller=0xffff0001019900c8 parent=0xffff0001019900c8
...
modprobe-4551 0d..10 712.754589: irq_disable: caller=el1_irq+0x7c
parent=0xffff0001019900dc
modprobe-4551 0d.h10 712.754590: irq_handler_entry: irq=2
name=arch_timer
modprobe-4551 0d.h20 712.754591: hrtimer_cancel:
hrtimer=0xffff8020f4983d98
modprobe-4551 0d.h10 712.754591: hrtimer_expire_entry:
hrtimer=0xffff8020f4983d98 now=712720218245 function=hrtimer_wakeup/0x0
modprobe-4551 0d.h20 712.754592: sched_waking: comm=cyclictest
pid=2212 prio=9 target_cpu=000
modprobe-4551 0dNh30 712.754593: sched_wakeup: cyclictest:2212 [9]
success=1 CPU:000
modprobe-4551 0dNh10 712.754594: hrtimer_expire_exit:
hrtimer=0xffff8020f4983d98
modprobe-4551 0dNh10 712.754594: irq_handler_exit: irq=2
ret=handled
Again, this is perfectly valid.
Further i found
preemptirq_delay_test(kernel/trace/preemptirq_delay_test.c) can
reproduce it easily and got the similar ftrace log.
You can find endless ways how to introduce latency to the system.
That's why some operations are not allowed. For more details see the
mentioned documentation. ^^
The root cause i found is in el1_irq, it checks preempt count and
TIF_NEED_RESCHED before it goes through the path el1_preempt,
preempt_disable just stops it happening.
As it should.
Then i came up an idea and did an experiment, call
preempt_count_set(0); and set_tsk_need_resched(task); in hrtimer_wakeup
to meet the expectation in order to reschdule.
Sounds like demolishing the pillars of the bridge, yet still expecting
it not to fall. This will break your system for sure.
static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
{
struct hrtimer_sleeper *t =
container_of(timer, struct hrtimer_sleeper, timer);
struct task_struct *task = t->task;
t->task = NULL;
if (task)
wake_up_process(task);
if(preempt_count())
preempt_count_set(0);
set_tsk_need_resched(task);
return HRTIMER_NORESTART;
}
but, failed, system froze, no panic information.
Not surprised.
Is there anyone having the same problem? i would appreciate it if you
could share me some information in this case, many thanks.
Simply, no modprobe while running the cyclictest. The cyclictest is
supposed to be running on a machine as if the machine was in
production and definitely not tortured by arbitrary synthetic tests
(unless the test simulates a real world scenario which actually can
happen in production).
Again, all the modules should be preloaded before running cyclictest
or the production application.
Realtime system is a fragile beast and you have to handle it with special care.
Have a nice day,
Daniel
BR
Song Chen