Re: preempt_disable causes big latency

Daniel Vacek <neelx.g@xxxxxxxxx> · Thu, 28 Jan 2021 12:30:25 +0100

Hi Song,

On Thu, Jan 28, 2021 at 10:12 AM chensong_2000@xxxxxx
<chensong_2000@xxxxxx> wrote:
> Dear experts,
>
> When I was running cyclictest with ltp (kernel 4.19.90 rt, aarch64),
> cyclictest got a big latency (more than 10ms). The ftrace log shows that
> modprobe called preempt_disable for a long time, which stops cyclictest
> from being switched in.

This works as designed. Running a latency-sensitive application comes
with some inherent rules. You need to stick to those rules, otherwise
excessive latencies are to be expected. The best documentation (I
believe) can be found at
https://wiki.linuxfoundation.org/realtime/start (formerly
https://rt.wiki.kernel.org/).

In short, running modprobe (loading kernel modules in general) is not
allowed while the latency-sensitive application is running. For a
production system, you are supposed to set up the machine first
(loading the modules during boot or before starting the application).
The application itself is restricted to allowed operations only,
otherwise it has to expect blocking and unbounded latencies.
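
As a rough illustration of "set up the machine first", a launcher could
refuse to start the application until every module it needs is already
listed in /proc/modules. This is only a user-space sketch: the helper
name module_listed() is hypothetical, and it assumes the usual
/proc/modules layout where the module name is the first
space-delimited field of each line.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if `name` is the first field of some
 * line in `modules_text`, which is assumed to hold the contents of
 * /proc/modules ("name size refcount deps state address" per line).
 */
bool module_listed(const char *modules_text, const char *name)
{
    assert(modules_text != NULL && name != NULL);

    size_t len = strlen(name);
    const char *line = modules_text;

    while (*line != '\0') {
        /* Match the whole first field, not just a prefix of it. */
        if (len > 0 && strncmp(line, name, len) == 0 && line[len] == ' ')
            return true;

        /* Advance to the start of the next line, if there is one. */
        const char *nl = strchr(line, '\n');
        if (nl == NULL)
            break;
        line = nl + 1;
    }
    return false;
}
```

A launcher would read /proc/modules into a buffer once, call
module_listed() for each required module, and bail out before starting
the latency-sensitive part if any of them is missing.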

> here is a piece of log:
>
> modprobe-4551    0...10   712.754529: preempt_disable:
> caller=0xffff0001019900c8 parent=0xffff0001019900c8
> ...
> modprobe-4551    0d..10   712.754589: irq_disable: caller=el1_irq+0x7c
> parent=0xffff0001019900dc
> modprobe-4551    0d.h10   712.754590: irq_handler_entry:    irq=2
> name=arch_timer
> modprobe-4551    0d.h20   712.754591: hrtimer_cancel:
> hrtimer=0xffff8020f4983d98
> modprobe-4551    0d.h10   712.754591: hrtimer_expire_entry:
> hrtimer=0xffff8020f4983d98 now=712720218245 function=hrtimer_wakeup/0x0
> modprobe-4551    0d.h20   712.754592: sched_waking: comm=cyclictest
> pid=2212 prio=9 target_cpu=000
> modprobe-4551    0dNh30   712.754593: sched_wakeup: cyclictest:2212 [9]
> success=1 CPU:000
> modprobe-4551    0dNh10   712.754594: hrtimer_expire_exit:
> hrtimer=0xffff8020f4983d98
> modprobe-4551    0dNh10   712.754594: irq_handler_exit:     irq=2
> ret=handled

Again, this is perfectly valid.

> Further, I found that
> preemptirq_delay_test (kernel/trace/preemptirq_delay_test.c) can
> reproduce it easily and gives a similar ftrace log.

There are endless ways to introduce latency into a system.
That's why some operations are not allowed. For more details see the
documentation mentioned above.

> The root cause I found is in el1_irq: it checks the preempt count and
> TIF_NEED_RESCHED before it goes through the path el1_preempt, and
> preempt_disable just stops that from happening.

As it should.
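
To spell that out, here is a toy user-space model of the check described
above. It is not kernel code: the toy_* names and the struct are made up
for illustration, and the real counter also encodes softirq and hardirq
nesting. The point is that a wakeup inside a preempt-disabled section is
only deferred, and the matching preempt_enable() is where the
reschedule finally happens.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the per-task state consulted on return from an IRQ. */
struct toy_task {
    int preempt_count;  /* > 0 while inside a preempt-disabled section */
    bool need_resched;  /* stands in for TIF_NEED_RESCHED */
};

void toy_preempt_disable(struct toy_task *t)
{
    t->preempt_count++;
}

/* IRQ return path: preempt only if nothing has disabled preemption. */
bool toy_irq_exit_should_preempt(const struct toy_task *t)
{
    return t->preempt_count == 0 && t->need_resched;
}

/*
 * Leaving the section: returns true if a reschedule that was deferred
 * while preemption was disabled should run now.
 */
bool toy_preempt_enable(struct toy_task *t)
{
    assert(t->preempt_count > 0);  /* enable must pair with a disable */
    t->preempt_count--;
    return t->preempt_count == 0 && t->need_resched;
}
```

So the latency you measured is the length of the preempt-disabled
section, not a missed wakeup. The flag is set by the timer interrupt
and honoured as soon as the section ends.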

> Then I came up with an idea and did an experiment: call
> preempt_count_set(0); and set_tsk_need_resched(task); in hrtimer_wakeup
> to meet the expectation in order to reschedule.

Sounds like demolishing the pillars of the bridge, yet still expecting
it not to fall. This will break your system for sure.

> static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
> {
>      struct hrtimer_sleeper *t =
>          container_of(timer, struct hrtimer_sleeper, timer);
>      struct task_struct *task = t->task;
>
>      t->task = NULL;
>      if (task)
>          wake_up_process(task);
>
>      if(preempt_count())
>          preempt_count_set(0);
>      set_tsk_need_resched(task);
>
>      return HRTIMER_NORESTART;
> }
>
> but it failed: the system froze, with no panic information.

Not surprised.
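
A toy illustration of one reason why, again in user-space C with made-up
toy_* names. The preempt count is a nesting counter: every increment on
the way in is paired with a decrement on the way out, including the
hardirq accounting around the handler itself. If the handler forces the
counter to zero, those later decrements no longer balance. The offset
below is an illustrative value modelled on the mainline bit layout, so
treat it as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative hardirq increment; assumed, not taken from your tree. */
#define TOY_HARDIRQ_OFFSET 0x10000

struct toy_cpu {
    int preempt_count;
};

/*
 * Model one interrupt: account for hardirq entry, optionally zero the
 * counter inside the handler (the experiment), then account for exit.
 * Returns the counter value seen by the interrupted code afterwards.
 */
int toy_run_handler(struct toy_cpu *c, int zero_count_in_handler)
{
    assert(c != NULL);

    c->preempt_count += TOY_HARDIRQ_OFFSET;  /* irq entry */
    if (zero_count_in_handler)
        c->preempt_count = 0;                /* models preempt_count_set(0) */
    c->preempt_count -= TOY_HARDIRQ_OFFSET;  /* irq exit */

    return c->preempt_count;
}
```

Starting from a count of 1 (modprobe inside its preempt-disabled
section), the normal path hands back 1, while the experiment hands back
a negative value, and the interrupted code still has its own
preempt_enable() to run on top of that. Separately, the posted code
calls set_tsk_need_resched(task) outside the "if (task)" check, so it
would dereference NULL whenever task is NULL.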

> Is there anyone having the same problem? I would appreciate it if you
> could share some information on this case with me, many thanks.

Simply put, no modprobe while cyclictest is running. cyclictest is
supposed to run on a machine as if the machine were in production, and
definitely not tortured by arbitrary synthetic tests (unless the test
simulates a real-world scenario which can actually happen in
production).
Again, all the modules should be preloaded before running cyclictest
or the production application.

A realtime system is a fragile beast and you have to handle it with special care.

Have a nice day,
Daniel

> BR
>
> Song Chen