On Mon, May 1, 2023 at 11:21 AM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> On Mon, 1 May 2023 11:15:45 -0400
> Brian Hutchinson <b.hutchman@xxxxxxxxx> wrote:
>
> > Hey Steve, good to hear from you.
> >
> > On Mon, May 1, 2023 at 10:17 AM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> > >
> > > On Mon, 1 May 2023 00:42:03 -0400
> > > Brian Hutchinson <b.hutchman@xxxxxxxxx> wrote:
> > >
> > > > Hi,
> > > >
> > > > Using 5.10.69 kernel on a i.MX8 platform with isolated cpu's (isolcpus
> > > > on cmd line) with RT_RUNTIME_SHARE and wanting to know if (and who) is
> > > > hitting the RT Throttle limit. Since my kernel does not have a
> > > > tracepoint for this, I used gdb and disassembled kernel/sched/rt.c to
> > > > find the address where throttling gets set and then added a kprobe for
> > > > it.
> > >
> > > Where exactly did you add this kprobe?
> >
> > It was:
> > /sys/kernel/debug/tracing
> >
> > echo 'p 0xffff8000100a8e38 %x0' > kprobe_events
>
> I meant, where was the exact location that address pointed to in the code,
> not where in the user interface did you add it ;-)

D'oh! OK, sorry, my bad.

On my first attempt I put it here:

 915             if (rt_rq->rt_time || rt_rq->rt_nr_running)
 916                     idle = 0;
 917             raw_spin_unlock(&rt_rq->rt_runtime_lock);
 918         } else if (rt_rq->rt_nr_running) {
 919             idle = 0;
 920             if (!rt_rq_throttled(rt_rq))
 921                     enqueue = 1;
 922         }
 923         if (rt_rq->rt_throttled)
 924             throttled = 1;
 925
 926         if (enqueue)
 927             sched_rt_rq_enqueue(rt_rq);
 928         raw_spin_unlock(&rq->lock);
 929     }

at line 924. I'm not the most versed in A53 assembly, so I guessed which
instruction to use (it's always tricky with pipelining etc.), then cheated
and put a breakpoint on line 924 to confirm, and used that address.

Later I did the same thing here:

 964     if (rt_rq->rt_time > runtime) {
 965             struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
 966
 967             /*
 968              * Don't actually throttle groups that have no runtime assigned
 969              * but accrue some time due to boosting.
 970              */
 971             if (likely(rt_b->rt_runtime)) {
 972                     rt_rq->rt_throttled = 1;
 973                     printk_deferred_once("sched: RT throttling activated\n");
 974             } else {
 975                     /*
 976                      * In case we did anyway, make it go away,
 977                      * replenishment is a joke, since it will replenish us
 978                      * with exactly 0 ns.
 979                      */
 980                     rt_rq->rt_time = 0;
 981             }
 982
 983             if (rt_rq_throttled(rt_rq)) {
 984                     sched_rt_rq_dequeue(rt_rq);
 985                     return 1;
 986             }
 987     }

... and put a breakpoint on line 972 and used that address for the kprobe.
When I did this I got pretty much the same results as before, but the trace
records came in a lot more quickly (most were "idle"), and a kworker thread
showed up in the mix too.

I'm also not a scheduler expert, so I don't quite understand the difference
between the two places rt_throttled gets set (although it should be obvious
from the copious amount of comments ... not!). I just suspect the throttle
is kicking in, which chokes off the SCHED_OTHER OS stuff, and I'm trying to
figure out who's doing it and why.
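For what it's worth, the same probe can be expressed as symbol+offset rather
than a raw address (kprobe_events accepts something like
'p sched_rt_runtime_exceeded+0xNN', as long as the symbol is in
/proc/kallsyms and hasn't been inlined), or registered from a small module
so every hit also comes with a stack trace. Below is a minimal, untested
sketch of the module variant; THROTTLE_OFFSET is just a placeholder for the
offset of the rt_throttled store found from the disassembly:

/*
 * Untested sketch: kprobe on sched_rt_runtime_exceeded() by symbol plus
 * offset, logging the task that was running when the throttling branch
 * was hit. THROTTLE_OFFSET is a placeholder for the offset of the
 * "rt_rq->rt_throttled = 1" store found by disassembly.
 */
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/sched.h>
#include <linux/smp.h>

#define THROTTLE_OFFSET 0x0	/* placeholder offset */

static int throttle_pre(struct kprobe *p, struct pt_regs *regs)
{
	pr_info("RT throttle hit: comm=%s pid=%d cpu=%d\n",
		current->comm, current->pid, smp_processor_id());
	dump_stack();	/* shows the call path that reached the throttle */
	return 0;
}

static struct kprobe kp = {
	.symbol_name = "sched_rt_runtime_exceeded",
	.offset      = THROTTLE_OFFSET,
	.pre_handler = throttle_pre,
};

static int __init throttle_probe_init(void)
{
	return register_kprobe(&kp);
}

static void __exit throttle_probe_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(throttle_probe_init);
module_exit(throttle_probe_exit);
MODULE_LICENSE("GPL");

The comm/pid printed there should be the task that was actually being
charged when the throttle tripped, and the stack trace shows how it got
there (scheduler tick, period timer, etc.).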
>
> >
> > Your "Fun with Dynamic Kernel Tracing Events" talk in 2018 was cool
> > with all those examples but you never gave examples on how to do this
> > kind of stuff wink, wink, ha, ha.
> >
> > >
> > > >
> > > > When I look at the trace I'm seeing what looks like "idle" being
> > > > throttled in addition to other things. I "think" my probe is working
> > > > as when I look at the trace, the processes that show up look like they
> > > > have run over the 95% limit of one period (defaulted to 1 second) but
> > > > I'm confused as to why Idle shows up.
> > >
> > > Are you saying this because it's happening while idle is running?
> >
> > I don't know what I'm saying. Part of me posting was me questioning
> > if what I'm seeing is even valid ... but it makes more sense now that
> > you point out that it's a hard interrupt happening while idle. Now I
> > just have to figure out which interrupt.
> >
> > > >
> > > > I've not ran lttng trace capturing context switches yet to see what
> > > > other processes might be involved around the time the throttle kicks
> > > > in, at the moment I'm just trying to validate I'm going after the
> > > > problem and setting it up the right way.
> > > >
> > > > I did identify a patch that enables a tracepoint for the rt throttle
> > > > but it is for RT_GROUP_SCHED and we aren't using that at the moment.
> > > >
> > > > I suspect the application I'm trying to debug has some misbehaving
> > > > realtime processes (not using a rt patched kernel) that are being
> > > > throttled so I'm trying to identify them so they can be studied and
> > > > made to behave better.
> > > >
> > > > If anyone has a better idea or advice on how to go about this please
> > > > point me in the right direction.
> > > >
> > > > Below is a sample of a trace I captured using the kprobe I mentioned
> > > > above when rt_throttled=1. I don't quite understand how Idle can be 3
> > > > levels deep in preemption or why "idle" is even showing up.
> > > >
> > > > CPU's 1, 2 and 3 have been isolated and the application that appears
> > > > to be getting throttled is on core 2:
> > > >
> > > > # tracer: nop
> > > > #
> > > > # entries-in-buffer/entries-written: 592/592   #P:4
> > > > #
> > > > #                              _-----=> irqs-off
> > > > #                             / _----=> need-resched
> > > > #                            | / _---=> hardirq/softirq
> > > > #                            || / _--=> preempt-depth
> > > > #                            ||| /     delay
> > > > #           TASK-PID   CPU#  ||||   TIMESTAMP  FUNCTION
> > > > #              | |       |   ||||      |         |
> > > >           <idle>-0     [002] d.h3   508.964800: p_0x00000000223c0e95:
> > > > (sched_rt_period_timer+0x1f0/0x328) arg1=0x0
> > >
> > > Note the "d.h3", which means as the heading states, 'd' is interrupts are
> > > disabled, 'h' means it is running in hard interrupt context. That means,
> > > even though idle is the current task, an interrupt triggered, and this is
> > > running in interrupt context.
> >
> > As I said above, makes more sense why idle is showing up now that you
> > pointed that out. This is an app that was ported from a different OS
> > that used a single core arch and now that's its on multicore arch
> > (quad core A53's) with SMP Linux, I think there are some old school
> > disable interrupts/preempt critical section areas that are going to
> > need to be reworked ... which is one of the reasons I believe it was
> > necessary to pin the apps to an isolated core on the i.MX8 to get them
> > to even run.
> >
> > >
> > > I see it's running sched_rt_period_timer() which calls
> > > do_sched_rt_period_timer(), and if you look at that function, it does
> > > for_each_cpu(i, span), checking other CPUs to see if it should be throttled
> > > or not.
> >
> > ... about that. I screwed up and meant to say scheduler features is
> > set to NO_RT_RUNTIME_SHARE in the original post as I've read there is
> > some weirdness on "what" sched_rt_runtime_us means ... a limit on the
> > system as a whole (all cpu's) or each individual cpu. I "believe" the
> > way I'm running it is each individual cpu gets sched_rt_runtime_us for
> > realtime stuff.
>
> Correct. I've ran some tests on rt spinners where if their affinity is
> broad they bounce around CPUs and still maintain 100% CPU resource (but on
> different CPUs). But if you pin it to a single CPU, it will get throttled.

I don't "think" anything is moving around here unless the OS is going
against what it's been told to do. The CPUs are isolated on the command
line at boot, and the affinity of these processes is set when they start
up, so they should stay where they are put unless there is something going
on I don't know about. I think I've seen talk before that affinity is just
a "suggestion", which makes me think there are cases where something could
move, but if that's true I don't know what those scenarios are.
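To convince myself nothing is migrating, I'm thinking of having each pinned
task log its own mask right after it sets it, and spot-checking
Cpus_allowed_list in /proc/<pid>/status while the app runs. As far as I
understand, the affinity mask is a hard constraint rather than a hint, so
anything running off its core would point back at the app itself. A rough,
untested sketch, with core 2 used only as an example:

/*
 * Untested sketch: pin the calling thread to one CPU and read the mask
 * back, so each RT task can log where it is actually allowed to run.
 * The target CPU (2 here) is just an example.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static int pin_and_verify(int target_cpu)
{
	cpu_set_t set;
	int cpu;

	CPU_ZERO(&set);
	CPU_SET(target_cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return -1;
	}

	CPU_ZERO(&set);
	if (sched_getaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_getaffinity");
		return -1;
	}

	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			printf("pid %d allowed on cpu %d\n", getpid(), cpu);

	return 0;
}

int main(void)
{
	return pin_and_verify(2) ? 1 : 0;
}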
Regards,

Brian
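P.S. On the NO_RT_RUNTIME_SHARE point above: my reading is that each CPU
then gets sched_rt_runtime_us out of every sched_rt_period_us for
SCHED_FIFO/SCHED_RR work (950000 of 1000000 by default, i.e. the 95% limit
mentioned earlier), and -1 disables throttling entirely. A quick, untested
sketch for double-checking the knobs on the target:

/*
 * Rough sketch: print the global RT throttling knobs and the per-CPU
 * budget they imply when runtime is not shared between CPUs. A runtime
 * of -1 means RT throttling is disabled.
 */
#include <stdio.h>

static int read_long(const char *path, long *val)
{
	FILE *f = fopen(path, "r");
	int ret;

	if (!f)
		return -1;
	ret = (fscanf(f, "%ld", val) == 1) ? 0 : -1;
	fclose(f);
	return ret;
}

int main(void)
{
	long runtime, period;

	if (read_long("/proc/sys/kernel/sched_rt_runtime_us", &runtime) ||
	    read_long("/proc/sys/kernel/sched_rt_period_us", &period) ||
	    period <= 0) {
		fprintf(stderr, "could not read RT throttling knobs\n");
		return 1;
	}

	if (runtime < 0)
		printf("RT throttling disabled (sched_rt_runtime_us = -1)\n");
	else
		printf("RT budget: %ld us of every %ld us (%.1f%%) per CPU\n",
		       runtime, period, 100.0 * runtime / period);
	return 0;
}

Cat'ing the two /proc files obviously shows the same numbers; the only
point is having them logged next to the trace.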