Hello Sebastian,

On Fri, Nov 12, 2021 at 03:54:52PM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-11-10 11:45:57 [+0100], Uwe Kleine-König wrote:
> > recently I debugged a problem on an -rt enabled kernel. The relevant
> > part of the analysed trace looks as follows:
> >
> > napi/can0-10-360 [001] d...312 3565.642595: sched_pi_setprio: comm=candump pid=2182 oldprio=120 newprio=14
> > napi/can0-10-360 [001] d...212 3565.642619: sched_switch: prev_comm=napi/can0-10 prev_pid=360 prev_prio=14 prev_state=R ==> next_comm=cantest next_pid=915 next_prio=39
> > ....
> > rcuc/0-15 [000] d...212 3565.642633: sched_switch: prev_comm=rcuc/0 prev_pid=15 prev_prio=98 prev_state=R+ ==> next_comm=candump next_pid=2182 next_prio=14
> > candump-2182 [000] d...3.. 3565.642646: sched_pi_setprio: comm=candump pid=2182 oldprio=14 newprio=120
> >
> > So the napi/can0-10 wants to grab a mutex that candump is holding. So
> > candump's priority is bumped from 120 to 14.
> >
> > However the napi/can0-10 process (and a few others) are pinned to cpu #1
> > and cantest isn't allowed to run on that one. And so cpu #1 schedules a
> > lower prio task while candump still has to wait a moment before being
> > scheduled on cpu #0.
> >
> > I wonder if it would be sensible in such a case not only to increase the
> > importance of candump, but also to allow it to run on the cpu-set the
> > boosting process is allowed to run on until it releases the mutex.
> >
> > Would that make sense?
>
> Sounds like your problem could be solved by allowing candump to run on
> any CPU. Why not lift that restriction yourself?

I want to give two answers here:

 - There are some realtime requirements on this machine. To get the
   latency of the relevant userspace application down, cpu #1 is
   isolated and only runs the application (here in this test "cantest"),
   the can napi thread and the can irq thread. Do you want to suggest
   that this isn't a good idea?

 - Consider three processes A, B and C with increasing priorities (so C
   is the most important). If A holds a lock that C wants to grab, the
   kernel today already ensures (in the absence of cpu restrictions)
   that A is scheduled before B. In the presence of cpu restrictions
   this fails as this case shows: B and C are pinned to cpu #1, A must
   not run on cpu #1. Then it can happen that C waits for A but cpu #1
   schedules B even though C being blocked should be more important than
   to run B and so A should be run.

So I think while you are right that I could just allow candump to run on
cpu #1 here, this is a corner case where the priority inversion handling
isn't doing the right thing.

> From the trace, you have migration disabled for napi/can0. So that one
> can't be moved. If the lock, that is owned by candump, is a spinlock_t
> than candump has also migration disabled and can't be moved either.

Ah, I didn't know that holding a spinlock implies disabled migration. In
the !RT case this is obvious, with RT not so?! Then boosting not only
the priority but also the set of cpus the process can run on isn't as
effective as I expected.

Best regards and thanks for your response,
Uwe

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | https://www.pengutronix.de/ |
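
To make the A/B/C case above concrete, here is a toy model in Python. It is only a sketch and not kernel code: the Task class, the helpers effective_prio, allowed_cpus and pick, and the inherit_cpus switch are all made up for illustration, and larger numbers mean "more important" here (the opposite of the prio values in the trace). It models priority inheritance plus per-task cpu masks, and optionally the proposed idea of letting the lock owner also inherit the waiter's cpu set.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(eq=False)  # compare tasks by identity
class Task:
    name: str
    prio: int                            # toy convention: larger = more important
    cpus: FrozenSet[int]                 # cpus the task is allowed to run on
    blocked_on: Optional["Task"] = None  # owner of the lock this task waits for


def effective_prio(task, tasks):
    """Priority inheritance: an owner runs at least as high as its waiters."""
    waiters = [t for t in tasks if t.blocked_on is task]
    return max([task.prio] + [effective_prio(w, tasks) for w in waiters])


def allowed_cpus(task, tasks, inherit_cpus):
    """Own affinity, optionally widened by the affinity of blocked waiters."""
    cpus = set(task.cpus)
    if inherit_cpus:  # hypothetical extension, not existing kernel behaviour
        for w in tasks:
            if w.blocked_on is task:
                cpus |= allowed_cpus(w, tasks, inherit_cpus)
    return cpus


def pick(cpu, tasks, inherit_cpus=False):
    """Return the most important runnable task that may run on `cpu`."""
    runnable = [t for t in tasks
                if t.blocked_on is None
                and cpu in allowed_cpus(t, tasks, inherit_cpus)]
    return max(runnable, key=lambda t: effective_prio(t, tasks), default=None)


a = Task("A", prio=1, cpus=frozenset({0}))                # lock owner, not allowed on cpu 1
b = Task("B", prio=2, cpus=frozenset({1}))                # pinned to cpu 1
c = Task("C", prio=3, cpus=frozenset({1}), blocked_on=a)  # pinned to cpu 1, waits for A
tasks = [a, b, c]

print(effective_prio(a, tasks))                # 3: A is boosted to C's priority
print(pick(1, tasks).name)                     # B: cpu 1 runs B although C waits for A
print(pick(1, tasks, inherit_cpus=True).name)  # A: with inherited affinity cpu 1 runs A
```

The model ignores disabled migration, so it says nothing about the spinlock_t case mentioned above, where the owner could not be moved to cpu 1 anyway.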