On 2023/9/28 Peter Zijlstra wrote:
On Wed, Sep 13, 2023 at 04:24:24PM +0800, Hao Jia wrote:
Igor Raits and Bagas Sanjaya report a RQCF_ACT_SKIP leak warning.
Link: https://lore.kernel.org/all/a5dd536d-041a-2ce9-f4b7-64d8d85c86dc@xxxxxxxxx
Commit ebb83d84e49b54 ("sched/core: Avoid multiple
calling update_rq_clock() in __cfsb_csd_unthrottle()")
add RQCF_ACT_SKIP leak warning in rq_clock_start_loop_update().
But this warning is inaccurate and may be triggered
incorrectly in the following situations:
CPU0 CPU1
__schedule()
*rq->clock_update_flags <<= 1;* unregister_fair_sched_group()
pick_next_task_fair+0x4a/0x410 destroy_cfs_bandwidth()
newidle_balance+0x115/0x3e0 for_each_possible_cpu(i) *i=0*
rq_unpin_lock(this_rq, rf) __cfsb_csd_unthrottle()
if (rq->clock_update_flags > RQCF_ACT_SKIP)
rf->clock_update_flags = RQCF_UPDATED;
so that preserves all flags, but only stores UPDATED.
raw_spin_rq_unlock(this_rq)
rq_lock(*CPU0_rq*, &rf)
rq_pin_lock()
rq->clock_update_flags &= (REQ_SKIP|ACT_SKIP);
rf->clock_update_flags = 0;
IOW, we preserve ACT_SKIP from CPU0
rq_clock_start_loop_update()
rq->clock_update_flags & RQCF_ACT_SKIP <--
And go SPLAT
raw_spin_rq_lock(this_rq)
rq_repin_lock()
rq->clock_update_flags |= rf->clock_update_flags;
which restores UPDATED, even though in reality time could have moved on
quite significantly.
Anyway....
the purpose of ACT_SKIP is to skip the update (clue in name etc), but
the update is very early in __schedule(), but we clear *_SKIP very late,
causing it to span that gap above.
Going by the commits that put it there, the thinking was to clear
clock_skip_update before unlock, but AFAICT we can clear SKIP flags
right after the update_rq_clock() we're wanting to skip, no?
Thanks for your review, and I am very sorry to reply to you so late. I
just came back from a long vacation.
That is, would not something like the below make more sense?
If we understand correctly, this may not work.
After applying this patch, the following situation will trigger the
rq->clock_update_flags < RQCF_ACT_SKIP warning.
If rq_clock_skip_update() is called before __schedule(), so
RQCF_REQ_SKIP of rq->clock_update_flags is set.
__schedule() {
rq_lock(rq, &rf); [rq->clock_update_flags is RQCF_REQ_SKIP]
rq->clock_update_flags <<= 1;
update_rq_clock(rq); [rq->clock_update_flags is RQCF_ACT_SKIP]
+ rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
* At this time, rq->clock_update_flags = 0; *
pick_next_task_fair
set_next_entity
update_load_avg
assert_clock_updated() <---
}
assert_clock_updated() will determine whether rq->clock_update_flags is
less than RQCF_ACT_SKIP. If we clear RQCF_ACT_SKIP prematurely, this
assert may be triggered later.
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d8fd29d66b24..bfd2ab4b95da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5357,8 +5357,6 @@ context_switch(struct rq *rq, struct task_struct *prev,
/* switch_mm_cid() requires the memory barriers above. */
switch_mm_cid(rq, prev, next);
- rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
-
prepare_lock_switch(rq, next, rf);
/* Here we just switch the register state and the stack. */
@@ -6596,6 +6594,8 @@ static void __sched notrace __schedule(unsigned int sched_mode)
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
+ rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
+
switch_count = &prev->nivcsw;
@@ -6675,8 +6675,6 @@ static void __sched notrace __schedule(unsigned int sched_mode)
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
} else {
- rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
-
rq_unpin_lock(rq, &rf);
__balance_callbacks(rq);
raw_spin_rq_unlock_irq(rq);