On Mon, May 03, 2021 at 01:49:17PM -0400, Johannes Weiner wrote:
> 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
> introduced a race condition that corrupts internal psi state. This
> manifests as kernel warnings, sometimes followed by bogusly high IO
> pressure:
>
>   psi: task underflow! cpu=1 t=2 tasks=[0 0 0 0] clear=c set=0
>   (schedule() decreasing RUNNING and ONCPU, both of which are 0)
>
>   psi: inconsistent task state! task=2412744:systemd cpu=17 psi_flags=e clear=3 set=0
>   (cgroup_move_task() clearing MEMSTALL and IOWAIT, but task is MEMSTALL | RUNNING | ONCPU)
>
> What the offending commit does is batch the two psi callbacks in
> schedule() to reduce the number of cgroup tree updates. When prev is
> deactivated and removed from the runqueue, nothing is done in psi at
> first; when the task switch completes, TSK_RUNNING and TSK_IOWAIT are
> updated along with TSK_ONCPU.
>
> However, the deactivation and the task switch inside schedule() aren't
> atomic: pick_next_task() may drop the rq lock for load balancing. When
> this happens, cgroup_move_task() can run after the task has been
> physically dequeued, but while its psi updates are still pending. Since
> it looks at the task's scheduler state, it doesn't move everything to
> the new cgroup that the subsequent task switch is about to clear from
> it: cgroup_move_task() will leak the TSK_RUNNING count in the old
> cgroup, and psi_sched_switch() will underflow it in the new cgroup.
>
> A similar thing can happen for iowait. TSK_IOWAIT is usually set when
> a p->in_iowait task is dequeued, but again this update is deferred to
> the switch. cgroup_move_task() can see an unqueued p->in_iowait task
> and move a non-existent TSK_IOWAIT. This triggers the inconsistent
> task state warning, and causes a counter underflow that leads to
> permanent ghost IO pressure being reported.
>
> Fix this bug by making cgroup_move_task() use task->psi_flags instead
> of looking at the potentially mismatching scheduler state.
>
> [ We used the scheduler state historically in order to not rely on
>   task->psi_flags for anything but debugging. But that ship has sailed
>   anyway, and this is simpler and more robust.
>
>   We previously already batched TSK_ONCPU clearing with the
>   TSK_RUNNING update inside the deactivation call from schedule(). But
>   that ordering was safe and didn't result in TSK_ONCPU corruption:
>   unlike most places in the scheduler, cgroup_move_task() only checked
>   task_current() and handled TSK_ONCPU if the task was still queued. ]
>
> Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>

Thanks! Queued for sched/urgent.
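
[ Editor's note: for illustration, below is a rough sketch of the racy
  derivation the changelog describes, i.e. the shape the flag
  computation in cgroup_move_task() had before this fix, running under
  task_rq_lock() with task and rq as that function's locals. This is a
  paraphrase of the logic under discussion, not a verbatim quote of
  kernel/sched/psi.c. If schedule() has already dequeued the task but
  deferred its psi updates to the context switch, task_on_rq_queued()
  is false here: the still-accounted TSK_RUNNING is missed and never
  moved, while a not-yet-accounted TSK_IOWAIT can be picked up and
  bogusly migrated.

	/*
	 * Racy (pre-fix) sketch: reconstruct the flags to migrate
	 * from scheduler state. This disagrees with psi's own
	 * accounting whenever schedule() has dequeued the task but
	 * not yet run the deferred switch-time psi update.
	 */
	unsigned int task_flags = 0;

	if (task_on_rq_queued(task)) {
		task_flags = TSK_RUNNING;
		if (task_current(rq, task))
			task_flags |= TSK_ONCPU;
	} else if (task->in_iowait) {
		task_flags = TSK_IOWAIT;
	}

	if (task->in_memstall)
		task_flags |= TSK_MEMSTALL;
]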
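
[ Editor's note: and the shape of the fix as described, taking the
  flags straight from task->psi_flags, psi's own record of what it has
  accounted for this task, so the move can never disagree with a
  pending switch-time update. Again a sketch of the stated approach
  rather than the literal patch; to is cgroup_move_task()'s css_set
  parameter.

	/*
	 * Fixed sketch: clear the task's currently accounted psi
	 * flags out of the old cgroup, switch the task's cgroup,
	 * then set the same flags in the new one. Any update that
	 * schedule() deferred then lands on coherent state instead
	 * of leaking in one cgroup and underflowing in the other.
	 */
	unsigned int task_flags = task->psi_flags;

	if (task_flags)
		psi_task_change(task, task_flags, 0);

	rcu_assign_pointer(task->cgroups, to);

	if (task_flags)
		psi_task_change(task, 0, task_flags);
]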