Hi Reinette, Fenghua,

Subject nit: I think 'use IPI instead of task_work() to update PQR_ASSOC_MSR'
conveys the guts of this change more quickly!

On 03/12/2020 23:25, Reinette Chatre wrote:
> From: Fenghua Yu <fenghua.yu@xxxxxxxxx>
>
> Currently when moving a task to a resource group the PQR_ASSOC MSR
> is updated with the new closid and rmid in an added task callback.
> If the task is running the work is run as soon as possible. If the
> task is not running the work is executed later
> in the kernel exit path when the kernel returns to the task again.

"kernel exit" makes me think of user-space... is it enough to just say:
"by __switch_to() when task is next run"?

> Updating the PQR_ASSOC MSR as soon as possible on the CPU a moved task
> is running is the right thing to do. Queueing work for a task that is
> not running is unnecessary (the PQR_ASSOC MSR is already updated when the
> task is scheduled in) and causing system resource waste with the way in
> which it is implemented: Work to update the PQR_ASSOC register is queued
> every time the user writes a task id to the "tasks" file, even if the task
> already belongs to the resource group. This could result in multiple pending
> work items associated with a single task even if they are all identical and
> even though only a single update with most recent values is needed.
> Specifically, even if a task is moved between different resource groups
> while it is sleeping then it is only the last move that is relevant but
> yet a work item is queued during each move.
> This unnecessary queueing of work items could result in significant system
> resource waste, especially on tasks sleeping for a long time. For example,
> as demonstrated by Shakeel Butt in [1] writing the same task id to the
> "tasks" file can quickly consume significant memory. The same problem
> (wasted system resources) occurs when moving a task between different
> resource groups.
>
> As pointed out by Valentin Schneider in [2] there is an additional issue with
> the way in which the queueing of work is done in that the task_struct update
> is currently done after the work is queued, resulting in a race with the
> register update possibly done before the data needed by the update is available.
>
> To solve these issues, the PQR_ASSOC MSR is updated in a synchronous way
> right after the new closid and rmid are ready during the task movement,
> only if the task is running. If a moved task is not running nothing is
> done since the PQR_ASSOC MSR will be updated next time the task is scheduled.
> This is the same way used to update the register when tasks are moved as
> part of resource group removal.

(as t->on_cpu is already used...)

Reviewed-by: James Morse <james.morse@xxxxxxx>

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 68db7d2dec8f..9d62f1fadcc3 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -525,6 +525,16 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)

> +static void update_task_closid_rmid(struct task_struct *t)
> +{
> +	int cpu;
> +
> +	if (task_on_cpu(t, &cpu))
> +		smp_call_function_single(cpu, _update_task_closid_rmid, t, 1);

I think:
|	if (task_curr(t))
|		smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);

here would make for an easier backport as it doesn't depend on the previous
patch.

> +}

[...]
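FWIW, with task_curr()/task_cpu() the whole helper would shrink to roughly
the below. Just a sketch of what I mean, untested; it assumes your
_update_task_closid_rmid() IPI callback from this patch stays as-is:

|	static void update_task_closid_rmid(struct task_struct *t)
|	{
|		/*
|		 * Interrupt the CPU the task is currently running on so it
|		 * re-loads PQR_ASSOC. A task that isn't running will pick up
|		 * the new closid/rmid when it is next scheduled in.
|		 */
|		if (task_curr(t))
|			smp_call_function_single(task_cpu(t),
|						 _update_task_closid_rmid, t, 1);
|	}

The `int cpu` local goes away too, since task_cpu() gives you the CPU directly.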
>  static int __rdtgroup_move_task(struct task_struct *tsk,
>  				struct rdtgroup *rdtgrp)
>  {
> +	if (rdtgrp->type == RDTCTRL_GROUP) {
> +		tsk->closid = rdtgrp->closid;
> +		tsk->rmid = rdtgrp->mon.rmid;
> +	} else if (rdtgrp->type == RDTMON_GROUP) {

[...]

> +	} else {
> +		rdt_last_cmd_puts("Invalid resource group type\n");
> +		return -EINVAL;

Wouldn't this be a kernel bug? I'd have thought there would be a
WARN_ON_ONCE() here to make it clear this isn't user-space's fault!

>  	}
> -	return ret;
> +
> +	/*
> +	 * By now, the task's closid and rmid are set. If the task is current
> +	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
> +	 * group go into effect. If the task is not current, the MSR will be
> +	 * updated when the task is scheduled in.
> +	 */
> +	update_task_closid_rmid(tsk);
> +
> +	return 0;
>  }

Thanks,

James
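P.S. on the "Invalid resource group type" branch above: the WARN_ON_ONCE()
I had in mind is roughly the below. A sketch only, untested; whether you keep
reporting the error to user-space via rdt_last_cmd_puts() as well is up to you:

|	} else {
|		/* Neither a control nor a monitor group: kernel bug, not a user error */
|		WARN_ON_ONCE(1);
|		rdt_last_cmd_puts("Invalid resource group type\n");
|		return -EINVAL;
|	}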