Re: [RFC 10/12] cgroup/drm: Introduce weight based drm cgroup control

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Fri, 27 Jan 2023 13:31:54 +0000

On 27/01/2023 13:01, Michal Koutný wrote:
On Thu, Jan 12, 2023 at 04:56:07PM +0000, Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:
+static int drmcs_can_attach(struct cgroup_taskset *tset)
+{
+	int ret;
+
+	/*
+	 * As processes are getting moved between groups we need to ensure
+	 * both that the old group does not see a sudden downward jump in the
+	 * GPU utilisation, and that the new group does not see a sudden jump
+	 * up with all the GPU time clients belonging to the migrated process
+	 * have accumulated.
+	 *
+	 * To achieve that we suspend the scanner until the migration is
+	 * completed where the resume at the end ensures both groups start
+	 * observing GPU utilisation from a reset state.
+	 */
+
+	ret = mutex_lock_interruptible(&drmcg_mutex);
+	if (ret)
+		return ret;
+	start_suspend_scanning();
+	mutex_unlock(&drmcg_mutex);
+
+	finish_suspend_scanning();

Here's scanning suspension, communicated via

	root_drmcs.scanning_suspended = true;
	root_drmcs.suspended_period_us = root_drmcs.period_us;
	root_drmcs.period_us = 0;

but I don't see those used in scan_worker() and the scanning traversal
can apparently run concurrently with a task migration.

I think you missed the finish_suspend_scanning() part:

	if (root_drmcs.suspended_period_us)
		cancel_delayed_work_sync(&root_drmcs.scan_work);

So if scanning was in progress migration will wait until it finishes. 
And re-start only when migration is done (drmcs_attach), or it failed 
(drmcs_cancel_attach).

Not claiming I did not miss something because I was totally new with 
cgroup internals when I started working on this. So it is definitely 
useful to have more eyes looking.

[...]
+static bool
+__start_scanning(struct drm_cgroup_state *root, unsigned int period_us)
[...]
+	css_for_each_descendant_post(node, &root->css) {
[...]
+		active = drmcs_get_active_time_us(drmcs);
+		if (period_us && active > drmcs->prev_active_us)
+			drmcs->active_us += active - drmcs->prev_active_us;
+		drmcs->prev_active_us = active;

drmcs_get_active_time_us() could count a task's contribution here,
the task would migrate to a different drmcs,
and it'd be counted 2nd time.

Lets see.. __start_scanning() can be called from the worker, so max one 
instance at a time, no issue.

Then from resume scanning, so it is guaranteed worker is not running and 
can't restart since mutex guards the re-start.

Finally from drmcs_write_period_us() - yes there __start_scanning() can 
race with it being invoked by the worker - oops! However.. this is just 
a debugging aid as the cover letter explains. This file is not intended 
to be present in the final version, rather as per earlier discussion 
with Tejun the idea is to only have boot time option to control the 
functionality (enable/disable or period).

I will nevertheless try to fix this race up for the next posting to 
avoid further confusion!

Regards,

Tvrtko