Hello, Shaohua.

On Wed, Aug 23, 2017 at 11:15:15AM -0700, Shaohua Li wrote:
> loop block device handles IO in a separate thread. The actual IO
> dispatched isn't cloned from the IO loop device received, so the
> dispatched IO loses the cgroup context.

Ah, right.  Thanks for spotting this.

> I'm ignoring buffer IO case now, which is quite complicated. Making the
> loop thread aware of cgroup context doesn't really help. The loop
> device only writes to a single file. In current writeback cgroup
> implementation, the file can only belong to one cgroup.

Yeah, writeback tracks the most active cgroup and associates writeback
ios with that cgroup.  For buffered loop devices, I think it'd be fine
to make the loop driver forward the cgroup association and let the
writeback layer determine the matching association as usual.

> For direct IO case, we could workaround the issue in theory. For
> example, say we assign cgroup1 5M/s BW for loop device and cgroup2
> 10M/s. We can create a special cgroup for loop thread and assign at
> least 15M/s for the underlayer disk. In this way, we correctly throttle
> the two cgroups. But this is tricky to setup.
>
> This patch tries to address the issue. When loop thread is handling IO,
> it declares the IO is on behalf of the original task, then in block IO
> we use the original task to find cgroup. The concept probably works for
> other scenarios too, but right now I don't make it generic yet.

The overall approach looks sound to me.  Some comments below.

> @@ -2058,7 +2058,10 @@ int bio_associate_current(struct bio *bio)
>
>  	get_io_context_active(ioc);
>  	bio->bi_ioc = ioc;
> -	bio->bi_css = task_get_css(current, io_cgrp_id);
> +	if (current->cgroup_task)
> +		bio->bi_css = task_get_css(current->cgroup_task, io_cgrp_id);
> +	else
> +		bio->bi_css = task_get_css(current, io_cgrp_id);

Do we need a full pointer field for this?  I think we should be able to
mark lo writers with a flag (PF or whatever) and then use kthread_data()
to get the lo struct which contains the target css.

Oh, and let's do csses instead of tasks, for consistency, correctness
(please see below) and performance (csses are cheaper to pin).

> @@ -502,11 +504,16 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
>  	cmd->iocb.ki_complete = lo_rw_aio_complete;
>  	cmd->iocb.ki_flags = IOCB_DIRECT;
>
> +	if (cmd->cgroup_task)
> +		current->cgroup_task = cmd->cgroup_task;
> +
>  	if (rw == WRITE)
>  		ret = call_write_iter(file, &cmd->iocb, &iter);
>  	else
>  		ret = call_read_iter(file, &cmd->iocb, &iter);
>
> +	current->cgroup_task = NULL;
> +
>  	if (ret != -EIOCBQUEUED)
>  		cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
>  	return 0;

Hmm... why do we need double forwarding (original issuer -> aio cmd ->
ios) in loop?  If we need this, doesn't it mean that we're missing
ownership forwarding in the aio paths and should make the forwarding
happen there instead?

> @@ -1705,6 +1712,11 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		break;
>  	}
>
> +	if (cmd->use_aio) {
> +		cmd->cgroup_task = current;
> +		get_task_struct(current);
> +	} else
> +		cmd->cgroup_task = NULL;

What if %current isn't the original issuer of the io?  It could be a
writeback worker trying to flush to a loop device, and we don't want to
attribute those ios to the writeback worker.  We should forward the
bi_css, not the current task.

Thanks!

--
tejun
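
A minimal sketch of how the "flag + kthread_data()" variant above could
look, for illustration only.  PF_LOOP_WORKER and lo->lo_css are
hypothetical names, and it assumes the loop worker kthread is created
with the loop_device as its kthread data; only task_get_css(),
css_get() and kthread_data() are existing APIs here.

	/*
	 * Sketch only: PF_LOOP_WORKER and lo->lo_css are hypothetical.
	 * Returns a pinned css to attribute the io to.
	 */
	static struct cgroup_subsys_state *current_io_css(void)
	{
		if (current->flags & PF_LOOP_WORKER) {
			/* assumes the worker kthread was created with the
			 * loop_device as its kthread data */
			struct loop_device *lo = kthread_data(current);

			css_get(lo->lo_css);
			return lo->lo_css;
		}

		/* task_get_css() already returns a pinned css */
		return task_get_css(current, io_cgrp_id);
	}

bio_associate_current() could then set bio->bi_css = current_io_css()
without growing task_struct by a pointer.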
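
Similarly, a sketch of forwarding the css taken from the original bio
rather than from %current, assuming a new "css" field in struct
loop_cmd (the field and the helper names are hypothetical, and the
CONFIG_BLK_CGROUP ifdefs are omitted for brevity):

	/* Sketch only: cmd->css is an assumed new field in struct loop_cmd. */
	static void loop_cmd_get_css(struct loop_cmd *cmd, struct request *rq)
	{
		struct bio *bio = rq->bio;

		/*
		 * Take the css from the bio, not from %current: %current may
		 * be a writeback worker flushing to the loop device, and its
		 * cgroup isn't the one that should be charged.
		 */
		if (bio && bio->bi_css) {
			cmd->css = bio->bi_css;
			css_get(cmd->css);
		} else {
			cmd->css = NULL;
		}
	}

	static void loop_cmd_put_css(struct loop_cmd *cmd)
	{
		if (cmd->css) {
			css_put(cmd->css);
			cmd->css = NULL;
		}
	}

loop_queue_rq() would call loop_cmd_get_css() where the patch currently
does get_task_struct(current), and the submission/completion path would
drop the reference with loop_cmd_put_css().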