Hello, Shaohua.

On Wed, Aug 23, 2017 at 11:15:15AM -0700, Shaohua Li wrote:
> loop block device handles IO in a separate thread. The actual IO
> dispatched isn't cloned from the IO loop device received, so the
> dispatched IO loses the cgroup context.

Ah, right.  Thanks for spotting this.

> I'm ignoring buffer IO case now, which is quite complicated. Making the
> loop thread aware of cgroup context doesn't really help. The loop
> device only writes to a single file. In current writeback cgroup
> implementation, the file can only belong to one cgroup.

Yeah, writeback tracks the most active cgroup and associates writeback
ios with that cgroup.  For buffered loop devices, I think it'd be fine
to make the loop driver forward the cgroup association and let the
writeback layer determine the matching association as usual.

> For direct IO case, we could workaround the issue in theory. For
> example, say we assign cgroup1 5M/s BW for loop device and cgroup2
> 10M/s. We can create a special cgroup for loop thread and assign at
> least 15M/s for the underlayer disk. In this way, we correctly throttle
> the two cgroups. But this is tricky to setup.
>
> This patch tries to address the issue. When loop thread is handling IO,
> it declares the IO is on behalf of the original task, then in block IO
> we use the original task to find cgroup. The concept probably works for
> other scenarios too, but right now I don't make it generic yet.

The overall approach looks sound to me.  Some comments below.

> @@ -2058,7 +2058,10 @@ int bio_associate_current(struct bio *bio)
>
>  	get_io_context_active(ioc);
>  	bio->bi_ioc = ioc;
> -	bio->bi_css = task_get_css(current, io_cgrp_id);
> +	if (current->cgroup_task)
> +		bio->bi_css = task_get_css(current->cgroup_task, io_cgrp_id);
> +	else
> +		bio->bi_css = task_get_css(current, io_cgrp_id);

Do we need a full pointer field for this?  I think we should be able to
mark lo writers with a flag (PF or whatever) and then use kthread_data()
to get the lo struct which contains the target css.

Oh, and let's do csses instead of tasks, for consistency, correctness
(please see below) and performance (csses are cheaper to pin).

> @@ -502,11 +504,16 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
>  	cmd->iocb.ki_complete = lo_rw_aio_complete;
>  	cmd->iocb.ki_flags = IOCB_DIRECT;
>
> +	if (cmd->cgroup_task)
> +		current->cgroup_task = cmd->cgroup_task;
> +
>  	if (rw == WRITE)
>  		ret = call_write_iter(file, &cmd->iocb, &iter);
>  	else
>  		ret = call_read_iter(file, &cmd->iocb, &iter);
>
> +	current->cgroup_task = NULL;
> +
>  	if (ret != -EIOCBQUEUED)
>  		cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
>  	return 0;

Hmm... why do we need double forwarding (original issuer -> aio cmd ->
ios) in loop?  If we need this, doesn't it mean that we're missing
ownership forwarding in the aio paths and should make the forwarding
happen there instead?

> @@ -1705,6 +1712,11 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		break;
>  	}
>
> +	if (cmd->use_aio) {
> +		cmd->cgroup_task = current;
> +		get_task_struct(current);
> +	} else
> +		cmd->cgroup_task = NULL;

What if %current isn't the original issuer of the io?  It could be a
writeback worker trying to flush to a loop device, and we don't want to
attribute those ios to the writeback worker.  We should forward the
bi_css, not the current task.

Thanks!

--
tejun
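
A minimal sketch of how the "flag + kthread_data()" variant above could
look, for illustration only.  PF_LOOP_WORKER and lo->lo_css are
hypothetical names, and it assumes the loop worker kthread is created
with the loop_device as its kthread data; only task_get_css(),
css_get() and kthread_data() are existing APIs here.

	/*
	 * Sketch only: PF_LOOP_WORKER and lo->lo_css are hypothetical.
	 * Returns a pinned css to attribute the io to.
	 */
	static struct cgroup_subsys_state *current_io_css(void)
	{
		if (current->flags & PF_LOOP_WORKER) {
			/* assumes the worker kthread was created with the
			 * loop_device as its kthread data */
			struct loop_device *lo = kthread_data(current);

			css_get(lo->lo_css);
			return lo->lo_css;
		}

		/* task_get_css() already returns a pinned css */
		return task_get_css(current, io_cgrp_id);
	}

bio_associate_current() could then set bio->bi_css = current_io_css()
without growing task_struct by a pointer.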
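
Similarly, a sketch of forwarding the css taken from the original bio
rather than from %current, assuming a new "css" field in struct
loop_cmd (the field and the helper names are hypothetical, and the
CONFIG_BLK_CGROUP ifdefs are omitted for brevity):

	/* Sketch only: cmd->css is an assumed new field in struct loop_cmd. */
	static void loop_cmd_get_css(struct loop_cmd *cmd, struct request *rq)
	{
		struct bio *bio = rq->bio;

		/*
		 * Take the css from the bio, not from %current: %current may
		 * be a writeback worker flushing to the loop device, and its
		 * cgroup isn't the one that should be charged.
		 */
		if (bio && bio->bi_css) {
			cmd->css = bio->bi_css;
			css_get(cmd->css);
		} else {
			cmd->css = NULL;
		}
	}

	static void loop_cmd_put_css(struct loop_cmd *cmd)
	{
		if (cmd->css) {
			css_put(cmd->css);
			cmd->css = NULL;
		}
	}

loop_queue_rq() would call loop_cmd_get_css() where the patch currently
does get_task_struct(current), and the submission/completion path would
drop the reference with loop_cmd_put_css().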