Eric, On 3/16/22 7:26 PM, Eric Wheeler wrote: > [Some people who received this message don't often get email from linux-block@xxxxxxxxxxxxxxxxxx. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification.] > > Hi Ming, > > I was studying the loop.c DIO & AIO changes you made back in 2015 that > increased loop performance and reduced the memory footprint > (bc07c10a3603a5ab3ef01ba42b3d41f9ac63d1b6). > > I have a few questions if you are able to comment, here is a quick > summary: > > The direct IO path starts by queuing the work: > > .queue_rq = loop_queue_rq: > > -> loop_queue_work(lo, cmd); > -> INIT_WORK(&worker->work, loop_workfn); > ... queue_work(lo->workqueue, work); > > Then from within the workqueue: > > -> loop_workfn() > -> loop_process_work(worker, &worker->cmd_list, worker->lo); > -> loop_handle_cmd(cmd); > -> do_req_filebacked(lo, blk_mq_rq_from_pdu(cmd) ); > -> lo_rw_aio(lo, cmd, pos, READ) // (or WRITE) > > From here the kiocb is setup and this is the 5.17-rc8 code at the > bottom of lo_rw_aio() when it sets up the dispatch to the filesystem: > > cmd->iocb.ki_pos = pos; > cmd->iocb.ki_filp = file; > cmd->iocb.ki_complete = lo_rw_aio_complete; > cmd->iocb.ki_flags = IOCB_DIRECT; > cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); > > if (rw == WRITE) > ret = call_write_iter(file, &cmd->iocb, &iter); > else > ret = call_read_iter(file, &cmd->iocb, &iter); > > lo_rw_aio_do_completion(cmd); > > if (ret != -EIOCBQUEUED) > lo_rw_aio_complete(&cmd->iocb, ret); > > > After having called `call_read_iter` it is in the filesystem's > handler. > > Since ki_complete is defined, does that mean the filesystem will _always_ > take these in and always queue these internally and return -EIOCBQUEUED > from call_read_iter()? Another way to ask: if ki_complete!=NULL, can a > filesystem ever behave synchronously? (Is there documentation about this > somewhere? I couldn't find anything definitive.) > a non-null ki_complete asks for async I/O and that is what we need to get the higher performance. > > About the cleanup after dispatch at the bottom of lo_rw_aio() from this > code (also shown above): > > lo_rw_aio_do_completion(cmd); > > if (ret != -EIOCBQUEUED) > lo_rw_aio_complete(&cmd->iocb, ret); > > * It appears that lo_rw_aio_do_completion() will `kfree(cmd->bvec)`. If > the filesystem queued the cmd->iocb for internal use, would it have made > a copy of cmd->bvec so that this is safe? > > * If ret != -EIOCBQUEUED, then lo_rw_aio_complete() is called which calls > lo_rw_aio_do_completion() a second time. Now lo_rw_aio_do_completion > does do this ref check, so it _is_ safe: > > if (!atomic_dec_and_test(&cmd->ref)) > return; > > For my own understanding, is this equivalent? > > - lo_rw_aio_do_completion(cmd); > > if (ret != -EIOCBQUEUED) > lo_rw_aio_complete(&cmd->iocb, ret); > + else > + lo_rw_aio_do_completion(cmd); > > > > I think the purpose of refcount is to make sure we free the request in lo_rw_aio_do_completion() whoever finishes last either submission thread or fs completion ctx calling ki_complete() -> lo_rw_aio_complete(). So there are actually three cases :- 1. I/O is successfully queued i.e. call_iter() ret == -EIOCBQUEUED. Case 1 : 1.1 fs completion happnes after we exit from lo_rw_aio() a. submission thread lo_rw_aio() refcnt = 2 b. submission thread lo_rq_aio_do_completion() refcnt = 1 c. fs completion ctx fs->ki_complete()->lo_rw_aio_complete() refcnt = 0 Case 2: 1.2 fs completion happens before we exit lo_rq_aio() a. submission thread lo_rw_aio() refcnt = 2 b. fs completion ctx fs->ki_complete()->lo_rw_aio_complete() refcnt = 1 c. submission thread lo_rq_aio_do_completion() refcnt = 0 2. I/O is not successfully queued i.e. call_iter() ret != -EIOCBQUEUED. Case 3: a. submission thread lo_rw_aio() refcnt = 2 b. submission thread lo_rq_aio_do_completion() refcnt = 1 c. submission thread lo_rw_aio_complete() refcnt = 0 so if you change the position of the call lo_rw_aio_do_completion() it might not work since refcount will be decremented by only once. hope this helps, if it creates more confusion then plz ignore this and follow Ming's reply. -ck