On 2014/02/10, 3:51 PM, "Al Viro" <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
>Egads.  Correct me if I'm wrong, please, but it looks like you have
>
>1) ->aio_write() stash its arguments in an off-stack struct and pass it
>and &iocb->ki_pos to
>
>2) ll_file_io_generic() which calls
>
>3) cl_io_rw_init() which copied ki_pos value passed to it into
>io->u.ci_rw.crw_pos (in another off-stack struct) and calls

The off-stack storage of per-thread info is used to avoid stack overflow
because of the 4kB stack limitation, and to avoid kmalloc/kfree of the
same data structures on every call into the callpath.  Other code in the
kernel handles similar problems by having a separate thread do the
lower-level IO, but that adds context switches to the IO path.  I'm not a
big fan of this, but it is what we live with due to the limited stack
size.  (A rough sketch of this per-thread caching pattern is at the end
of this mail.)

>4) cl_io_init() which calls
>
>5) cl_io_init0() which possibly calls a bunch of instances of
>->coo_io_init(),
>which may or may not return 0; I _hope_ that in this case that's what'll
>happen and we get back to ll_file_io_generic(), where
>
>3') we stash iocb and iov/iovlen into the third off-stack structure (cio)
>and call

The reason for separate IO structures at different parts of the stack is
that the Lustre client code has different upper-level VFS interfaces for
Linux, MacOS, WinNT, and a userspace library.  The kernel iocb is not
available in all of those cases.  Those non-Linux VFS interfaces are
deprecated, and we are looking to remove these abstractions again, but it
will take time.

>4') cl_io_loop() where we (in a loop) call cl_io_iter_init(),
>cl_io_lock() and
>
>5') cl_io_start() which calls a bunch (how large in that case?) of
>
>6') ->cio_start() instances.  Hopefully that'll be vvp_io_write_start()
>which will pass the value of io->u.ci_rw.crw_pos (picked via an
>overlapping field of union) to generic_file_aio_write().  Which, BTW,
>updates ->ki_pos.  Then we return into cl_io_loop(), where
>
>4'') we call cl_io_end(), cl_io_unlock() and
>
>5'') cl_io_rw_advance() which increments ....crw_pos, hopefully in sync
>with what we have in iocb->ki_pos.  And calls a bunch of methods.  And
>return into cl_io_loop(), where
>
>4''') we call cl_io_iter_fini() (_another_ pile of methods called) and
>possibly repeat everything from (4') on (apparently only if nothing had
>been done so far).  Eventually we return into ll_file_io_generic() and
>there
>
>3''') we copy ....crw_pos into iocb->ki_pos.  WTF do we need that?  Hadn't
>generic_file_aio_write() been good enough?
>
>Is that correct?  I have *not* traced it into all methods that might've
>been called in process - stuff called from cl_io_loop() is chock-full of
>those.  Have I missed anything relevant wrt file position handling in
>there?

This internal separation of the IO submission path was done to allow
parallel IO generation for multiple target devices and parallel RPC
generation.  Pretty much everyone (except the original author, who no
longer works on Lustre) agrees that the level of abstraction is too much,
and we need to simplify it so that more than a handful of people can even
understand it.

>You guys really should be forced to hand-draw a call graph for that thing.
>Both as a punishment and a deterrent...

Believe me, this is one of my least favourite and most complex parts of
the Lustre code, and we've had many internal discussions and plans to
clean it up.  Some of that cleanup work has already started, and more is
planned, but it will take a while to test it and push it upstream.
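
To make the per-thread context point above concrete, here is a minimal
sketch of the pattern -- this is not the actual Lustre code, and all of
the names (io_ctx, io_ctx_lookup(), io_ctx_store(), get_io_ctx(),
sketch_write()) are made up for illustration:

    /*
     * Sketch only, NOT the Lustre implementation.  The large IO
     * descriptor lives off the 4kB kernel stack and is cached per
     * thread, so it is neither a stack variable nor a fresh
     * kmalloc()/kfree() on every read or write.
     */
    #include <linux/types.h>
    #include <linux/errno.h>
    #include <linux/slab.h>

    struct io_ctx {
            loff_t  ic_pos;    /* position carried through the IO layers */
            size_t  ic_count;  /* bytes requested for this IO */
            /* ... per-layer private state too large for the stack ... */
    };

    /* hypothetical per-thread lookup/store, e.g. a hash keyed on current */
    struct io_ctx *io_ctx_lookup(void);
    void io_ctx_store(struct io_ctx *ctx);

    static struct io_ctx *get_io_ctx(void)
    {
            struct io_ctx *ctx = io_ctx_lookup();

            if (!ctx) {
                    /* first IO from this thread: allocate once, reuse later */
                    ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
                    if (!ctx)
                            return NULL;
                    io_ctx_store(ctx);
            }
            return ctx;
    }

    static ssize_t sketch_write(loff_t pos, size_t count)
    {
            struct io_ctx *ctx = get_io_ctx();

            if (!ctx)
                    return -ENOMEM;
            ctx->ic_pos = pos;
            ctx->ic_count = count;
            /* hand ctx down through the IO layers instead of stack args */
            return count;
    }

The trade-off is a small amount of memory that stays allocated per thread
between calls, in exchange for keeping the large descriptors off the 4kB
stack and avoiding an allocation and free on every read and write.  That
is also why each layer ends up with its own off-stack structure rather
than plain on-stack function arguments.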
Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division