On 2014/02/10, 3:51 PM, "Al Viro" <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
>Egads.  Correct me if I'm wrong, please, but it looks like you have
>
>1) ->aio_write() stash its arguments in an off-stack struct and pass it
>and &iocb->ki_pos to
>
>2) ll_file_io_generic() which calls
>
>3) cl_io_rw_init() which copied ki_pos value passed to it into
>io->u.ci_rw.crw_pos (in another off-stack struct) and calls

The off-stack storage of per-thread info is used to avoid stack overflow
because of the 4kB stack limitation, and to avoid kmalloc/kfree of the
same data structures on every call into the callpath.  Other code in the
kernel handles similar problems by having a separate thread do the
lower-level IO, but that adds context switches to the IO path.  I'm not a
big fan of this, but it is what we live with due to the limited stack
size.  (A rough sketch of this per-thread caching pattern is at the end
of this mail.)

>4) cl_io_init() which calls
>
>5) cl_io_init0() which possibly calls a bunch of instances of
>->coo_io_init(),
>which may or may not return 0; I _hope_ that in this case that's what'll
>happen and we get back to ll_file_io_generic(), where
>
>3') we stash iocb and iov/iovlen into the third off-stack structure (cio)
>and call

The reason for separate IO structures at different parts of the stack is
that the Lustre client code has different upper-level VFS interfaces for
Linux, MacOS, WinNT, and a userspace library.  The kernel iocb is not
available in all of those cases.  Those non-Linux VFS interfaces are
deprecated, and we are looking to remove these abstractions again, but it
will take time.

>4') cl_io_loop() where we (in a loop) call cl_io_iter_init(),
>cl_io_lock() and
>
>5') cl_io_start() which calls a bunch (how large in that case?) of
>
>6') ->cio_start() instances.  Hopefully that'll be vvp_io_write_start()
>which will pass the value of io->u.ci_rw.crw_pos (picked via an
>overlapping field of union) to generic_file_aio_write().  Which, BTW,
>updates ->ki_pos.  Then we return into cl_io_loop(), where
>
>4'') we call cl_io_end(), cl_io_unlock() and
>
>5'') cl_io_rw_advance() which increments ....crw_pos, hopefully in sync
>with what we have in iocb->ki_pos.  And calls a bunch of methods.  And
>return into cl_io_loop(), where
>
>4''') we call cl_io_iter_fini() (_another_ pile of methods called) and
>possibly repeat everything from (4') on (apparently only if nothing had
>been done so far).  Eventually we return into ll_file_io_generic() and
>there
>
>3''') we copy ....crw_pos into iocb->ki_pos.  WTF do we need that?  Hadn't
>generic_file_aio_write() been good enough?
>
>Is that correct?  I have *not* traced it into all methods that might've
>been called in process - stuff called from cl_io_loop() is chock-full of
>those.  Have I missed anything relevant wrt file position handling in
>there?

This internal separation of the IO submission path was done to allow
parallel IO generation for multiple target devices and parallel RPC
generation.  Pretty much everyone (except the original author, who no
longer works on Lustre) agrees that the level of abstraction is too much,
and we need to simplify it so that more than a handful of people can even
understand it.

>You guys really should be forced to hand-draw a call graph for that thing.
>Both as a punishment and a deterrent...

Believe me, this is one of my least favourite and most complex parts of
the Lustre code, and we've had many internal discussions and plans to
clean it up.  Some of that cleanup work has already started, and more is
planned, but it will take a while to test it and push it upstream.
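
To make the per-thread context point above concrete, here is a minimal
sketch of the pattern -- this is not the actual Lustre code, and all of
the names (io_ctx, io_ctx_lookup(), io_ctx_store(), get_io_ctx(),
sketch_write()) are made up for illustration:

    /*
     * Sketch only, NOT the Lustre implementation.  The large IO
     * descriptor lives off the 4kB kernel stack and is cached per
     * thread, so it is neither a stack variable nor a fresh
     * kmalloc()/kfree() on every read or write.
     */
    #include <linux/types.h>
    #include <linux/errno.h>
    #include <linux/slab.h>

    struct io_ctx {
            loff_t  ic_pos;    /* position carried through the IO layers */
            size_t  ic_count;  /* bytes requested for this IO */
            /* ... per-layer private state too large for the stack ... */
    };

    /* hypothetical per-thread lookup/store, e.g. a hash keyed on current */
    struct io_ctx *io_ctx_lookup(void);
    void io_ctx_store(struct io_ctx *ctx);

    static struct io_ctx *get_io_ctx(void)
    {
            struct io_ctx *ctx = io_ctx_lookup();

            if (!ctx) {
                    /* first IO from this thread: allocate once, reuse later */
                    ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
                    if (!ctx)
                            return NULL;
                    io_ctx_store(ctx);
            }
            return ctx;
    }

    static ssize_t sketch_write(loff_t pos, size_t count)
    {
            struct io_ctx *ctx = get_io_ctx();

            if (!ctx)
                    return -ENOMEM;
            ctx->ic_pos = pos;
            ctx->ic_count = count;
            /* hand ctx down through the IO layers instead of stack args */
            return count;
    }

The trade-off is a small amount of memory that stays allocated per thread
between calls, in exchange for keeping the large descriptors off the 4kB
stack and avoiding an allocation and free on every read and write.  That
is also why each layer ends up with its own off-stack structure rather
than plain on-stack function arguments.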
Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division