On Tue, Feb 11, 2014 at 12:31:39AM +0000, Dilger, Andreas wrote:
> The off-stack storage of per-thread info is used to avoid stack overflow
> because of the 4kB stack limitation, and to avoid kmalloc/kfree of the
> same data structures for every call into the callpath.  Other code in
> the kernel handles similar problems by having a separate thread do the
> lower level IO, but that adds context switches to the IO path.  I'm not
> a big fan of this, but that is what we live with due to limited stack
> size.

<wry> Reducing the depth of call chains also might help. </wry>

And yes, I understand the motivation; it makes keeping track of what's
going on really unpleasant, though...

> > 4) cl_io_init(), which calls
> >
> > 5) cl_io_init0(), which possibly calls a bunch of instances of
> > ->coo_io_init(), which may or may not return 0; I _hope_ that in this
> > case that's what'll happen and we get back to ll_file_io_generic(), where

Umm...  So am I right about the thing returning 0 in this case and in the
case of ->aio_read()?

> > 5') cl_io_start(), which calls a bunch (how large in that case?) of
> >
> > 6') ->cio_start() instances.  Hopefully that'll be vvp_io_write_start()

Again, is that assumption correct?  I'm not interested in criticizing the
overall design, etc.; what I am interested in is what that code ends up
doing wrt carrying iov/iovlen/iocb/position through the entire thing.

> > which will pass the value of io->u.ci_rw.crw_pos (picked via an
> > overlapping field of a union) to generic_file_aio_write().  Which, BTW,
> > updates ->ki_pos.  Then we return into cl_io_loop(), where
> >
> > 4'') we call cl_io_end(), cl_io_unlock() and
> >
> > 5'') cl_io_rw_advance(), which increments ....crw_pos, hopefully in sync
> > with what we have in iocb->ki_pos.  And calls a bunch of methods.  And
> > we return into cl_io_loop(), where
> >
> > 3''') we copy ....crw_pos into iocb->ki_pos.  WTF do we need that?
> > Hadn't generic_file_aio_write() been good enough?

Is it safe to assume that

  * these two calls of generic_file_aio_{read,write} will always get pos
    equal to iocb->ki_pos,
  * iocb/iov/iovlen will be unchanged since the moment they'd been passed
    to ll_file_aio_{read,write}(),
  * the ->ki_pos assignment after the call of cl_io_loop() is a no-op for
    the IO_NORMAL case, and
  * these calls of generic_file_aio_{read,write} will always have
    ll_file_aio_{read,write} in the call chain and can't be reached
    otherwise?

What I want to do is turn the ->aio_read()/->aio_write() signature into
kiocb x iov_iter -> ssize_t, with all instances, including
generic_file_aio_{read,write}(), converted to that - pos delivered via
iocb and iov/nrsegs via iov_iter.  Supplying those in the places where the
instances are called via the method is trivial, and for such calls we
definitely have pos == iocb->ki_pos.  Most of the places calling those
instances directly (not via the method) are in other instances of the same
methods and pass the arguments unchanged.  These two in lustre are, in
fact, the only real exceptions.

IF the hypotheses above are correct, we can convert those two - just store
an iov_iter reference instead of iov/nrsegs in vvp_io_args and in ccc_io
and that'll do the trick (rough sketch below).  Probably put an iov_iter
alongside the vti_local_iovec as well, but I really wonder if that's worth
the trouble - it might be better to use do_sync_read/do_sync_write and be
done with it.  Sure, that will slightly increase the stack footprint, but
if you are _that_ tight on stack...  Anyway, that's a separate story - we
can deal with that later.
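To make that concrete, here is roughly what I mean - illustration only; the
struct below is a stand-in with made-up names, not the actual vvp_io_args
or ccc_io layout:

	/*
	 * Sketch, not a patch: the converted method signature and the
	 * kind of lustre-side per-call state it would allow.
	 */
	#include <linux/types.h>	/* ssize_t */

	struct kiocb;
	struct iov_iter;

	/* ->aio_read()/->aio_write() become kiocb x iov_iter -> ssize_t */
	typedef ssize_t (*rw_iter_fn)(struct kiocb *iocb, struct iov_iter *iter);

	/* stand-in for vvp_io_args/ccc_io: one iter instead of iov + nrsegs */
	struct vvp_io_args_sketch {
		struct kiocb	*via_iocb;	/* pos lives in via_iocb->ki_pos */
		struct iov_iter	*via_iter;	/* replaces the iov/nrsegs pair */
	};

The point being that once the position lives only in iocb->ki_pos and the
segments only in the iter, there is nothing left for those two call sites
to carry out of sync.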
PS: I'm serious about the call graph, ideally along with the stack
footprints of all functions.  At least it would give some idea of how much
the various paths contribute...  Of course, it's a real bitch to automate
with the amount of indirect calls you've got there - the tricky part is
finding the set of functions that might be called at a given location ;-/