Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict


 



On Feb 10, 2014, at 6:40 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:

On Tue, Feb 11, 2014 at 12:31:39AM +0000, Dilger, Andreas wrote:

The off-stack storage of per-thread info is used to avoid stack overflow
because of the 4kB stack limitation, and to avoid kmalloc/kfree of the
same data structures for every call into the callpath.  Other code in
the kernel handles similar problems by having a separate thread do the
lower level IO, but that adds context switches to the IO path.  I'm not
a big fan of this, but that is what we live with due to limited stack
size.

<wry> Reducing the depth of call chains also might help. </wry>
And yes, I understand the motivation; makes keeping track of what's going on
really unpleasant, though...

4) cl_io_init() which calls

5) cl_io_init0() which possibly calls a bunch of instances of
->coo_io_init(),
which may or may not return 0; I _hope_ that in this case that's what'll
happen and we get back to ll_file_io_generic(), where

Umm...  So am I right about the thing returning 0 in this case and in
case of ->aio_read()?

Yes, cl_io_init() can return one of the following:
 = 0: the IO can go forward
 > 0: the IO is not necessary
 < 0: an error occurred

So it will call aio_read() if cl_io_init() returns zero.
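
To make that branching concrete, here is a minimal stand-alone model of the three-way convention; cl_io_init(), cl_io_loop() and cl_io_fini() below are local stand-ins with simplified (empty) argument lists, not the real Lustre prototypes:

/* Stand-alone model of the three-way cl_io_init() convention described
 * above; the functions are local stand-ins, not the real Lustre code. */
#include <stdio.h>

static int  cl_io_init(void) { return 0; }  /* = 0 go ahead, > 0 skip, < 0 error */
static int  cl_io_loop(void) { return 0; }  /* in Lustre this runs the actual IO */
static void cl_io_fini(void) { }

static int do_io(void)
{
        int rc = cl_io_init();

        if (rc == 0)            /* == 0: the IO can go forward */
                rc = cl_io_loop();
        else if (rc > 0)        /* > 0: the IO is not necessary; report success */
                rc = 0;
        /* rc < 0: error, propagated unchanged */

        cl_io_fini();
        return rc;
}

int main(void)
{
        printf("rc = %d\n", do_io());
        return 0;
}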


5') cl_io_start() which calls a bunch (how large in that case?) of

6') ->cio_start() instances.  Hopefully that'll be vvp_io_write_start()

Again, is that assumption correct?  I'm not interested in criticizing
overall design, etc.; what I am interested in is what does that code
end up doing wrt carrying iov/iovlen/iocb/position through the entire
thing.

Yes, it calls vvp_io_write_start().

Lustre supports data striping, so a file can be composed of multiple OST (Object Storage Target) objects. When reading or writing a striped file, the IO is performed stripe by stripe; this is why you see the loop of cl_io_lock(), cl_io_start(), cl_io_unlock(), etc.:

cl_io_init();                /* set up the IO across all layers */
while (still has data) {
   cl_io_iter_init();        /* carve out the chunk for the current stripe */
   cl_io_lock();             /* take the DLM lock covering that chunk */
   cl_io_start();            /* do the actual IO for that chunk */
   cl_io_end();
   cl_io_unlock();           /* release the lock */
}
cl_io_fini();

Essentially, cl_io_iter_init() splits the IO by stripe, and that is followed by taking the DLM lock, doing the actual IO, releasing the lock, and so on.

Let’s take an example: say a file has two stripes, on OST0 and OST1, and the stripe size is 1MB. When 2MB of data is written to this file starting at offset zero, the above code loops twice to finish the IO.
The first iteration writes [0, 1M) to OST0, and the second writes [1M, 2M) to OST1. As a result, generic_file_aio_write() is called twice.
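
For anyone who wants to see that split spelled out, here is a small stand-alone model of the example (the RAID0-style arithmetic is written from scratch for illustration; it is not copied from the LOV code):

/* Model of the example above: stripe size 1MB, two stripes (OST0/OST1),
 * a 2MB write starting at offset 0.  Prints one line per loop iteration. */
#include <stdio.h>

int main(void)
{
        const long long stripe_size  = 1 << 20;   /* 1MB */
        const int       stripe_count = 2;         /* OST0 and OST1 */
        long long pos = 0, count = 2LL << 20;     /* write [0, 2MB) */

        while (count > 0) {
                /* end of the stripe that 'pos' falls into */
                long long stripe_end = (pos / stripe_size + 1) * stripe_size;
                long long chunk = count < stripe_end - pos ? count
                                                           : stripe_end - pos;
                int ost = (int)((pos / stripe_size) % stripe_count);

                printf("write [%lld, %lld) -> OST%d\n", pos, pos + chunk, ost);
                pos   += chunk;   /* cf. cl_io_rw_advance() */
                count -= chunk;
        }
        return 0;
}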

The reason we split the IO by stripe comes down to cascading problems, which is another story.


which will pass the value of io->u.ci_rw.crw_pos (picked via an
overlapping field of union) to generic_file_aio_write().  Which, BTW,
updates ->ki_pos. Then we return into cl_io_loop(), where

4'') we call cl_io_end(), cl_io_unlock() and

5'') cl_io_rw_advance() which increments ....crw_pos, hopefully in sync
with what we have in iocb->ki_pos.  And calls a bunch of methods.  And
return into cl_io_loop(), where

3''') we copy ....crw_pos into iocb->ki_pos.  WTF do we need that?  Hadn't
generic_file_aio_write() been good enough?

Is it safe to assume that
* these two calls of generic_file_aio_{read,write} will always
get pos equal to iocb->ki_pos

Yes.

* iocb/iov/iovlen will be unchanged since the moment they'd been
passed to ll_file_aio_{read,write}()

Yes. They will be updated after an IO for one stripe is finished.

* ->ki_pos assigment after the call of cl_io_loop() is no-op for
IO_NORMAL case

Yes. But as you have already noticed, the ppos in ll_file_io_generic() is NOT always &iocb->ki_pos, so it isn’t totally wrong to have

“*ppos = io->u.ci_wr.wr.crw_pos;”

in ll_file_io_generic(); take a look at ll_file_splice_read().
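
In case a tiny model helps, the difference looks like this (everything below is a stand-in, not the real Lustre code): with aio_read/aio_write the ppos handed to ll_file_io_generic() is &iocb->ki_pos, so the trailing assignment is redundant, while splice_read supplies its own loff_t and the assignment is the only way the caller sees the advanced position:

/* Model of why the trailing "*ppos = ...crw_pos" assignment matters:
 * it is redundant only when ppos happens to be &iocb->ki_pos. */
#include <stdio.h>

struct kiocb_model { long long ki_pos; };

/* models ll_file_io_generic(): "writes" len bytes, then copies the
 * position it tracked internally back out through ppos */
static long long io_generic(long long *ppos, long long len)
{
        long long crw_pos = *ppos + len;   /* what cl_io_rw_advance() tracked */
        *ppos = crw_pos;
        return len;
}

int main(void)
{
        struct kiocb_model iocb = { .ki_pos = 0 };
        long long splice_pos = 0;

        io_generic(&iocb.ki_pos, 4096);    /* aio path: ppos == &iocb->ki_pos */
        io_generic(&splice_pos, 4096);     /* splice path: a separate loff_t */
        printf("ki_pos = %lld, splice_pos = %lld\n", iocb.ki_pos, splice_pos);
        return 0;
}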

* these calls of generic_file_aio_{read,write} will always have
ll_file_aio_{read,write} in the call chain and can't be reached otherwise?

Yes.


What I want to do is turn ->aio_read()/->aio_write() signature into
kiocb x iov_iter -> ssize_t.  With all instances, including
generic_file_aio_{read,write}() converted to that.   With pos delivered via
iocb and iov/nrsegs via iov_iter.  Supplying that in places where they
are called via method is trivial and for such calls we definitely have
pos == iocb->ki_pos.  Most of the places calling those instances directly

It’s a good idea to use iov_iter{} to replace iov/nrsegs.

Even so, we still have to remember pos in cl_io{} because the file position is used at the LOV layer to determine the stripe boundary.

(not via method) are in other instances of the same methods and pass the
arguments unchanged.  These two in lustre are, in fact, the only real
exceptions.  IF the hypotheses above are correct, we can convert those
two - just store iov_iter reference instead of iov/nrsegs in vvp_io_args and
in ccc_io and that'll do the trick.  Probably put iov_iter alongside the
vti_local_iovec as well, but I really wonder if that's worth the trouble -
might be better to use do_sync_read/do_sync_write and be done with that.
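
For readers following along, the signature change being described boils down to something like the sketch below; the types are forward-declared and simplified so the two shapes can be compared side by side, and the iter-based typedef name is a placeholder rather than whatever the final series ends up calling the methods:

/* Sketch of the "kiocb x iov_iter -> ssize_t" shape; not the literal
 * kernel header. */
#include <sys/types.h>          /* ssize_t */

struct kiocb;                   /* opaque for this sketch */
struct iovec;
struct iov_iter;

/* today's ->aio_read()/->aio_write(): pos is a separate argument and the
 * iovec array plus its segment count are passed explicitly */
typedef ssize_t (*aio_rw_fn)(struct kiocb *iocb, const struct iovec *iov,
                             unsigned long nr_segs, long long pos /* loff_t */);

/* proposed shape: pos travels in iocb->ki_pos and iov/nr_segs travel
 * inside the iov_iter, so instances such as ll_file_aio_{read,write}()
 * would only need to stash an iov_iter reference (e.g. in vvp_io_args
 * and ccc_io, as suggested above) */
typedef ssize_t (*rw_iter_fn)(struct kiocb *iocb, struct iov_iter *iter);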

How do we handle ->splice_read() in this case? We can simplify read and write a little bit, but it seems we have to duplicate some code for splice_read(). Please advise.

Thank you very much for the advice.

Jinshan

Sure, it will slightly increase the stack footprint, but if you are _that_
tight on stack...  Anyway, that's a separate story - we can deal with that
later.

PS: I'm serious about the call graph, ideally along with the stack footprints
of all functions.  At least it would give some idea how much do various
paths contribute...  Of course, it's a real bitch to automate with the
amount of indirect calls you've got there - the tricky part is finding the
set of functions that might be called at given location ;-/




