Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict

On Feb 11, 2014, at 6:25 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:

> On Tue, Feb 11, 2014 at 06:55:59AM +0000, Xiong, Jinshan wrote:
>>> 4) cl_io_init() which calls
>>> 
>>> 5) cl_io_init0() which possibly calls a bunch of instances of
>>> ->coo_io_init(),
>>> which may or may not return 0; I _hope_ that in this case that's what'll
>>> happen and we get back to ll_file_io_generic(), where
>> 
>>> Umm...  So am I right about the thing returning 0 in this case and in
>>> case of ->aio_read()?
>> 
>> Yes, cl_io_init() can return one of the following integers:
>>  == 0: IO can go forward
>>   > 0: IO is not necessary
>>   < 0: an error occurred
>> 
>> So it will call aio_read() if cl_io_init() returns zero.
> 
> Er...  The question is the other way round, actually - will it return
> zero when we arrive there from ll_file_aio_read()?  Or could it happen
> that it returns something positive in that case?

Ah sorry, I missed it.

For ->aio_read(), yes, it will mostly return zero so that the IO can go forward; it won't return a positive value in the read case.

It can also return -ENODATA in the HSM (Hierarchical Storage Management) case, where the data has already been moved to secondary storage, so Lustre has to copy it back and then restart the IO.
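To put that convention in one place, here is a minimal stand-in sketch of how the caller consumes the result (the names are made up for illustration; this is not the actual llite code):

/* Stand-in sketch of the cl_io_init() return convention; none of these
 * symbols are the real Lustre ones. */
#include <errno.h>
#include <sys/types.h>

static int cl_io_init_stub(void)   { return 0; } /* ==0, >0 or <0, as above   */
static ssize_t do_generic_io(void) { return 0; } /* ends in ->aio_read/write  */
static int restore_from_hsm(void)  { return 0; } /* copy data back from HSM   */

static ssize_t io_generic_sketch(void)
{
	ssize_t result;

restart:
	result = cl_io_init_stub();
	if (result == 0) {
		/* == 0: every layer agreed, run the IO */
		result = do_generic_io();
	} else if (result > 0) {
		/* > 0: the IO is not necessary, report success */
		result = 0;
	} else if (result == -ENODATA) {
		/* HSM: the data sits on secondary storage; restore it,
		 * then restart the whole IO from scratch. */
		if (restore_from_hsm() == 0)
			goto restart;
	}
	/* any other negative value is a plain error */
	return result;
}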

> 
>> Let’s take an example here, say a file has two stripes on OST0 and OST1, and stripe size is 1M, so when 2M data is to be written to this file starting from zero, then the above code will loop twice to finish this IO.
>> The first iteration will write [0, 1M) to OST0, and the 2nd one will write [1M, 2M) to OST1. As a result, generic_file_aio_write() will be called twice specifically.
> 
> _Twice_?  Oh, I see... ccc_io_advance(), right?  With iov_iter the whole
> ->cio_advance() thing would die, AFAICS.

Yes, that will simplify it; this is typically a once-it-works-never-put-your-hands-on-it piece of code. I will create a patch for this.
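As a side note, here is a toy user-space illustration of the split in the 2M/two-stripe example above (just the arithmetic, obviously not the Lustre code):

/* Toy illustration of the per-stripe split: 2 stripes of 1 MiB,
 * a 2 MiB write starting at offset 0 takes two iterations. */
#include <stdio.h>

int main(void)
{
	const long long stripe_size = 1 << 20;   /* 1 MiB stripe size   */
	const int stripe_count = 2;              /* OST0 and OST1       */
	long long pos = 0, count = 2ll << 20;    /* write [0, 2 MiB)    */

	while (count > 0) {
		long long in_stripe = stripe_size - (pos % stripe_size);
		long long chunk = count < in_stripe ? count : in_stripe;
		int ost = (int)((pos / stripe_size) % stripe_count);

		/* each iteration ends up in one generic_file_aio_write() call */
		printf("write [%lld, %lld) -> OST%d\n", pos, pos + chunk, ost);
		pos += chunk;
		count -= chunk;
	}
	return 0;
}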

> 
>> Yes. But as you have already noticed, the ppos in ll_file_io_generic() is NOT always &iocb->ki_pos. So it isn’t totally wrong to have
>> 
>> “*ppos = io->u.ci_wr.wr.crw_pos;”
>> 
>> in ll_file_io_generic(), take a look at ll_file_splice_read().
> 
> ->splice_read/->splice_write is another part of that story; eventually,
> I hope to get rid of those guys, when we get polymorphic iov_iter sorted
> out.  Short version of the story: turn iov_iter into a tagged structure with
> variants; primitives working with it would check the tag to decide what to
> do.  Think of *BSD struct uio on steroids; AFAIK, Solaris has kept it as
> well.  One of the cases would be iovec-based, with or without splitting the
> kernel-space iovec into separate case.  That's pretty much what *BSD one
> provides and what iov_iter does right now.  Another case: array of
> <struct page *page, size_t offset, size_t size> triples.  ->splice_write()
> would simply set such an array for all non-empty pipe_buffers and pass
> it to ->write_iter().  That gives a usable generic implementation, suitable
> for pretty much all filesystems.  _Maybe_ it can even be taught to deal
> with page stealing, in a way that would allow to kill ->splice_write() as
> a method - I didn't get to the point where it would be easy to investigate.
> ->splice_read() ought to use another iov_iter case, where copy_page_to_iter()
> would grab a reference to page and plain copy_to_iter() would allocate
> a page if needed and copy over there.

So essentially, iov_iter{} will be extended to carry enough information to perform all kinds of IO operations on file data. Therefore, generic_file_aio_{read,write}() will become the only two interfaces for initiating file IO, with iov_iter{} as one of their parameters.

I will keep an eye on it, and once it is implemented I will revise the Lustre implementation accordingly.
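Just to check my understanding of the tagged-variant idea, the shape I picture is roughly the following (purely illustrative, the names are made up, and this is not a proposal for the real structure):

/* Rough sketch of a "tagged iov_iter": the primitives would check the
 * tag to decide how to copy data.  Not the interface that will land. */
#include <stddef.h>
#include <sys/uio.h>		/* struct iovec */

struct page;			/* opaque here */

enum iter_type_sketch { ITER_IOVEC_SKETCH, ITER_PAGES_SKETCH };

struct page_frag_sketch {
	struct page *page;
	size_t offset;
	size_t size;
};

struct iov_iter_sketch {
	enum iter_type_sketch type;	/* tag checked by the primitives */
	size_t count;			/* bytes remaining in the iterator */
	unsigned long nr_segs;
	union {
		const struct iovec *iov;		/* iovec-based case */
		const struct page_frag_sketch *frags;	/* <page, offset, size>
							 * triples, e.g. built by
							 * ->splice_write() and
							 * handed to ->write_iter() */
	};
};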

> 
> Hell knows; I'm still dealing with preliminary messes and with splice_write
> side of things.  We'll see what set of primitives will it shake down to.
> So far it's been spread between several fsdevel threads (and some bits of
> off-list mail when potentially nasty bugs got caught).  I hope to get enough
> of the queue stable enough and large enough to make it worth discussing.
> Hopefully today or tomorrow there'll be enough meat on it...  In any case
> I'm going to bring it up on LSF/MM.
