Re: [PATCH 3/5] fs: remove ki_nbytes

On Wed, Feb 04, 2015 at 03:30:32PM -0500, Alan Stern wrote:
> > 	* this one.  Note that you are not guaranteed that ep_config() won't
> > be called more than once - two threads might race in write(2), with the loser
> > getting through mutex_lock_interruptible(&data->lock); in ep_config() only
> > after the winner has already gotten through write(), switched ->f_op, returned
> > to userland and started doing read()/write()/etc.  If nothing else,
> > the contents of data->desc and data->hs_desc can be buggered by arbitrary
> > data, no matter how bogus, right as the first thread is doing IO.
> 
> Well, this one certainly can be fixed to avoid altering ->f_op, at the 
> cost of adding an extra check at the start of each I/O operation.
 
> > Um...  readv() is also going through ->aio_read().
> 
> Why does readv() do this but not read()?  Wouldn't it make more sense 
> to have all the read* calls use the same internal interface?

Because there are two partially overlapping classes wrt vector IO semantics:
	1) datagram-style.  Vectored read/write is equivalent to simple
read/write done on each vector component.  And IO boundaries matter - if
your driver treats each write() as a datagram that starts with, e.g., a
fixed-size table followed by an arbitrary amount of data, you will get
very different results from write(fd, buf, 200) and
writev(fd, (struct iovec[2]){{buf, 100}, {buf+100, 100}}, 2).  A _lot_ of
drivers are like that - they supply ->read() and ->write() for single-range
IO and the VFS constructs the rest of the operations out of those.

	2) stream-style.  Vectored read is guaranteed to behave the same
way as a simple read on a range whose size is the sum of the vector element
sizes, except that the data ends up in the ranges covered by the vector
elements instead of a single array.  Vectored write is guaranteed to behave
the same way as a simple write from a buffer containing the concatenation
of the ranges covered by the vector elements.  Boundaries between the
elements do not matter at all.
Regular files on storage filesystems are like that.  So are FIFOs and pipes
and so are sockets.  Even for datagram protocols, boundaries between the
vector elements are ignored; boundaries between syscalls provide the datagram
boundaries, but you can e.g. do writev(udp_socket_fd, (struct iovec[3]){
{const_header, sizeof(const_header)}, {&n, 4}, {s, strlen(s)}}, 3) and have
only one UDP packet sent.  IOW, it's general-purpose scatter-gather for read
and write.

	The last example shows that (2) isn't a subset of (1) - it's not
always possible to call ->write() in a loop and get the right behaviour.
For regular files (and pure stream sockets, etc.) it would work, but for
stuff like UDP sockets it would break.  Moreover, even for regular files on
storage filesystems it would be quite inefficient - we'd need to acquire and
release a bunch of locks, poke through metadata, etc., for each segment.

	As a result, a couple of new methods were added, inventively
called ->readv() and ->writev().  do_sync_read() was supposed to be used
as the ->read() instance - it's "feed a single-element vector to ->readv()",
and similarly for s/read/write/.

	Note that both (1) and (2) satisfy the following sanity requirement -
single-element readv() is always equivalent to single-element read().  You
could violate that by providing completely unrelated ->read() and ->readv(),
but very few drivers went for that - too insane.

	Then, when AIO had been added, those had grown an argument pointing
to iocb (instead of file and ppos - for those we use iocb->ki_filp and
&iocb->ki_pos resp.) and they got renamed into ->aio_read() and ->aio_write().
Note that non-vectored AIO uses the same methods - ->read() and ->write() had
too many instances to convert and most of those would end up just using those
two iocb fields instead of the old arguments - tons of churn for no good
reason.  ->readv() and ->writev() had fewer instances (very common was the
use of generic_file_aio_{read,write}()) and conversion was less painful.
So there were no separate non-vectored ->aio_read() and ->aio_write().  That,
in principle, was a bit of a constraint - you couldn't make single-element
AIO_PREADV behave differently from AIO_PREAD, but nobody had been insane
enough to ask for that.

	Moreover, keeping ->readv() and ->writev() around was pointless.
There is a cheap way to tell whether an ->aio_{read,write}() call is due to
io_submit(2) or to readv()/writev() - is_sync_kiocb(iocb) tells which one it
is, so if a driver really wanted different semantics for sync vs. async, it
could check that.

	So we ended up with ->read()/->write() for sync non-vectored and
->aio_read()/->aio_write() for sync vectored *and* async anything.  Usually
you provide one or the other - NULL ->aio_... means loop calling ->read/write
on each element, NULL ->read/write (or do_sync_... for them - it's the same
thing) means feeding a sync iocb and a single-element vector to ->aio_....
You *can* provide both, but that's rare and almost always pointless.

	These days ->aio_{read,write} is almost gone - replaced with
->{read,write}_iter().  That sucker is equivalent, except that you get a
pointer to struct iov_iter instead of the iovec, nr_segs, size triple.  And
you use linux/uio.h primitives for accessing the data (it might be backed
by something other than userland ranges).  The primitives in there are
not harder to use than copy_..._user(), and you avoid any need to loop over
iovec array, etc.  Those primitives generally don't give a damn about the
range boundaries; the only exception is iov_iter_single_seg_count(), which
tells how much data is left to be consumed in the current segment; very
few things are using it.  is_sync_kiocb() is still available to tell io_submit
from synchronous syscalls.

	I don't believe that it's worth switching the (1) \setminus (2) to
those; it's still too much churn, plus we'd need the loop over segments
somewhere to keep the semantics.  It *can* be expressed by ->read_iter()
and ->write_iter(), but it'll be tons of boilerplate code in a lot of
drivers.  So ->read() and ->write() are there to stay.  However, I think
that we can get to the situation when it's really either one or another -
if you have ->read_iter()/->write_iter(), don't mess with custom ->read()
and ->write().

	Both function/f_fs.c and legacy/inode.c are in class (2) - they
gather data from iovec into temp buffer in ->aio_write() and pass the buffer to 
the code doing actual IO, and on ->aio_read() side they give a buffer to
read into, then arrange for it to be scattered into our iovec upon IO
completion.  And they are doing non-trivial async stuff, so I'd prefer to
switch them completely to ->read_iter/->write_iter.  The only obstacle is
read vs. single-element readv and write vs. single-element writev behaviour
difference.

	For some files it really can't be helped - {ipath,qib}fs has completely
unrelated sets of commands accepted by write and writev, including the
single-element one.  And /dev/snd/pcm*, while somewhat milder, has non-trivial
behaviour differences.  I think the trick suggested by hch (put several
flags in the iocb, encoding the reason for the call; that would allow
telling one from the other) is the best way to deal with those.  But I would
really prefer to avoid that; IMO those examples are simply bad userland APIs.
legacy/inode.c is the third (and last) beast like those two.  And there the
difference is small enough to simply change readv/writev to be like read/write
in that respect, i.e. also halt the endpoint when called on the isochronous
one with wrong direction.