On Wed, Feb 04, 2015 at 03:30:32PM -0500, Alan Stern wrote: > > * this one. Note that you are not guaranteed that ep_config() won't > > be called more than once - two threads might race in write(2), with the loser > > getting through mutex_lock_interruptible(&data->lock); in ep_config() only > > after the winner has already gotten through write(), switched ->f_op, returned > > to userland and started doing read()/write()/etc. If nothing else, > > the contents of data->desc and data->hs_desc can be buggered by arbitrary > > data, no matter how bogus, right as the first thread is doing IO. > > Well, this one certainly can be fixed to avoid altering ->f_op, at the > cost of adding an extra check at the start of each I/O operation. > > Um... readv() is also going through ->aio_read(). > > Why does readv() do this but not read()? Wouldn't it make more sense > to have all the read* calls use the same internal interface? Because there are two partially overlapping classes wrt vector IO semantics: 1) datagram-style. Vectored read/write is equivalent to simple read/write done on each vector component. And IO boundaries matter - if your driver treats any write() as datagram that starts e.g. with fixed-sized table in the beginning + arbitrary amount of data following it, you will get very different results from write(fd, buf, 200) and writev(fd, (struct iovec[2]){{buf, 100}, {buf+100, 100}}, 2). A _lot_ of drivers are like that - they supply ->read() and ->write() for single-range IO and VFS construct the rest of operations out of those. 2) stream-style. Vectored read is guaranteed to behave the same way as simple read on a range with size being the sum of vector element sizes, except that the data ends in ranges covered by vector elements instead of a single array. Vectored write is guaranteed to behave the same way as simple write from a buffer containing the concatenation of ranges covered by vector elements. Boundaries between the elements do not matter at all. Regular files on storage filesystems are like that. So are FIFOs and pipes and so are sockets. Even for datagram protocols, boundaries between the vector elements are ignored; boundaries between syscalls provide the datagram boundaries, but you can e.g. do writev(udp_socket_fd, (struct iovec[3]){ {const_header, sizeof(const_header)}, {&n, 4}, {s, strlen(s)}}, 3) and have only one UDP packet sent. IOW, it's general-purpose scatter-gather for read and write. The last example shows that (2) isn't a subset of (1) - it's not always possible to call ->write() in loop and get the right behaviour. For regular files (and pure stream sockets, etc.) it would work, but for stuff like UDP sockets it would break. Moreover, even for regular files on storage filesystems it would be quite inefficient - we'd need to acquire and release a bunch of locks, poke through metadata, etc., for each segment. As the result, there was a couple of new methods added, inventively called ->readv() and ->writev(). do_sync_read() was supposed to be used as ->read() instance - it's "feed a single-element vector to ->readv()" and similar for s/read/write/. Note that both (1) and (2) satisfy the following sanity requirement - single-element readv() is always equivalent to single-element() read(). You could violate that, by providing completely unrelated ->read() and ->readv(), but very few drivers went for that - too insane. Then, when AIO had been added, those had grown an argument pointing to iocb (instead of file and ppos - for those we use iocb->ki_filp and &iocb->ki_pos resp.) and they got renamed into ->aio_read() and ->aio_write(). Note that non-vectored AIO uses the same methods - ->read() and ->write() had too many instances to convert and most of those would end up just using those two iocb fields instead of the old arguments - tons of churn for no good reason. ->readv() and ->writev() had fewer instances (very common was the use of generic_file_aio_{read,write}()) and conversion was less painful. So there was no ->aio_read() and ->aio_write(). That, in principle, was a bit of constraint - you couldn't make single-element AIO_PREADV behave different from AIO_PREAD, but nobody had been insane enough to ask for that. Moreover, keeping ->readv() and ->writev() was pointless. There is cheap way to tell whether ->aio_{read,write}() call is due to io_submit(2) or to readv()/writev() - is_sync_kiocb(iocb) tells which one it is, so if driver really wanted different semantics for sync vs. async, it could check that. So we ended up with ->read/->write for sync non-vectored and ->aio_read()/->aio_write() for sync vectored *and* async anything. Usually you provide one or the other - NULL ->aio_... means loop calling ->read/write on each element, NULL ->read/write (or do_sync_... for them - it's the same thing) means feeding sync iocb and single-element vector to ->aio_.... You *can* provide both, but that's rare and almost always pointless. These days ->aio_{read,write} is almost gone - replaced with ->{read,write}_iter(). That sucker is equivalent, except that you get a pointer struct iov_iter instead iovec, nr_seg, size triple. And you use linux/uio.h primitives for accessing the data (it might be backed by something other than userland ranges). The primitives in there are not harder to use than copy_..._user(), and you avoid any need to loop over iovec array, etc. Those primitives generally don't give a damn about the range boundaries; the only exception is iov_iter_single_seg_count(), which tells how much data is left to be consumed in the current segment; very few things are using it. is_sync_kiocb() is still available to tell io_submit from synchronous syscalls. I don't believe that it's worth switching the (1) \setminus (2) to those; it's still too much churn, plus we'd need the loop over segments somewhere to keep the semantics. It *can* be expressed by ->read_iter() and ->write_iter(), but it'll be tons of boilerplate code in a lot of drivers. So ->read() and ->write() are there to stay. However, I think that we can get to the situation when it's really either one or another - if you have ->read_iter()/->write_iter(), don't mess with custom ->read() and ->write(). Both function/f_fs.c and legacy/inode.c are in class (2) - they gather data from iovec into temp buffer in ->aio_write() and pass the buffer to the code doing actual IO, and on ->aio_read() side they give a buffer to read into, then arrange for it to be scattered into our iovec upon IO completion. And they are doing non-trivial async stuff, so I'd prefer to switch them completely to ->read_iter/->write_iter. The only obstacle is read vs. single-element readv and write vs. single-element writev behaviour difference. For some files it really can't be helped - {ipath,qib}fs has completely unrelated sets of commands accepted by write and writev, including the single-element one. And /dev/snd/pcm*, while somewhat milder, has non-trivial behaviour differences. I think that trick suggested by hch (put several flags in iocb, encoding the reason for the call; that would allow to tell one from another) is the best way to deal with those. But I would really prefer to avoid that; IMO those examples are simply bad userland APIs. legacy/inode.c is the third (and last) beast like those two. And there the difference is small enough to simply change readv/writev to be like read/write in that respect, i.e. also halt the endpoint when called on the isochronous one with wrong direction. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html