Re: [PATCH 3/5] fs: remove ki_nbytes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Feb 02, 2015 at 08:11:12AM +0000, Al Viro wrote:
> On Mon, Feb 02, 2015 at 09:07:29AM +0100, Christoph Hellwig wrote:
> > On Sat, Jan 31, 2015 at 06:08:41AM +0000, Al Viro wrote:
> > > and be done with that.  I'll put drivers/usb/gadget patches into a stable
> > > branch and ask use folks to pull from it - that's the simplest of this
> > > series, actually...
> > 
> > use folks?  Btw, does this mean you patches to switch it over, or do
> 
> usb.
> 
> > you want me to finish my conversion.
> 
> I have f_fs.c conversion and partial legacy/inode.c one.  Will post tomorrow,
> about to fall asleep right now...

FWIW, here's the current situation with read/write methods:

1) a bunch of file_operations instances have only ->read() and/or ->write();
no ->aio_... or ->..._iter versions, no ->splice_write().  Those are basically
device drivers:
	readv is equivalent to loop over segments, with read on each.
	writev is equivalent to loop over segments, with write on each.
	AIO read and write are all synchronous
	splice_write is does kmap on each buffer in turn and write that
Note that splitting a buffer into several adjacent pieces and submitting
a vectored IO is *not* guaranteed to be equivalent to plain IO on the entire
buffer - it has datagram-like semantics and boundaries do matter.

2) regular ->read_iter/->write_iter users:
	->read is new_sync_read() (or NULL, if ->read_iter is NULL)
	->write is new_sync_write() (or NULL, if ->write_iter is NULL)
	->aio_read and ->aio_write are NULL
	->splice_write is iter_splice_write (or NULL, if ->write_iter is NULL)
Those are stream-style ones; what matter is concatenation of the buffers, not
the boundaries.  Right now the following is true: ->read_iter/->write_iter on
a sync kiocb never returns EIOCBQUEUED and iterator argument is advanced
exactly by the amount of data transferred.

3) ones like (2), but with NULL ->splice_write() in spite of present
->write_iter().  AFAICS they can be bulk-converted to (2).  This is what the
bulk of irregularities boil down to.

4) ones like (2) or (3), except ->read() and ->write() are left NULL instead
of doing new_sync_{read,write}().  Mostly equivalent to (2); some codepaths
call ->read() or ->write() directly, instead of going through vfs_... wrappers,
and for those it can be a bit of a headache.  In any case, there are very
few such instances (only 3) and they are trivial to convert to (2).

5) Two instances in fs/9p have ->write_iter() *and* ->write(), the latter not
being new_sync_write().  Ditto for ->read_iter() and ->read().  No other
instance has such combinations.  FWIW, they are *almost* regular - their
->read() and ->write() are new_sync_... unless it's an O_DIRECT file.
It might make sense to try and convert those to ->direct_IO(); as it is, their
O_DIRECT is ignored for readv()/writev(), which is arguably a bug.

6) drivers/char/mem.c stuff; they are equivalent to (2), but somewhat
optimised.  Some of that might make sense to convert to (2); there _are_ hot
paths involved, so we'd better be careful there.

7) bad_file_ops.  It is equivalent to (2); it's also a very special
case.  FWIW, I'm not at all sure that we *need* most of the methods in
there.  We never replace ->f_op of an opened file with that, so if we
end up with that sucker, it happens on fresh open of something that had
its ->i_fop replaced with bad_file_ops.  And that will instantly fail in
bad_file_open().  Why do we need the rest of the methods in there?
AFAICS, we should drop all but ->open().

8) hypfs - AFAICS, converts to (2) easily; I'll do that.

9) FUSE - ->aio_read/->aio_write user, with unusual ->splice_write as
well.  I _think_ it can be converted to (2), but that'll take a bit of
work.

10) sockets; conversion to ->read_iter/->write_iter had been posted to
netdev, ->splice_write() (the only user of ->sendpage(), BTW) is a work
for the next cycle.

11) NTFS with rw support enabled.  ->aio_write() user, needs to be converted
to ->write_iter().  AFAICS, it wasn't particularly high priority for Anton
(along with all rw stuff in fs/ntfs); it doesn't look terribly hard, but then
it wasn't a high priority for me either.

12) virtio_console.  Interesting ->splice_write() there; hadn't looked deep
enough.

13) two odd drivers/usb/gadget instances.  I have conversion for f_fs.c,
but legacy/inode.c (ep_read() et.al.) is trickier.  The problem in there
is that writev() on a single-element vector is *not* equivalent to plain
write().  The former treats the wrong-direction endpoint as EINVAL; the
latter does
                if (usb_endpoint_xfer_isoc(&data->desc)) {
                        mutex_unlock(&data->lock);
                        return -EINVAL;
                }
                DBG (data->dev, "%s halt\n", data->name);
                spin_lock_irq (&data->dev->lock);
                if (likely (data->ep != NULL))
                        usb_ep_set_halt (data->ep);
                spin_unlock_irq (&data->dev->lock);
                mutex_unlock(&data->lock);
                return -EBADMSG;
instead.  IOW, for isochronous endpoints behaviour is the same, but the
rest behaves differently.  If not for that, that sucker would convert
to (3) easily; ->splice_write() semantics is potentially interesting -
the question is where do we want transfer boundaries.  As it is, writev()
collects the entire iovec and shoves it into single transfer; splice() to
that thing ends up with each pipe_buf going in a separate transfer, mergable
or not.  I would really appreciate comments on semantics of that thing from
USB folks...

14) ipathfs and qibfs: seriously different semantics for write and writev/AIO
write.  As in "different set of commands recognized"; AIO write plays like
writev, whether it's vectored or not (and it's always synchronous).
I've no idea who had come up with that... highly innovative API or why
hadn't they simply added two files (it's all on their virtual filesystem,
so they had full control of layout) rather that multiplexing two different
command sets in such a fashion.

15) /dev/snd/pcmC*D*[cp].  Again, different semantics for write and writev,
with the latter wanting nr_seqs equal to the number of channels.  AIO
non-vectored write fails unless there's only one channel.  Not sure how
ALSA userland uses that thing; AIO side is always synchronous, so it might
be simply never used.  FWIW, I'm not sure that write() on a single-channel
one is equivalent to 1-element writev() - silencing-related logics seem to
differ.

That's it.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux