On 12/13/2012 07:52 AM, Maxim V. Patlasov wrote:
> Hi Brian,
>
> ...
> ...
>>
>
> Thanks a lot for the review and further efforts. Your clean-up patch looks
> fine, but it constrains the scope of the patch-set too much: ordinary
> synchronous dio-s (generated by read(2) and write(2)) cease to be
> optimized. I think such a loss doesn't justify the clean-up. To be sure
> that we are on the same page, I'll try to explain this optimization in
> more detail:
>
> ...
>
> This is a significant optimization for non-trivial fuse userspace
> implementations because: 1) fuse requests might be processed
> concurrently; 2) fuse requests might be merged before processing.
>

I missed that, thanks for the explanation. It makes sense, and I agree that
the cleanup alone isn't worth that tradeoff. Basically, my assumption that
we could trigger fuse request handling based on the type of iocb is
invalid...

> Now, let's proceed to technical details... Your idea to use an iocb
> everywhere instead of a file and rely on its type (sync vs. async) is
> cool. I like it! Unfortunately, to make it usable in both the libaio and
> sync dio cases we'll need to put our hands on generic linux-kernel code
> as well. Let me explain why:
>

Before getting into that, I made some adjustments to the original patch to
use your new fuse_io_priv structure in place of an iocb, primarily to
decouple the request submission mechanism from the type of the user
request. The highlighted differences are:

- Updated fuse_io_priv with an async field and file pointer to preserve
  the current style of interface (i.e., use this instead of an iocb).
- Trigger the type of request submission based on the async field.
- Reintroduce the wait in fuse_direct_IO(), but I also pulled the
  fuse_write_update_size() call up out of __fuse_direct_write() to make
  the separate paths more consistent.

This could probably use more cleanups (i.e., it morphs your fuse_io_priv
into more of a generic I/O descriptor), but it illustrates the idea.
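To make the shape of the change concrete before the full diff at the end of
this mail, the descriptor and the submission branch end up looking roughly
like this (abridged from that diff; the inline comments here are just my
annotations, not part of the patch):

	struct fuse_io_priv {
		int async;		/* submit requests asynchronously? */
		spinlock_t lock;
		unsigned reqs;
		...
		struct kiocb *iocb;
		struct file *file;	/* callers no longer pass a file around */
	};

	/* in fuse_send_read()/fuse_send_write() */
	if (io->async)
		return fuse_async_req_send(fc, req, count, io);

	fuse_request_send(fc, req);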
> Currently, iocb has only one binary attribute - it's either sync or
> async. aio_complete() derives its ancestry (libaio vs. read(2)/write(2))
> from its sync/async type:
>
>> 	if (is_sync_kiocb(iocb)) {
>> 		BUG_ON(iocb->ki_users != 1);
>> 		iocb->ki_user_data = res;
>> 		iocb->ki_users = 0;
>> 		wake_up_process(iocb->ki_obj.tsk);
>> 		return 1;
>> 	}
>
> In other words, if it's sync, it's enough to wake up someone in the
> kernel (who called wait_on_sync_kiocb()) and return. Otherwise,
> aio_complete() performs some steps to wake up the user (who called
> io_getevents()). What we need for fuse is another binary attribute:
> whether the iocb was generated by libaio vs. read(2)/write(2). Then we
> could use aio_complete() and wait_on_sync_kiocb() for any iocb which was
> not generated by libaio. E.g. to process sync dio in an async way, we'd
> allocate the iocb in fuse_direct_IO() on the stack (as you suggested), but
> initialize it as an async iocb and pass it to __fuse_direct_read/write,
> then call wait_on_sync_kiocb() rather than returning -EIOCBQUEUED.
>

Thanks again for the explanation. I actually ran into this when I started
playing around with the set (initially I preserved async submission for
extending writes and attempted to wait on it, iirc), then I undid that
because the optimization wasn't clear to me. On the plus side, I didn't
quite follow the need for the error recovery truncate, but that's clear
now as well. ;)

> The problem is that FUSE tends to be the only user of this 'feature'.
> I'm considering a one-line change to aio_complete():
>
> -	if (is_sync_kiocb(iocb)) {
> +	if (!iocb->ki_ctx) {
>
> Though, I'm not sure whether Miklos, Al Viro, Benjamin and others will
> accept it... Let's try it if you're OK with this approach. At least this
> will cover all fuse dio use-cases (including libaio extending a file) and
> makes the fuse code significantly cleaner.
>

Interesting idea. It seems like it could work to me, but I'm not very
familiar with that code and it might not be worth cluttering that
interface for this particular case. I brought this up briefly with Jeff
Moyer and he pointed me at this set (I just saw zab's mail pointing this
out as well :):

https://lkml.org/lkml/2012/10/22/321

... which looks like it could help. Perhaps we could use that new type of
kiocb with a custom callback to wake a blocking thread.

That said, I wonder if we could do that anyway; i.e., define our own
wait_on_fuse_async() type function that waits on the number of in-flight
requests in the fuse_io_priv. I guess we'd have to complicate the
refcounting with that type of approach, though.
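Something like the following rough, untested sketch is what I have in mind.
Note that the waitq field and the wake_up() in the completion path are
assumptions on top of the patch below; neither exists there yet:

	/*
	 * Hypothetical: assumes fuse_io_priv grows a wait_queue_head_t
	 * (waitq) and that the aio completion path does a
	 * wake_up(&io->waitq) when the last in-flight request finishes.
	 */
	static bool fuse_io_idle(struct fuse_io_priv *io)
	{
		bool idle;

		spin_lock(&io->lock);
		idle = (io->reqs == 0);
		spin_unlock(&io->lock);
		return idle;
	}

	static void wait_on_fuse_async(struct fuse_io_priv *io)
	{
		wait_event(io->waitq, fuse_io_idle(io));
	}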
Brian

> Thanks,
> Maxim

---

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index beb99e9..0cfc4f9 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -92,8 +92,11 @@ static ssize_t cuse_read(struct file *file, char __user *buf, size_t count,
 {
 	loff_t pos = 0;
 	struct iovec iov = { .iov_base = buf, .iov_len = count };
+	struct fuse_io_priv io = {0,};
 
-	return fuse_direct_io(file, &iov, 1, count, &pos, 0, NULL);
+	io.file = file;
+
+	return fuse_direct_io(&io, &iov, 1, count, &pos, 0);
 }
 
 static ssize_t cuse_write(struct file *file, const char __user *buf,
@@ -101,12 +104,15 @@ static ssize_t cuse_write(struct file *file, const char __user *buf,
 {
 	loff_t pos = 0;
 	struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = count };
+	struct fuse_io_priv io = {0,};
+
+	io.file = file;
 
 	/*
 	 * No locking or generic_write_checks(), the server is
 	 * responsible for locking and sanity checks.
 	 */
-	return fuse_direct_io(file, &iov, 1, count, &pos, 1, NULL);
+	return fuse_direct_io(&io, &iov, 1, count, &pos, 1);
 }
 
 static int cuse_open(struct inode *inode, struct file *file)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index d2094e1..29b94e9 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -580,10 +580,8 @@ static void fuse_aio_complete_req(struct fuse_conn *fc, struct fuse_req *req)
 }
 
 static size_t fuse_async_req_send(struct fuse_conn *fc, struct fuse_req *req,
-				  size_t num_bytes, struct kiocb *iocb)
+				  size_t num_bytes, struct fuse_io_priv *io)
 {
-	struct fuse_io_priv *io = iocb->private;
-
 	spin_lock(&io->lock);
 	io->size += num_bytes;
 	io->reqs++;
@@ -597,10 +595,10 @@ static size_t fuse_async_req_send(struct fuse_conn *fc, struct fuse_req *req,
 	return num_bytes;
 }
 
-static size_t fuse_send_read(struct fuse_req *req, struct file *file,
-			     loff_t pos, size_t count, fl_owner_t owner,
-			     struct kiocb *async)
+static size_t fuse_send_read(struct fuse_req *req, struct fuse_io_priv *io,
+			     loff_t pos, size_t count, fl_owner_t owner)
 {
+	struct file *file = io->file;
 	struct fuse_file *ff = file->private_data;
 	struct fuse_conn *fc = ff->fc;
 
@@ -612,8 +610,8 @@ static size_t fuse_send_read(struct fuse_req *req, struct file *file,
 		inarg->lock_owner = fuse_lock_owner_id(fc, owner);
 	}
 
-	if (async)
-		return fuse_async_req_send(fc, req, count, async);
+	if (io->async)
+		return fuse_async_req_send(fc, req, count, io);
 
 	fuse_request_send(fc, req);
 	return req->out.args[0].size;
@@ -635,6 +633,7 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
 
 static int fuse_readpage(struct file *file, struct page *page)
 {
+	struct fuse_io_priv io = {0,};
 	struct inode *inode = page->mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_req *req;
@@ -648,6 +647,8 @@ static int fuse_readpage(struct file *file, struct page *page)
 	if (is_bad_inode(inode))
 		goto out;
 
+	io.file = file;
+
 	/*
 	 * Page writeback can extend beyond the lifetime of the
 	 * page-cache page, so make sure we read a properly synced
@@ -667,7 +668,7 @@ static int fuse_readpage(struct file *file, struct page *page)
 	req->num_pages = 1;
 	req->pages[0] = page;
 	req->page_descs[0].length = count;
-	num_read = fuse_send_read(req, file, pos, count, NULL, NULL);
+	num_read = fuse_send_read(req, &io, pos, count, NULL);
 	err = req->out.h.error;
 	fuse_put_request(fc, req);
 
@@ -869,10 +870,10 @@ static void fuse_write_fill(struct fuse_req *req, struct fuse_file *ff,
 	req->out.args[0].value = outarg;
 }
 
-static size_t fuse_send_write(struct fuse_req *req, struct file *file,
-			      loff_t pos, size_t count, fl_owner_t owner,
-			      struct kiocb *async)
+static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io,
+			      loff_t pos, size_t count, fl_owner_t owner)
 {
+	struct file *file = io->file;
 	struct fuse_file *ff = file->private_data;
 	struct fuse_conn *fc = ff->fc;
 	struct fuse_write_in *inarg = &req->misc.write.in;
@@ -884,8 +885,8 @@ static size_t fuse_send_write(struct fuse_req *req, struct file *file,
 		inarg->lock_owner = fuse_lock_owner_id(fc, owner);
 	}
 
-	if (async)
-		return fuse_async_req_send(fc, req, count, async);
+	if (io->async)
+		return fuse_async_req_send(fc, req, count, io);
 
 	fuse_request_send(fc, req);
 	return req->misc.write.out.size;
@@ -910,11 +911,14 @@ static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
 	size_t res;
 	unsigned offset;
 	unsigned i;
+	struct fuse_io_priv io = {0,};
+
+	io.file = file;
 
 	for (i = 0; i < req->num_pages; i++)
 		fuse_wait_on_page_writeback(inode, req->pages[i]->index);
 
-	res = fuse_send_write(req, file, pos, count, NULL, NULL);
+	res = fuse_send_write(req, &io, pos, count, NULL);
 
 	offset = req->page_descs[0].offset;
 	count = res;
@@ -1252,10 +1256,11 @@ static inline int fuse_iter_npages(const struct iov_iter *ii_p)
 	return min(npages, FUSE_MAX_PAGES_PER_REQ);
 }
 
-ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
+ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
 		       unsigned long nr_segs, size_t count, loff_t *ppos,
-		       int write, struct kiocb *async)
+		       int write)
 {
+	struct file *file = io->file;
 	struct fuse_file *ff = file->private_data;
 	struct fuse_conn *fc = ff->fc;
 	size_t nmax = write ? fc->max_write : fc->max_read;
@@ -1276,7 +1281,7 @@ ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
 		size_t nbytes = min(count, nmax);
 		int err = fuse_get_user_pages(req, &ii, &nbytes, write);
 		if (err) {
-			if (async)
+			if (io->async)
 				fuse_put_request(fc, req);
 
 			res = err;
@@ -1284,13 +1289,11 @@ ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
 		}
 
 		if (write)
-			nres = fuse_send_write(req, file, pos, nbytes, owner,
-					       async);
+			nres = fuse_send_write(req, io, pos, nbytes, owner);
 		else
-			nres = fuse_send_read(req, file, pos, nbytes, owner,
-					      async);
+			nres = fuse_send_read(req, io, pos, nbytes, owner);
 
-		if (!async)
+		if (!io->async)
 			fuse_release_user_pages(req, !write);
 		if (req->out.h.error) {
 			if (!res)
@@ -1306,14 +1309,14 @@ ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
 		if (nres != nbytes)
 			break;
 		if (count) {
-			if (!async)
+			if (!io->async)
 				fuse_put_request(fc, req);
 			req = fuse_get_req(fc, fuse_iter_npages(&ii));
 			if (IS_ERR(req))
 				break;
 		}
 	}
-	if (!IS_ERR(req) && !async)
+	if (!IS_ERR(req) && !io->async)
 		fuse_put_request(fc, req);
 	if (res > 0)
 		*ppos = pos;
@@ -1322,17 +1325,18 @@ ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
 }
 EXPORT_SYMBOL_GPL(fuse_direct_io);
 
-static ssize_t __fuse_direct_read(struct file *file, const struct iovec *iov,
+static ssize_t __fuse_direct_read(struct fuse_io_priv *io, const struct iovec *iov,
 				  unsigned long nr_segs, loff_t *ppos,
-				  struct kiocb *async, size_t count)
+				  size_t count)
 {
 	ssize_t res;
+	struct file *file = io->file;
 	struct inode *inode = file->f_path.dentry->d_inode;
 
 	if (is_bad_inode(inode))
 		return -EIO;
 
-	res = fuse_direct_io(file, iov, nr_segs, count, ppos, 0, async);
+	res = fuse_direct_io(io, iov, nr_segs, count, ppos, 0);
 
 	fuse_invalidate_attr(inode);
 
@@ -1342,25 +1346,23 @@ static ssize_t __fuse_direct_read(struct file *file, const struct iovec *iov,
 static ssize_t fuse_direct_read(struct file *file, char __user *buf,
 				size_t count, loff_t *ppos)
 {
+	struct fuse_io_priv io = {0,};
 	struct iovec iov = { .iov_base = buf, .iov_len = count };
 
-	return __fuse_direct_read(file, &iov, 1, ppos, NULL, count);
+	io.file = file;
+	return __fuse_direct_read(&io, &iov, 1, ppos, count);
 }
 
-static ssize_t __fuse_direct_write(struct file *file, const struct iovec *iov,
-				   unsigned long nr_segs, loff_t *ppos,
-				   struct kiocb *async)
+static ssize_t __fuse_direct_write(struct fuse_io_priv *io, const struct iovec *iov,
+				   unsigned long nr_segs, loff_t *ppos)
 {
+	struct file *file = io->file;
 	struct inode *inode = file->f_path.dentry->d_inode;
 	size_t count = iov_length(iov, nr_segs);
 	ssize_t res;
 
 	res = generic_write_checks(file, ppos, &count, 0);
-	if (!res) {
-		res = fuse_direct_io(file, iov, nr_segs, count, ppos, 1,
-				     async);
-		if (!async && res > 0)
-			fuse_write_update_size(inode, *ppos);
-	}
+	if (!res)
+		res = fuse_direct_io(io, iov, nr_segs, count, ppos, 1);
 
 	fuse_invalidate_attr(inode);
@@ -1373,13 +1375,18 @@ static ssize_t fuse_direct_write(struct file *file, const char __user *buf,
 	struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = count };
 	struct inode *inode = file->f_path.dentry->d_inode;
 	ssize_t res;
+	struct fuse_io_priv io = {0,};
 
 	if (is_bad_inode(inode))
 		return -EIO;
 
-	/* Don't allow parallel writes to the same file */
+	io.file = file;
+
+	/* don't allow parallel writes to the same file */
 	mutex_lock(&inode->i_mutex);
-	res = __fuse_direct_write(file, &iov, 1, ppos, NULL);
+	res = __fuse_direct_write(&io, &iov, 1, ppos);
+	if (res > 0)
+		fuse_write_update_size(inode, *ppos);
 	mutex_unlock(&inode->i_mutex);
 
 	return res;
@@ -2397,7 +2404,7 @@ fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	struct inode *inode;
 	loff_t i_size;
 	size_t count = iov_length(iov, nr_segs);
-	struct kiocb *async_cb = NULL;
+	struct fuse_io_priv *io;
 
 	file = iocb->ki_filp;
 	pos = offset;
@@ -2411,48 +2418,56 @@ fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 		count = i_size - offset;
 	}
 
-	/* cannot write beyond eof asynchronously */
-	if (is_sync_kiocb(iocb) || (offset + count <= i_size)) {
-		struct fuse_io_priv *io;
+	io = kmalloc(sizeof(struct fuse_io_priv), GFP_KERNEL);
+	if (!io)
+		return -ENOMEM;
+	spin_lock_init(&io->lock);
+	io->reqs = 1;
+	io->bytes = -1;
+	io->size = 0;
+	io->offset = offset;
+	io->write = (rw == WRITE);
+	io->err = 0;
+	io->file = file;
+	/*
+	 * By default, we want to optimize all I/Os with async request submission
+	 * to the client filesystem.
+	 */
+	io->async = 1;
+	io->iocb = iocb;
 
-		io = kmalloc(sizeof(struct fuse_io_priv), GFP_KERNEL);
-		if (!io)
-			return -ENOMEM;
+	iocb->private = io;
 
-		spin_lock_init(&io->lock);
-		io->reqs = 1;
-		io->bytes = -1;
-		io->size = 0;
-		io->offset = offset;
-		io->write = (rw == WRITE);
-		io->err = 0;
-		io->iocb = iocb;
-		iocb->private = io;
-
-		async_cb = iocb;
-	}
+	/*
+	 * We cannot asynchronously extend the size of a file. We have no method
+	 * to wait on real async I/O requests, so we must submit this request
+	 * synchronously.
+	 */
+	if (!is_sync_kiocb(iocb) && (offset + count > i_size))
+		io->async = 0;
 
 	if (rw == WRITE)
-		ret = __fuse_direct_write(file, iov, nr_segs, &pos, async_cb);
+		ret = __fuse_direct_write(io, iov, nr_segs, &pos);
 	else
-		ret = __fuse_direct_read(file, iov, nr_segs, &pos, async_cb,
-					 count);
-
-	if (async_cb) {
-		fuse_aio_complete(async_cb->private, ret == count ? 0 : -EIO,
-				  -1);
+		ret = __fuse_direct_read(io, iov, nr_segs, &pos, count);
+	if (io->async) {
+		fuse_aio_complete(io, ret == count ? 0 : -EIO, -1);
+
+		/* we have a non-extending, async request, so return */
 		if (!is_sync_kiocb(iocb))
 			return -EIOCBQUEUED;
 
 		ret = wait_on_sync_kiocb(iocb);
+	} else {
+		kfree(io);
+	}
 
-		if (rw == WRITE) {
-			if (ret > 0)
-				fuse_write_update_size(inode, pos);
-			else if (ret < 0 && offset + count > i_size)
-				fuse_do_truncate(file);
-		}
+	if (rw == WRITE) {
+		if (ret > 0)
+			fuse_write_update_size(inode, pos);
+		else if (ret < 0 && offset + count > i_size)
+			fuse_do_truncate(file);
 	}
 
 	return ret;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 173c959..91b5192 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -221,6 +221,7 @@ enum fuse_req_state {
 
 /** The request IO state (for asynchronous processing) */
 struct fuse_io_priv {
+	int async;
 	spinlock_t lock;
 	unsigned reqs;
 	ssize_t bytes;
@@ -229,6 +230,7 @@ struct fuse_io_priv {
 	bool write;
 	int err;
 	struct kiocb *iocb;
+	struct file *file;
 };
 
 /**
@@ -826,9 +828,9 @@ int fuse_reverse_inval_entry(struct super_block *sb, u64 parent_nodeid,
 
 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 		 bool isdir);
-ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
+ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
 		       unsigned long nr_segs, size_t count, loff_t *ppos,
-		       int write, struct kiocb *async);
+		       int write);
 long fuse_do_ioctl(struct file *file, unsigned int cmd, unsigned long arg,
 		   unsigned int flags);
 long fuse_ioctl_common(struct file *file, unsigned int cmd,