On Thu, Jul 11, 2013 at 12:02:18AM +0200, Jan Kara wrote:
> From: Christoph Hellwig <hch@xxxxxxxxxxxxx>
>
> Add support to the core direct-io code to defer AIO completions to user
> context using a workqueue. This replaces opencoded and less efficient
> code in XFS and ext4 and will be needed to properly support O_(D)SYNC
> for AIO.

I don't see how this is any more efficient than the deferral that XFS
already does for direct IO completion - it just changes what queues IO
for completion.

And on that topic:

> Currently this creates a per-superblock unbound workqueue for these
> completions, which is taken from an earlier patch by Jan Kara. I'm
> not really convinced about this use and would prefer a "normal" global
> workqueue with a high concurrency limit, but this needs further
> discussion.

Unbound workqueues sacrifice locality for immediate dispatch. AIO-DIO
performance at the high end is determined by IO submission and
completion locality - we want submission and completion to occur on the
same CPU as much as possible so cachelines are not bounced around
needlessly. An unbound workqueue would seem to be the wrong choice on
this basis alone.

Further, the use of @max_active = 1 means that it is a global, single
threaded workqueue. While ext4 might require single threaded execution
of unwritten extent completion, this would introduce a massive
bottleneck into XFS, which currently uses the default concurrency depth
of 256 worker threads per CPU per superblock just for unwritten extent
conversion.

Hmmmm. I notice that the next patch then does generic_write_sync() in
the IO completion handler, too. In XFS that causes log forces and will
block, so that's yet more concurrency that is required for this
workqueue. Doing this in a single threaded workqueue and marshalling
all sync AIO through it is highly unfriendly to concurrent IO
completion.

FWIW, in XFS we queue unwritten extent conversion completions on a
different workqueue to EOF size update completions because the latter
are small, fast and rarely require IO or get blocked. The data IO
completion workqueue for EOF updates has the same concurrency and depth
as the unwritten extent workqueue (i.e. 256 workers per CPU per
superblock).

So pushing all of this DIO and EOF completion work into a single
threaded global workqueue that can block in every IO completion doesn't
seem like a very smart idea to me...

> -static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is_async)
> +static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
> +		bool is_async)
>  {
>  	ssize_t transferred = 0;
>
> @@ -258,19 +262,26 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is
>  	if (ret == 0)
>  		ret = transferred;
>
> -	if (dio->end_io && dio->result) {
> -		dio->end_io(dio->iocb, offset, transferred,
> -			    dio->private, ret, is_async);
> -	} else {
> -		inode_dio_done(dio->inode);
> -		if (is_async)
> -			aio_complete(dio->iocb, ret, 0);
> -	}
> +	if (dio->end_io && dio->result)
> +		dio->end_io(dio->iocb, offset, transferred, dio->private);

Ok, so by here we are assuming all filesystem IO completion processing
is completed.

> +
> +	inode_dio_done(dio->inode);
> +	if (is_async)
> +		aio_complete(dio->iocb, ret, 0);
>
> +	kmem_cache_free(dio_cache, dio);
>  	return ret;

Hmmm. I started wondering whether this is valid, because XFS is
supposed to be doing workqueue based IO completion for appending
writes, and they are supposed to be run in a workqueue.
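For those who don't have the XFS code in front of them, the append
check in question is xfs_ioend_is_append(). From memory it is roughly
this - a sketch only, so check the current xfs_aops.c for the exact
field names:

	/*
	 * Does this ioend extend beyond the on-disk inode size? If so,
	 * completion has to run a transaction to log the size update,
	 * and hence may block.
	 */
	STATIC bool
	xfs_ioend_is_append(
		struct xfs_ioend	*ioend)
	{
		return ioend->io_offset + ioend->io_size >
			XFS_I(ioend->io_inode)->i_d.di_size;
	}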
But, looking at the XFS code, there's actually a bug in the direct IO
completion code: we are not deferring completion to a workqueue like we
should be for the append case. This code in xfs_finish_ioend()
demonstrates the intent:

	if (ioend->io_type == XFS_IO_UNWRITTEN)
		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
	else if (ioend->io_append_trans ||
>>>>>>		 (ioend->io_isdirect && xfs_ioend_is_append(ioend)))
		queue_work(mp->m_data_workqueue, &ioend->io_work);

The problem is that we only ever call xfs_finish_ioend() if is_async is
true, and that will never be true for direct IO beyond the current EOF.
That's a bug, and is Bad(tm).

[ Interestingly, it turns out that dio->is_async is never set for
writes beyond EOF because of filesystems that update file sizes before
data IO completion occurs (stale data exposure issues). For XFS, that
can't happen and so dio->is_async could always be set.... ]

What this means is that there's a condition for workqueue deferral of
DIO completion that this patch doesn't handle. Setting deferral only on
unwritten extents like this:

> @@ -1292,8 +1281,10 @@ __xfs_get_blocks(
>  		if (create || !ISUNWRITTEN(&imap))
>  			xfs_map_buffer(inode, bh_result, &imap, offset);
>  		if (create && ISUNWRITTEN(&imap)) {
> -			if (direct)
> +			if (direct) {
>  				bh_result->b_private = inode;
> +				set_buffer_defer_completion(bh_result);
> +			}
>  			set_buffer_unwritten(bh_result);
>  		}
>  	}

misses that case. I suspect Christoph's original patch predated the
changes to XFS that added transactional file size updates to IO
completion and so didn't take that into account.

> @@ -1390,9 +1381,7 @@ xfs_end_io_direct_write(
>  	struct kiocb		*iocb,
>  	loff_t			offset,
>  	ssize_t			size,
> -	void			*private,
> -	int			ret,
> -	bool			is_async)
> +	void			*private)
>  {
>  	struct xfs_ioend	*ioend = iocb->private;
>
> @@ -1414,17 +1403,10 @@ xfs_end_io_direct_write(
>
>  	ioend->io_offset = offset;
>  	ioend->io_size = size;
> -	ioend->io_iocb = iocb;
> -	ioend->io_result = ret;
>  	if (private && size > 0)
>  		ioend->io_type = XFS_IO_UNWRITTEN;
>
> -	if (is_async) {
> -		ioend->io_isasync = 1;
> -		xfs_finish_ioend(ioend);
> -	} else {
> -		xfs_finish_ioend_sync(ioend);
> -	}
> +	xfs_finish_ioend_sync(ioend);
>  }

As I mentioned, the existing code here is buggy. What we should be
doing here is:

	if (is_async) {
		ioend->io_isasync = 1;
		xfs_finish_ioend(ioend);
	} else if (xfs_ioend_is_append(ioend)) {
		xfs_finish_ioend(ioend);
	} else {
		xfs_finish_ioend_sync(ioend);
	}

Which means that the new code boils down to:

	if (xfs_ioend_is_append(ioend))
		xfs_finish_ioend(ioend);
	else
		xfs_finish_ioend_sync(ioend);

And now we see the problem with only deferring unwritten IO - the new
direct IO code will think the IO is completed when all we've done is
queue it to another workqueue.

I'm not sure we can handle this in get_blocks - we don't have the
context there to know how to treat allocation beyond the current EOF.
Indeed, the current block being mapped might not be beyond EOF, but
some of the IO might be (e.g. it lies across an extent boundary), so
setting the deferred completion flag in get_blocks doesn't allow the
entire IO to be treated the same.

So I think some aspects of this patch need re-thinking.
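If get_blocks can't make the call, maybe the generic code has to make
it at IO completion time instead: snapshot i_size at submission, and
defer if the IO ends beyond it. A sketch of the idea only - the
dio->i_size field below doesn't exist, I've invented it for
illustration, while the workqueue and dio_complete() call follow the
structure of this patch:

	/* at submission, in do_blockdev_direct_IO(): */
	dio->i_size = i_size_read(inode);	/* invented field */

	/* at AIO completion, when deciding how to complete the dio: */
	if ((dio->rw & WRITE) &&
	    dio->iocb->ki_pos + dio->result > dio->i_size) {
		/* EOF size update may block, so punt to the workqueue */
		queue_work(dio->inode->i_sb->s_dio_done_wq,
			   &dio->complete_work);
	} else {
		dio_complete(dio, dio->iocb->ki_pos, 0, true);
	}

That would catch the appending case regardless of how the individual
mappings within the IO were flagged.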
> + * avoids problems with pseudo filesystems which get initialized
> + * before workqueues can be created
> + */
> +	if (type->fs_flags & FS_REQUIRES_DEV) {
> +		s->s_dio_done_wq =
> +			alloc_workqueue("dio-sync", WQ_UNBOUND, 1);

The workqueue name needs to be combined with the fs name so we know
what filesystem is having problems in sysrq-w output.
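That's trivial now that alloc_workqueue() takes a printf-style format
string. Something like this is what I mean (a sketch only - keep the
flags and max_active as whatever this discussion settles on):

	/* tag the workqueue name with the superblock identifier */
	s->s_dio_done_wq = alloc_workqueue("dio-sync/%s",
					   WQ_UNBOUND, 1, s->s_id);

That way a blocked completion can be tied straight back to the
filesystem instance it belongs to.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx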