On Fri 12-07-13 10:44:21, Dave Chinner wrote:
> On Thu, Jul 11, 2013 at 12:02:18AM +0200, Jan Kara wrote:
> > From: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> > 
> > Add support to the core direct-io code to defer AIO completions to user
> > context using a workqueue. This replaces opencoded and less efficient
> > code in XFS and ext4 and will be needed to properly support O_(D)SYNC
> > for AIO.
> 
> I don't see how this is any more efficient than the deferral
> that XFS already does for direct IO completion - it just changes
> what queues IO for completions.
  I didn't change the changelog so it's up to Christoph to tell. But I can
remove the 'less efficient' part from the changelog if you want.

> And on that topic:
> 
> > Currently this creates a per-superblock unbound workqueue for these
> > completions, which is taken from an earlier patch by Jan Kara. I'm
> > not really convinced about this use and would prefer a "normal" global
> > workqueue with a high concurrency limit, but this needs further discussion.
> 
> Unbound workqueues sacrifice locality for immediate dispatch.
> AIO-DIO performance at the high end is determined by IO submission
> and completion locality - we want submission and completion to occur
> on the same CPU as much as possible so cachelines are not bounced
> around needlessly. An unbound workqueue would seem to be the wrong
> choice on this basis alone.
> 
> Further the use of @max_active = 1 means that it is a global, single
> threaded workqueue. While ext4 might require single threaded
> execution of unwritten extent completion, it would be introducing a
> massive bottleneck into XFS which currently uses the default
> concurrency depth of 256 worker threads per CPU per superblock just
> for unwritten extent conversion.
> 
> Hmmmm. I notice that the next patch then does generic_write_sync()
> in the IO completion handler, too. In XFS that causes log forces and
> will block, so that's yet more concurrency that is required
> for this workqueue. Doing this in a single threaded workqueue and
> marshalling all sync AIO through it is highly unfriendly to
> concurrent IO completion.
> 
> FWIW, in XFS we queue unwritten extent conversion completions on a
> different workqueue to EOF size update completions because the
> latter are small, fast and rarely require IO or get blocked. The
> data IO completion workqueue for EOF updates has the same
> concurrency and depth as the unwritten extent work queue (i.e. 256
> workers per cpu per superblock). So pushing all of this DIO and EOF
> completion work into a single threaded global workqueue that can
> block in every IO completion doesn't seem like a very smart idea to
> me...
  OK, so you'd rather have the workqueue created like:
	alloc_workqueue("dio-sync", WQ_MEM_RECLAIM, 0)
I can certainly try that. ext4 should handle that fine. And I agree with
your arguments for high end devices. I'm just wondering whether it won't
do some harm to a simple SATA drive. Having more threads trying to do
extent conversion in parallel might cause the drive to seek more.
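So just to spell out the two variants we are comparing (the comments are
only my summary of the trade-off you describe, nothing from the patch
itself):

	/* Patch as posted: unbound and max_active = 1, i.e. effectively a
	 * single worker handling all deferred completions, with no CPU
	 * locality for the completion work. */
	wq = alloc_workqueue("dio-sync", WQ_UNBOUND, 1);

	/* Your suggestion: a normal per-CPU workqueue with the default
	 * concurrency limit, plus WQ_MEM_RECLAIM so completions can still
	 * make progress under memory pressure. */
	wq = alloc_workqueue("dio-sync", WQ_MEM_RECLAIM, 0);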
> > -static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is_async)
> > +static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
> > +		bool is_async)
> >  {
> >  	ssize_t transferred = 0;
> > 
> > @@ -258,19 +262,26 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is
> >  	if (ret == 0)
> >  		ret = transferred;
> > 
> > -	if (dio->end_io && dio->result) {
> > -		dio->end_io(dio->iocb, offset, transferred,
> > -			    dio->private, ret, is_async);
> > -	} else {
> > -		inode_dio_done(dio->inode);
> > -		if (is_async)
> > -			aio_complete(dio->iocb, ret, 0);
> > -	}
> > +	if (dio->end_io && dio->result)
> > +		dio->end_io(dio->iocb, offset, transferred, dio->private);
> 
> Ok, so by here we are assuming all filesystem IO completion
> processing is completed.
  Yes.

> > +
> > +	inode_dio_done(dio->inode);
> > +	if (is_async)
> > +		aio_complete(dio->iocb, ret, 0);
> > 
> > +	kmem_cache_free(dio_cache, dio);
> >  	return ret;
> 
> Hmmm. I started to wonder if this is valid, because XFS is supposed to
> be doing workqueue based IO completion for appending writes and they
> are supposed to be run in a workqueue.
> 
> But, looking at the XFS code, there's actually a bug in the direct
> IO completion code and we are not deferring completion to a workqueue
> like we should be for the append case. This code in
> xfs_finish_ioend() demonstrates the intent:
> 
> 	if (ioend->io_type == XFS_IO_UNWRITTEN)
> 		queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
> 	else if (ioend->io_append_trans ||
> >>>>>>		 (ioend->io_isdirect && xfs_ioend_is_append(ioend)))
> 		queue_work(mp->m_data_workqueue, &ioend->io_work);
> 
> The problem is that we only ever call xfs_finish_ioend() if is_async
> is true, and that will never be true for direct IO beyond the current
> EOF. That's a bug, and is Bad(tm).
> 
> [ Interestingly, it turns out that dio->is_async is never set for
> writes beyond EOF because of filesystems that update file sizes
> before data IO completion occurs (stale data exposure issues). For
> XFS, that can't happen and so dio->is_async could always be set.... ]
> 
> What this means is that there's a condition for work queue deferral
> of DIO IO completion that this patch doesn't handle. Setting deferral
> only on unwritten extents like this:
> 
> > @@ -1292,8 +1281,10 @@ __xfs_get_blocks(
> >  		if (create || !ISUNWRITTEN(&imap))
> >  			xfs_map_buffer(inode, bh_result, &imap, offset);
> >  		if (create && ISUNWRITTEN(&imap)) {
> > -			if (direct)
> > +			if (direct) {
> >  				bh_result->b_private = inode;
> > +				set_buffer_defer_completion(bh_result);
> > +			}
> >  			set_buffer_unwritten(bh_result);
> >  		}
> >  	}
> 
> misses that case. I suspect Christoph's original patch predated the
> changes to XFS that added transactional file size updates to IO
> completion and so didn't take it into account.
  OK, thanks for catching this. Doing the i_size check in
_xfs_get_blocks() would be somewhat cumbersome I presume, so I guess we
can handle that case by adding a __blockdev_direct_IO() flag to defer dio
completion to a workqueue. XFS can then set the flag when it sees it needs
an i_size update. OK?
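To be concrete, what I have in mind in fs/direct-io.c is roughly the
following. This is only a sketch - the names (a defer_completion flag plus
a complete_work member in struct dio, set via the new
__blockdev_direct_IO() flag) are made up for illustration, not from the
posted patch:

	/* Finish a deferred AIO completion, including the filesystem's
	 * ->end_io callback, from process context on the per-sb workqueue. */
	static void dio_aio_complete_work(struct work_struct *work)
	{
		struct dio *dio = container_of(work, struct dio, complete_work);

		dio_complete(dio, dio->iocb->ki_pos, 0, true);
	}

	/* Called from the bio end_io path once the last bio of the dio has
	 * completed: punt to the workqueue if the filesystem asked for it,
	 * otherwise complete directly in interrupt context as before. */
	static void dio_finish_async(struct dio *dio)
	{
		if (dio->defer_completion) {
			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
			queue_work(dio->inode->i_sb->s_dio_done_wq,
				   &dio->complete_work);
		} else {
			dio_complete(dio, dio->iocb->ki_pos, 0, true);
		}
	}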
> > @@ -1390,9 +1381,7 @@ xfs_end_io_direct_write(
> >  	struct kiocb		*iocb,
> >  	loff_t			offset,
> >  	ssize_t			size,
> > -	void			*private,
> > -	int			ret,
> > -	bool			is_async)
> > +	void			*private)
> >  {
> >  	struct xfs_ioend	*ioend = iocb->private;
> > 
> > @@ -1414,17 +1403,10 @@ xfs_end_io_direct_write(
> > 
> >  	ioend->io_offset = offset;
> >  	ioend->io_size = size;
> > -	ioend->io_iocb = iocb;
> > -	ioend->io_result = ret;
> >  	if (private && size > 0)
> >  		ioend->io_type = XFS_IO_UNWRITTEN;
> > 
> > -	if (is_async) {
> > -		ioend->io_isasync = 1;
> > -		xfs_finish_ioend(ioend);
> > -	} else {
> > -		xfs_finish_ioend_sync(ioend);
> > -	}
> > +	xfs_finish_ioend_sync(ioend);
> >  }
> 
> As I mentioned, the existing code here is buggy. What we should be
> doing here is:
> 
> 	if (is_async) {
> 		ioend->io_isasync = 1;
> 		xfs_finish_ioend(ioend);
> 	} else if (xfs_ioend_is_append(ioend)) {
> 		xfs_finish_ioend(ioend);
> 	} else {
> 		xfs_finish_ioend_sync(ioend);
> 	}
  Umm, I started to wonder why you actually need to defer the completion
for an appending ioend. Because if DIO isn't async, dio_complete() won't
be called from interrupt context but from __blockdev_direct_IO(). So it
seems you can do everything you need directly there without deferring to a
workqueue. But maybe there's some locking subtlety I'm missing.

> Which means that the new code boils down to:
> 
> 	if (xfs_ioend_is_append(ioend))
> 		xfs_finish_ioend(ioend);
> 	else
> 		xfs_finish_ioend_sync(ioend);
> 
> And now we see the problem with only deferring unwritten IO -
> the new direct IO code will think IO is completed when all we've
> done is queued it to another workqueue.
  I agree. New direct IO completion handlers have to really finish
everything without deferring to a separate workqueue. ext4 had the same
bug but I caught and fixed it there.

> I'm not sure we can handle this in get_blocks - we don't have the
> context to know how to treat allocation beyond the current EOF.
> Indeed, the current block being mapped might not be beyond EOF, but
> some of the IO might be (e.g. lies across an extent boundary),
> so setting the deferred completion in get_blocks doesn't allow the
> entire IO to be treated the same.
> 
> So I think there's a bit of re-thinking needed on aspects of this
> patch.
  An additional flag to __blockdev_direct_IO() should solve this, as I
wrote above (if you really need it).

> > +	 * avoids problems with pseudo filesystems which get initialized
> > +	 * before workqueues can be created
> > +	 */
> > +	if (type->fs_flags & FS_REQUIRES_DEV) {
> > +		s->s_dio_done_wq =
> > +			alloc_workqueue("dio-sync", WQ_UNBOUND, 1);
> 
> The workqueue name needs to be combined with the fs_name so we know
> what filesystem is having problems in sysrq-w output.
  Yes, that's reasonable. However that means we'll have to initialize the
workqueue later. Hum.. I think I will return to the on-demand creation of
the workqueue as I originally had in my patch set - see the rough sketch
in the PS below.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
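PS: To make the on-demand creation concrete, the kind of helper I have in
mind is roughly the following - the helper name sb_init_dio_done_wq() and
the exact flags are only a sketch at this point, not final:

	/* Create sb->s_dio_done_wq lazily, on the first AIO DIO that needs
	 * it, with the filesystem name in the workqueue name so sysrq-w
	 * output shows which filesystem is having problems. */
	static int sb_init_dio_done_wq(struct super_block *sb)
	{
		struct workqueue_struct *wq =
			alloc_workqueue("dio/%s", WQ_MEM_RECLAIM, 0, sb->s_id);

		if (!wq)
			return -ENOMEM;
		/*
		 * Creation can race; if somebody else set up the workqueue
		 * first, free ours and use the existing one.
		 */
		if (cmpxchg(&sb->s_dio_done_wq, NULL, wq) != NULL)
			destroy_workqueue(wq);
		return 0;
	}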