Hi Dave, I remembered about this patch set and realized I didn't get reply from you regarding the following question (see quoted email below for details): Do you really need to defer completion of appending direct IO? Because generic code makes sure appending direct IO isn't async and thus dio_complete() -> xfs_end_io_direct_write() gets called directly from do_blockdev_direct_IO(). I.e. from a normal context and not from interrupt. I've already addressed rest of your comments so this is the only item that is remaining. On Tue 16-07-13 23:00:27, Jan Kara wrote: > > > -static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is_async) > > > +static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, > > > + bool is_async) > > > { > > > ssize_t transferred = 0; > > > > > > @@ -258,19 +262,26 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is > > > if (ret == 0) > > > ret = transferred; > > > > > > - if (dio->end_io && dio->result) { > > > - dio->end_io(dio->iocb, offset, transferred, > > > - dio->private, ret, is_async); > > > - } else { > > > - inode_dio_done(dio->inode); > > > - if (is_async) > > > - aio_complete(dio->iocb, ret, 0); > > > - } > > > + if (dio->end_io && dio->result) > > > + dio->end_io(dio->iocb, offset, transferred, dio->private); > > > > Ok, so by here we are assuming all filesystem IO completion > > processing is completed. > Yes. > > > > + > > > + inode_dio_done(dio->inode); > > > + if (is_async) > > > + aio_complete(dio->iocb, ret, 0); > > > > > > + kmem_cache_free(dio_cache, dio); > > > return ret; > > > > Hmmm. I started wonder if this is valid, because XFS is supposed to > > be doing workqueue based IO completion for appending writes and they > > are supposed to be run in a workqueue. > > > > But, looking at the XFS code, there's actually a bug in the direct > > IO completion code and we are not defering completion to a workqueue > > like we should be for the append case. This code in > > xfs_finish_ioend() demonstrates the intent: > > > > if (ioend->io_type == XFS_IO_UNWRITTEN) > > queue_work(mp->m_unwritten_workqueue, &ioend->io_work); > > else if (ioend->io_append_trans || > > >>>>>> (ioend->io_isdirect && xfs_ioend_is_append(ioend))) > > queue_work(mp->m_data_workqueue, &ioend->io_work); > > > > The problem is that we only ever call xfs_finish_ioend() if is_async > > is true, and that will never be true for direct IO beyond the current > > EOF. That's a bug, and is Bad(tm). > > > > [ Interestingly, it turns out that dio->is_async is never set for > > writes beyond EOF because of filesystems that update file sizes > > before data IO completion occurs (stale data exposure issues). For > > XFS, that can't happen and so dio->is_async could always be set.... ] > > > > What this means is that there's a condition for work queue deferral > > of DIO IO completion that this patch doesn't handle. Setting deferral > > only on unwritten extents like this: > > > > > @@ -1292,8 +1281,10 @@ __xfs_get_blocks( > > > if (create || !ISUNWRITTEN(&imap)) > > > xfs_map_buffer(inode, bh_result, &imap, offset); > > > if (create && ISUNWRITTEN(&imap)) { > > > - if (direct) > > > + if (direct) { > > > bh_result->b_private = inode; > > > + set_buffer_defer_completion(bh_result); > > > + } > > > set_buffer_unwritten(bh_result); > > > } > > > } > > > > misses that case. I suspect Christoph's original patch predated the > > changes to XFS that added transactional file size updates to IO > > completion and so didn't take it into account. > OK, thanks for catching this. Doing the i_size check in > _xfs_get_blocks() would be somewhat cumbersome I presume so I guess we can > handle that case by adding __blockdev_direct_IO() flag to defer dio > completion to a workqueue. XFS can then set the flag when it sees it needs > and i_size update. OK? > > > > @@ -1390,9 +1381,7 @@ xfs_end_io_direct_write( > > > struct kiocb *iocb, > > > loff_t offset, > > > ssize_t size, > > > - void *private, > > > - int ret, > > > - bool is_async) > > > + void *private) > > > { > > > struct xfs_ioend *ioend = iocb->private; > > > > > > @@ -1414,17 +1403,10 @@ xfs_end_io_direct_write( > > > > > > ioend->io_offset = offset; > > > ioend->io_size = size; > > > - ioend->io_iocb = iocb; > > > - ioend->io_result = ret; > > > if (private && size > 0) > > > ioend->io_type = XFS_IO_UNWRITTEN; > > > > > > - if (is_async) { > > > - ioend->io_isasync = 1; > > > - xfs_finish_ioend(ioend); > > > - } else { > > > - xfs_finish_ioend_sync(ioend); > > > - } > > > + xfs_finish_ioend_sync(ioend); > > > } > > > > As i mentioned, the existing code here is buggy. What we should be > > doing here is: > > > > if (is_async) { > > ioend->io_isasync = 1; > > xfs_finish_ioend(ioend); > > } else if (xfs_ioend_is_append(ioend))) { > > xfs_finish_ioend(ioend); > > } else { > > xfs_finish_ioend_sync(ioend); > > } > Umm, I started to wonder why do you actually need to defer the completion > for appending ioend. Because if DIO isn't async, dio_complete() won't be > called from interrupt context but from __blockdev_direct_IO(). So it seems > you can do everything you need directly there without deferring to a > workqueue. But maybe there's some locking subtlety I'm missing. Honza -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html