On Tue, Nov 06, 2012 at 11:54:06AM -0500, Jeff Moyer wrote: > "Darrick J. Wong" <darrick.wong@xxxxxxxxxx> writes: > > > Hi all, > > > > One of our (app) developers noticed that io_submit() takes a very long time to > > return if the program initiates a write to a block device that's been opened in > > O_SYNC and O_DIRECTIO mode. We traced the slowness to blkdev_aio_write, which > > seems to initiate a disk cache flush if __generic_file_aio_write returns a > > positive value or -EIOCBQUEUED. Usually we see -EIOCBQUEUED returned, which > > triggers the flush, hence io_submit() stalls for a long time. That doesn't > > really feel like the intended usage pattern for aio. > > > > This -EIOCBQUEUED case seems a little strange -- if an async io has been queued > > (but not necessarily completed), why would we immediately issue a cache flush? > > This seems like a setup for the flush racing against the write, which means > > that the write could happen after the flush, which would be bad. > > > > Jeff Moyer proposed a patchset last spring[1] that removed the -EIOCBQUEUED > > case and deferred the flush issue to each filesystem's end_io handler. Google > > doesn't find any NAKs, but the patches don't seem to have gone anywhere. Is > > there a technical reason why this patches haven't gone anywhere? > > I never got the sign-off on the xfs bits, and I then got distracted with > other work. I'll see about updating the patch set. > > > Could one establish an end_io handler in blkdev_direct_IO so that async writes > > to an O_SYNC+DIO block device will result in a blkdev_issue_flush before > > aio_complete? That would seem to fix the problem of the write and flush race. > > You mean like patch 1 in that series, or something different? The original patchset doesn't seem to modify the block device aio code -- I think blkdev_direct_IO needs to pass DIO_SYNC_WRITES to __blockdev_direct_IO, and the -EIOCBQUEUED check needs to be taken out of blkdev_aio_write. I also observed a crash in the queue_work call when running against a block device. For block devices, it looks like in do_blockdev_direct_IO, iocb->ki_filp->f_mapping->host is an inode in the bdev filesystem, and iocb->ki_filp->f_dentry->d_inode->i_sb is whichever inode the user accessed (probably devtmpfs or something). The two are of course equal for regular files. Since block devices are (I think) part of their own bdev filesystem, I think it makes more sense if we call queue_work against the flush_wq of the bdev fs, not the fs that just happened to contain the device file. Will send patches after I clean 'em up and test them a bit more. --D > > Cheers, > Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html