On Thu, Jul 29, 2010 at 08:53:24PM -0600, Matthew Wilcox wrote: > On Fri, Jul 30, 2010 at 08:45:16AM +1000, Dave Chinner wrote: > > If we get two unaligned direct IO's to the same filesystem block > > that is marked as a new allocation (i.e. buffer_new), then both IOs will > > zero the portion of the block they are not writing data to. As a > > result, when the IOs complete there will be a portion of the block > > that contains zeros from the last IO to complete rather than the > > data that should be there. > > Urgh. Yuck. > > > This is easily manifested by qemu using aio+dio with an unaligned > > guest filesystem - every IO is unaligned and fileystem corruption is > > encountered in the guest filesystem. xfstest 240 (from Eric Sandeen) > > is also a simple reproducer. > > > > To avoid this problem, track unaligned IO that triggers sub-block zeroing and > > check new incoming unaligned IO that require sub-block zeroing against that > > list. If we get an overlap where the start and end of unaligned IOs hit the > > same filesystem block, then we need to block the incoming IOs until the IO that > > is zeroing the block completes. The blocked IO can then continue without > > needing to do any zeroing and hence won't overwrite valid data with zeros. > > Urgh. Yuck. It's better than silent data corruption. > Could we perhaps handle this by making an IO instantiate a page cache > page for partial writes, and forcing that portion of the IO through the > page cache? The second IO would hit the same page and use the existing > O_DIRECT vs page cache paths. I don't want any direct IO for XFS to go through the page cache - unaligned or not. using the page cache for the unaligned blocks would also be much worse for performance that this method because it turns unaligned direct IO into 3 IOs - the unaligned head block, the aligned body and the unaligned tail block. It would also be a performance hit you take on every single dio, whereas this way the hit is only taken when an overlap is detected. And besides, such decisions on whether to use buffered IO have to be made high up in the filesystem when we are deciding how to lock the inode for the dio - buffered IO requires exclusive locking, not the shared locking we do for dio writes. That further reduces the performance of unaligned direct IO even when there are no overlaps, and it's a solution that every filesystem needs to implement themselves in some way. I've looked at a couple of different ways to fix this (e.g. passing the unaligned state to get_blocks() to allow the filesystem to serialise there) but they've all died a horrible death of dodgy locking and hacky IO completion detection. not to mention requiring a different solution in every filesystem. This way may be a bit ugly, but it works, is isolated and seems to me to be the least-worst way of solving the problem. It could be made to scale better by making the tracking per-inode, but I don't think that growing the struct inode for this corner case is a good tradeoff, either... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html