Folks, I've noticed a few suspicious things trying to reproduce the allocate-in-the-middle-of-a-delalloc-extent, Firstly preallocation into delalloc ranges behaves in a unexpected and dangerous way. Preallocation uses unwritten extents to avoid stale data exposure, but when issued into a range that is covered by a delalloc extent, it does not use unwritten extents. The code that causes this is in xfs_bmapi(): /* * Determine state of extent, and the filesystem. * A wasdelay extent has been initialized, so * shouldn't be flagged as unwritten. */ if (wr && xfs_sb_version_hasextflgbit(&mp->m_sb)) { if (!wasdelay && (flags & XFS_BMAPI_PREALLOC)) got.br_state = XFS_EXT_UNWRITTEN; } This seems to be incorrect to me - a "wasdelay" extent has not yet been initialised - there's data in memory, but there is nothing on disk and we may not write it for some time. If we crash after this transaction is written but before any data is written, we expose stale data. Not only that, it allocates the _entire_ delalloc extent that spans the preallocation range, even when the preallocation range is only 1 block and the delalloc extent covers gigabytes. hence we actually expose a much greater range of the file to stale data exposure during a crash than just eh preallocated range. Not good. Secondly, I think we have the same expose-the-entire-delalloc-extent -to-stale-data-exposure problem in ->writepage. This onnne, however, is due to using BMAPI_ENTIRE to allocate the entire delalloc extent the first time any part of it is written to. Even if we are only writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc extent covers gigabytes. So, same problem when we crash. Finally, I think the extsize based problem exposed by test 229 is a also a result of allocating space we have no pages covering in the page cache (triggered by BMAPI_ENTIRE allocation) so the allocated space is never zeroed and hence exposes stale data. A possible solution to all of these problems is to allocate delalloc space as unwritten extents. It's obvious to see how this solves the first two cases, but the last one is a bit less obvious. That one is solved by the fact that unwritten extent conversion does not get extent size aligned, so any block we don't write data for will remain in the unwritten state and therefore correctly return zeros for any read... The issues with this approach are really about the cost of unwritten extent conversion and the impact on low memory operation. The cost is mainly a CPU cost due to delayed logging, but that will be significant if we have to do an unwritten conversion on every IO completion. Batching is definitely possible and would mostly alleviate the overhead, but does add complexity and especially w.r.t. unwritten extent conversion sometimes requiring allocation to complete. The impact on low memory behaviour is less easy to define and handle. Extent conversion can block needing memory, so if we need to complete the extent conversion to free memory we deadlock. For delalloc writes, we can probably mark IO complete and pages clean before we do the conversion in a batched, async manner. That would leaves us with the same structure as we currently have, but adds more complexity because we now need to track extent ranges in "conversion pending" state for subsequent reads and sync operations. i.e. an extent in a pending state is treated like a real extent for the purposes of reading, but conversion needs to be completed during a sync operation. Another possibility for just the ->writepage problem is that we could make delalloc allocation in ->writepage more like and EFI/EFD type operation. That is, we allocate the space and log all the changes in an allocation intent transaction (basically the same as the current allocation transaction) and then add an allocation done transaction that is issued at IO completion that basically manipulation won't require IO to complete. Effectively this would be logging all the xfs_ioend structures. Then in log recovery we simply convert any range that does not have a done transaction to unwritten, thereby preventing stale data exposure. The down side to this approach is that it needs new transactions and log recovery code, so is not backwards compatible. I think is probably a simpler solution with lower overhead compared to allocating unwritten extents everywhere. It doesn't solve the extsize problem, but that probably should be handled up at the ->aio_write level, not at the ->write_page level. Similarly, the preallocation issue could easily be handled with small changes to xfs_bmapi().... I'm sure there are other ways to solve these problems, but these two are the ones that come to mind for me. I'm open to other solutions or ways to improve on these ones, especially if they are simpler. ;) Anyone got any ideas or improvements? Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs