On Sun, Jan 16, 2011 at 07:28:27PM -0500, Lachlan McIlroy wrote: > ----- Original Message ----- > > Hi Dave, > > > > On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote: > > > I've noticed a few suspicious things trying to reproduce the > > > allocate-in-the-middle-of-a-delalloc-extent, > > ... > > > Secondly, I think we have the same expose-the-entire-delalloc-extent > > > -to-stale-data-exposure problem in ->writepage. This onnne, however, > > > is due to using BMAPI_ENTIRE to allocate the entire delalloc extent > > > the first time any part of it is written to. Even if we are only > > > writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc > > > extent covers gigabytes. So, same problem when we crash. > > > > > > Finally, I think the extsize based problem exposed by test 229 is a > > > also a result of allocating space we have no pages covering in the > > > page cache (triggered by BMAPI_ENTIRE allocation) so the allocated > > > space is never zeroed and hence exposes stale data. > > > > This is precisely the bug I was going after when I hit the > > allocate-in-the-middle-of-a-delalloc-extent bug. This is a race > > between > > block_prepare_write/__xfs_get_blocks and writepage/xfs_page_state > > convert. When xfs_page_state_convert allocates a real extent for a > > page > > toward the beginning of a delalloc extent, XFS_BMAPI converts the > > entire > > delalloc extent. Any subsequent writes into the page cache toward the > > end of this freshly allocated extent will see a written extent instead > > of delalloc and read the block from disk into the page before writing > > over it. If the write does not cover the entire page garbage from disk > > will be exposed into the page cache. > > Ben, take a look at the XFS gap list code in IRIX - this code was > designed specifically to handle this problem. It's also implemented in > several of the cxfs clients too. On entry to a write() it will create a > list of all the holes that exist in the write range before any delayed > allocations are created. The first call to xfs_bmapi() sets up the delayed > allocation for the entire write (even if the bmaps returned don't cover the > full write range). If the write results in a fault from disk then it checks > the gap list to see if any sections of the buffer used to cover a hole and > if so it ignores the state of the extent and zeroes those region(s) in the > buffer that match the pre-existing holes. If the buffer has multiple > non-hole sections that need to be read from disk the whole buffer will be > read from disk and the zeroing of the holes is done post I/O - this reduces > the number of I/Os to be done. The whole delayed allocation can be safely > converted at any time without risk of reading exposed data (assuming no > crash that is). IMO, that caveat exposes the problem with this approach - it is not persistent. It adds a lot of complexity but doesn't solve the underlying problem: that we are exposing real extents via allocation without having written any data to them. > As the write progresses through the range it removes > sections from the front of the gap list so by the time the write is complete > the gap list is empty. The gap list does not have a dedicated lock to > protect it but instead relies on the iolock to ensure that only one write > operation occurs at once (so it's not appropriate for direct I/O). > > > > > <snip> > > > > > I'm sure there are other ways to solve these problems, but these two > > > are the ones that come to mind for me. I'm open to other solutions > > > or ways to improve on these ones, especially if they are simpler. ;) > > > Anyone got any ideas or improvements? > > > > The direction I've been taking is to use XFS_BMAPI_EXACT in > > *xfs_iomap_write_allocate to ensure that an extent covering exactly > > the > > pages we're prepared to write out immediately is allocated and the > > rest > > of the delalloc extent is left as is. This exercises some of the btree > > code more heavily and led to the discovery of the > > allocate-in-the-middle-of-a-delalloc-extent bug. It also presents a > > performance issue which I've tried to resolve by extending > > xfs_probe_cluster to probe delalloc extents-- lock up all of the pages > > to be converted before performing the allocation and hold those locks > > until they are submitted for writeback. It's not very pretty but it > > resolves the corruption. > > > > There is still the issue of crashes... This could be solved by > > converting from delalloc to unwritten in xfs_page_state_convert in > > this > > very exact way and then to written in the io completion handler. Never > > go delalloc->written directly. > > That solution would probably make the gap list redundant too. Yes, I think you are right. The "gaps" would get mapped as unwritten rather than as normal extents and hence get zereod appropriately. It would also preventing exposure after a crash due to being persistent. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs