Issues with delalloc->real extent allocation

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 14 Jan 2011 11:29:00 +1100

Folks,

I've noticed a few suspicious things trying to reproduce the
allocate-in-the-middle-of-a-delalloc-extent,

Firstly preallocation into delalloc ranges behaves in a unexpected
and dangerous way. Preallocation uses unwritten extents to avoid
stale data exposure, but when issued into a range that is covered by
a delalloc extent, it does not use unwritten extents.

The code that
causes this is in xfs_bmapi():

		/*
		 * Determine state of extent, and the filesystem.
		 * A wasdelay extent has been initialized, so
		 * shouldn't be flagged as unwritten.
		 */
		if (wr && xfs_sb_version_hasextflgbit(&mp->m_sb)) {
			if (!wasdelay && (flags & XFS_BMAPI_PREALLOC))
				got.br_state = XFS_EXT_UNWRITTEN;
		}

This seems to be incorrect to me - a "wasdelay" extent has not yet
been initialised - there's data in memory, but there is nothing on
disk and we may not write it for some time. If we crash after this
transaction is written but before any data is written, we expose
stale data.

Not only that, it allocates the _entire_ delalloc extent that spans
the preallocation range, even when the preallocation range is only 1
block and the delalloc extent covers gigabytes. hence we actually
expose a much greater range of the file to stale data exposure
during a crash than just eh preallocated range. Not good.

Secondly, I think we have the same expose-the-entire-delalloc-extent
-to-stale-data-exposure problem in ->writepage. This onnne, however,
is due to using BMAPI_ENTIRE to allocate the entire delalloc extent
the first time any part of it is written to. Even if we are only
writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc
extent covers gigabytes. So, same problem when we crash.

Finally, I think the extsize based problem exposed by test 229 is a
also a result of allocating space we have no pages covering in the
page cache (triggered by BMAPI_ENTIRE allocation) so the allocated
space is never zeroed and hence exposes stale data.

A possible solution to all of these problems is to allocate delalloc
space as unwritten extents.  It's obvious to see how this solves the
first two cases, but the last one is a bit less obvious. That one is
solved by the fact that unwritten extent conversion does not get
extent size aligned, so any block we don't write data for will
remain in the unwritten state and therefore correctly return zeros
for any read...

The issues with this approach are really about the cost of unwritten
extent conversion and the impact on low memory operation. The cost
is mainly a CPU cost due to delayed logging, but that will be
significant if we have to do an unwritten conversion on every IO
completion. Batching is definitely possible and would mostly
alleviate the overhead, but does add complexity and especially
w.r.t. unwritten extent conversion sometimes requiring allocation to
complete.

The impact on low memory behaviour is less easy to define and
handle. Extent conversion can block needing memory, so if we need to
complete the extent conversion to free memory we deadlock. For
delalloc writes, we can probably mark IO complete and pages clean
before we do the conversion in a batched, async manner. That would
leaves us with the same structure as we currently have, but adds
more complexity because we now need to track extent ranges in
"conversion pending" state for subsequent reads and sync operations.
i.e. an extent in a pending state is treated like a real extent for
the purposes of reading, but conversion needs to be completed during
a sync operation.

Another possibility for just the ->writepage problem is that we
could make delalloc allocation in ->writepage more like and EFI/EFD
type operation. That is, we allocate the space and log all the
changes in an allocation intent transaction (basically the same as
the current allocation transaction) and then add an allocation done
transaction that is issued at IO completion that basically
manipulation won't require IO to complete. Effectively this would be
logging all the xfs_ioend structures. Then in log recovery we simply
convert any range that does not have a done transaction to
unwritten, thereby preventing stale data exposure. The down side to
this approach is that it needs new transactions and log recovery
code, so is not backwards compatible.

I think is probably a simpler solution with lower overhead compared
to allocating unwritten extents everywhere. It doesn't solve the
extsize problem, but that probably should be handled up at the
->aio_write level, not at the ->write_page level.  Similarly, the
preallocation issue could easily be handled with small changes to
xfs_bmapi()....

I'm sure there are other ways to solve these problems, but these two
are the ones that come to mind for me.  I'm open to other solutions
or ways to improve on these ones, especially if they are simpler. ;)
Anyone got any ideas or improvements?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs