Re: Issues with delalloc->real extent allocation

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 18 Jan 2011 12:44:51 +1100

On Mon, Jan 17, 2011 at 02:12:40PM -0600, bpm@xxxxxxx wrote:
> Hey Dave,
> 
> On Sat, Jan 15, 2011 at 10:55:49AM +1100, Dave Chinner wrote:
> > On Fri, Jan 14, 2011 at 03:43:34PM -0600, bpm@xxxxxxx wrote:
> > > It also presents a
> > > performance issue which I've tried to resolve by extending
> > > xfs_probe_cluster to probe delalloc extents-- lock up all of the pages
> > > to be converted before performing the allocation and hold those locks
> > > until they are submitted for writeback.  It's not very pretty but it
> > > resolves the corruption.
> > 
> > If we zero the relevant range in the page cache at .aio_write level
> > like we do with xfs_zero_eof or allocate unwritten extents instead,
> > then I don't think that you need to make changes like this. 
> 
> Ganging up pages under lock in xfs_page_state_convert (along with
> exactness in xfs_iomap_write_allocate) was needed to provide exclusion
> with block_prepare_write because zeroing isn't done in the case of
> written extents.
> 
> Converting from delalloc->unwritten has the advantage of
> __xfs_get_blocks setting each buffer 'new' when you write into the page,
> so the zeroing is done properly even if you convert the entire extent to
> unwritten in xfs_vm_writepage instead of just the part you're going to
> write out. 

That's definitely the advantage to using unwritten extents in this
case. However, it also means that all the buffers that were
already set up at the time of delalloc->unwritten as buffer_delay()
are now incorrect - the underlying extent is now unwritten, no
delayed allocation. Hence when it comes to writing buffer_delay()
buffers, we would also have to handle the case that they are really
buffer_unwritten(). That could be very nasty....

> However, when converting from unwritten->written in the
> completion handler you still need to convert only the part of the extent
> that was actually written.  That might be a lot of transactions in
> xfs_end_io.

Yes, that was my original concern fo the general case and
why I was looking at an intent/done transaction arrangement rather
than unwritten/conversion arrangement. The "done" part of the
transaction is much less overhead, and we can safely do the
transaction reservation (i.e. the blocking bits) before the IO is
issued so that we keep the loverhead on the completion side down to
an absolute minimum....

However, when tracking down an assert failure reported by Christoph
with the new speculative delalloc code this morning, I've realised
that extent alignment for delayed allocation in xfs_bmapi() is badly
broken. It can result in attempting to set up a delalloc extent
larger than the maximum extent size because it aligns extents by
extending them.  Hence if the extent being allocated is MAXEXTLEN in
length, the extent size alignment will extend it and we'll do nasty
stuff to the extent record we insert into the incore extent tree.

To make matters worse, extent size is not limited to the maximum
supported extent size in the filesystem. The extent size can be
set to (2^32 bytes - 1 fsblock), but for 512 byte block size
filesystems the maximum extent size is 2^21 * 2^9 = 2^30. You can't
align extents to a size larger than that maximum extent size, and if
you try:

# sudo mkfs.xfs -b size=512 -f /dev/vda
...
# xfs_io -f -c "extsize 2g" \
> -c "ftruncate 4g" \
> -c "pwrite 64k 512" \
> -c "bmap -vp" /mnt/test/foo
XFS: Assertion failed: k < (1 << STARTBLOCKVALBITS), file: fs/xfs/xfs_bmap_btree.h, line: 83
....
Call Trace:
  [<ffffffff8145998c>] xfs_bmapi+0x167c/0x1a20
  [<ffffffff81488be2>] xfs_iomap_write_delay+0x1c2/0x340
  [<ffffffff814a7b62>] __xfs_get_blocks+0x402/0x4d0
  [<ffffffff814a7c61>] xfs_get_blocks+0x11/0x20
  [<ffffffff8118f1fb>] __block_write_begin+0x20b/0x5a0
  [<ffffffff8118f6f6>] block_write_begin+0x56/0x90
  [<ffffffff814a7541>] xfs_vm_write_begin+0x41/0x70
  [<ffffffff81118c06>] generic_file_buffered_write+0x106/0x270
  [<ffffffff814ae7a2>] xfs_file_buffered_aio_write+0xd2/0x190
  [<ffffffff814aece2>] xfs_file_aio_write+0x1c2/0x310
  [<ffffffff8116052a>] do_sync_write+0xda/0x120
  [<ffffffff81160850>] vfs_write+0xd0/0x190
  [<ffffffff81161262>] sys_pwrite64+0x82/0xa0
  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b

Which failed this check:

81      static inline xfs_fsblock_t nullstartblock(int k)
82      {
83  >>>>>>      ASSERT(k < (1 << STARTBLOCKVALBITS));
84              return STARTBLOCKMASK | (k);
85      }

You end up with a corrupt extent record in the inncore extent list.

Along the same lines, extsize > ag size for non-rt inode is also
invalid. It won't be handled correctly by the delalloc->unwritten
method because delalloc conversion is currently limited to a single
allocation per .writepage call to avoid races with truncate. Hence
when extsize is larger than an AG, we'll only allocate the part
within an a single AG and hence leave part of the delalloc range in
the delalloc state. With no pages covering this range, it'll never
get written or allocated. Not to mention that it'll  trip assert
failures in xfs_getbmap:

# sudo mkfs.xfs -b size=4k -f-dagsize=1g /dev/vda
...
# xfs_io -f -c "extsize 2g" \
> -c "ftruncate 4g" \
> -c "pwrite 64k 512" \
> -c "bmap -vp" /mnt/test/foo
XFS: Assertion failed: ((iflags & BMV_IF_DELALLOC) != 0) || (map[i].br_startblock != DELAYSTARTBLOCK) file: fs/xfs/xfs_bmap.c, line: 5558
.....

These problems indicate to me that doing extent size alignment for
delayed allocation inside xfs_bmapi() is wrong on many levels.
Moving the "alignment" to zeroing at the .aio_write level solves
these issues except for one detail - ensuring that speculative
delalloc beyond EOF is correctly aligned - this can be handled easily
by the prealloc code.

Hence I'm thinking that the solution we should be implementing is:

	1. Limit extsize values to sane alignment boundaries.

	2. Do extsize alignment for delalloc by zeroing the page
	   cache at the .aio_write level for regions not covered by
	   the write.

	3. Ensure that speculative delalloc is extsize aligned

	4. Rip the delalloc extent alignment code from xfs_bmapi()

	5. Implement alloc intent/done transactions to protect
	   against stale data exposure caused by crashing between
	   allocation and data write completion (*).

I think that covers all the issues we've discussed - have I missed
anything?

Cheers,

Dave.

(*) FWIW, since I proposed the intent/done method, I've realised it
also solves the main difficulty in implementing data-in-inode
safely: how to sycnhronise the unlogged data write when we move the
data out of the inode with the allocation transaction so we know
when we can allow the tail of the log to move past the allocation
transaction or whether we can replay the format change transaction
in log recovery.....

-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs