Re: Issues with delalloc->real extent allocation

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 20 Jan 2011 00:31:47 +1100




On Wed, Jan 19, 2011 at 07:03:21AM -0500, Christoph Hellwig wrote:
> On Wed, Jan 19, 2011 at 10:18:32AM +1100, Dave Chinner wrote:
> > Except for the fact we use the delalloc state from the buffer to
> > trigger allocation after mapping. We could probably just use
> > isnullstartblock() for this - only a mapping from a buffer over a
> > delalloc range should return a null startblock.
> 
> isnullstartblock returns true for both DELAYSTARTBLOCK and
> HOLESTARTBLOCK, so we want to be explicit we can check for
> br_startblock == DELAYSTARTBLOCK.

True.

> 
> Note that we also need to explicitly check for br_state == 
> XFS_EXT_UNWRITTEN to set the correct type for the ioend structure.

Yes. As i mentioned on IRC I hacked a  quick prototype together to
test this out and did exactly this. ;)

> 
> > XFS_BMAPI_IGSTATE does a couple of things. Firstly, It prevents
> > xfs_bmapi() from allocating new extents (turns off write mode).
> > This isn't an issue where it is used because neither of the call
> > sites set XFS_BMAPI_WRITE.
> 
> I've been down enough to understand what it does.  But yes, the
> single large I/O mapping might explain why we want it.  The next
> version of xfs_map_blocks will get a comment for it..
> 
> > In fact, we probably should be setting this for normal written
> > extents as well, so that the case:
> > 
> > 	A               B                 C
> > 	+---------------+-----------------+
> > 	    written          unwritten
> > 
> > is also handled with the same optimisation. That makes handling
> > unwritten and overwrites identical, with only delalloc being
> > different. If we assume delalloc when we get a null startblock,
> > then we don't need to look at the buffer state at all for the
> > initial mapping.
> 
> With the current xfs_bmapi code that won't work.  When merging a second
> extent into the first we only add up the br_blockcount.  So for the
> case above we'd get an extent returned that's not marked as unwrittent
> and we wouldn't mark the ioend as unwrittent and thus perform not
> extent conversion after I/O completion.  Just adding XFS_BMAPI_IGSTATE
> blindly for the delalloc case on the other hand is fine, as the
> merging of delayed extents is handled in a different if clause that
> totally ignores XFS_BMAPI_IGSTATE.
>
> The potention fix for this is to always set br_state if one of the
> merged extents is an unwrittent extent. The only other caller is
> xfs_getbmap which does report the unwrittent state to userspace,
> but already is sloppy for merging the other way if BMV_IF_PREALLOC
> is not set, so I can't see how beening sloppy this way to makes any
> difference.

Yup:


@@ -4827,6 +4827,18 @@ xfs_bmapi(
 			ASSERT(mval->br_startoff ==
 			       mval[-1].br_startoff + mval[-1].br_blockcount);
 			mval[-1].br_blockcount += mval->br_blockcount;
+			/*
+			 * if one of the extent types is unwritten, make sure
+			 * the extent is reported as unwritten so the caller
+			 * always takes the correct action for unwritten
+			 * extents. This means we always return consistent
+			 * state regardless of whether we find a written or
+			 * unwritten extent first.
+			 */
+			if (mval[-1].br_state != XFS_EXT_UNWRITTEN &&
+			    mval->br_state == XFS_EXT_UNWRITTEN) {
+				mval[-1].br_state = XFS_EXT_UNWRITTEN;
+			}
 		} else if (n > 0 &&
 			   mval->br_startblock == DELAYSTARTBLOCK &&
 			   mval[-1].br_startblock == DELAYSTARTBLOCK &&

> > It seems to me that with such modifications, the only thing that we
> > are using the bufferhead for is the buffer_uptodate() flag to
> > determine if we should write the block or not. If we can find some
> > other way of holding this state then we don't need bufferheads in
> > the write path at all, right?
> 
> There's really two steps.  First we can stop needing buffers headers
> for the space allocation / mapping.  This is doable with the slight
> tweak of XFS_BMAPI_IGSTATE semantics.
> 
> We still do need to set BH_Delay or BH_Unwrittent for use in
> __block_write_begin and block_truncate_page, but they become completely
> interchangeable at that point.
> 
> If we want to completely get rid of buffers heads things are a bit
> more complicated.  It's doable as shown by the _nobh aops, but we'll
> use quite a bit of per-block state that needs to be replaced by per-page
> state,

Sure, or use a similar method to btrfs which stores dirty state bits
in a separate extent tree. Worst case memory usage is still much
less than a bufferhead per block...

> and we'll lose the way to cache the block number in the buffer
> head.  While we don't make use of that in writepage we do so in
> the write path, although I'm not sure how important it is.  If we
> get your multi-page write work in it probably won't matter any more.

The only place we use bh->b_blocknr is for ioend manipulation. Am I
missing something else?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs