Re: Issues with delalloc->real extent allocation

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Wed, 19 Jan 2011 07:03:21 -0500

On Wed, Jan 19, 2011 at 10:18:32AM +1100, Dave Chinner wrote:
> Except for the fact we use the delalloc state from the buffer to
> trigger allocation after mapping. We could probably just use
> isnullstartblock() for this - only a mapping from a buffer over a
> delalloc range should return a null startblock.

isnullstartblock returns true for both DELAYSTARTBLOCK and
HOLESTARTBLOCK, so we want to be explicit we can check for
br_startblock == DELAYSTARTBLOCK.

Note that we also need to explicitly check for br_state == 
XFS_EXT_UNWRITTEN to set the correct type for the ioend structure.

> XFS_BMAPI_IGSTATE does a couple of things. Firstly, It prevents
> xfs_bmapi() from allocating new extents (turns off write mode).
> This isn't an issue where it is used because neither of the call
> sites set XFS_BMAPI_WRITE.

I've been down enough to understand what it does.  But yes, the
single large I/O mapping might explain why we want it.  The next
version of xfs_map_blocks will get a comment for it..

> In fact, we probably should be setting this for normal written
> extents as well, so that the case:
> 
> 	A               B                 C
> 	+---------------+-----------------+
> 	    written          unwritten
> 
> is also handled with the same optimisation. That makes handling
> unwritten and overwrites identical, with only delalloc being
> different. If we assume delalloc when we get a null startblock,
> then we don't need to look at the buffer state at all for the
> initial mapping.

With the current xfs_bmapi code that won't work.  When merging a second
extent into the first we only add up the br_blockcount.  So for the
case above we'd get an extent returned that's not marked as unwrittent
and we wouldn't mark the ioend as unwrittent and thus perform not
extent conversion after I/O completion.  Just adding XFS_BMAPI_IGSTATE
blindly for the delalloc case on the other hand is fine, as the
merging of delayed extents is handled in a different if clause that
totally ignores XFS_BMAPI_IGSTATE.

The potention fix for this is to always set br_state if one of the
merged extents is an unwrittent extent. The only other caller is
xfs_getbmap which does report the unwrittent state to userspace,
but already is sloppy for merging the other way if BMV_IF_PREALLOC
is not set, so I can't see how beening sloppy this way to makes any
difference.

> It seems to me that with such modifications, the only thing that we
> are using the bufferhead for is the buffer_uptodate() flag to
> determine if we should write the block or not. If we can find some
> other way of holding this state then we don't need bufferheads in
> the write path at all, right?

There's really two steps.  First we can stop needing buffers headers
for the space allocation / mapping.  This is doable with the slight
tweak of XFS_BMAPI_IGSTATE semantics.

We still do need to set BH_Delay or BH_Unwrittent for use in
__block_write_begin and block_truncate_page, but they become completely
interchangeable at that point.

If we want to completely get rid of buffers heads things are a bit
more complicated.  It's doable as shown by the _nobh aops, but we'll
use quite a bit of per-block state that needs to be replaced by per-page
state, and we'll lose the way to cache the block number in the buffer
head.  While we don't make use of that in writepage we do so in
the write path, although I'm not sure how important it is.  If we
get your multi-page write work in it probably won't matter any more.

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs