On Fri, Sep 12, 2014 at 07:19:15AM +1000, Dave Chinner wrote:
> On Thu, Sep 11, 2014 at 11:20:13AM -0400, Brian Foster wrote:
> > On Thu, Sep 11, 2014 at 02:42:43PM +1000, Dave Chinner wrote:
> > > On Wed, Sep 10, 2014 at 09:20:26AM -0400, Brian Foster wrote:
> > > > Hi all,
> > > > 
> > > > Here's v2 of the collapse clean up. We refactor a bit more via the
> > > > insertion of patch 3, otherwise it's similar to v1. This will see
> > > > some continued testing, but it survived ~500m fsx operations
> > > > overnight.
> > > > 
> > > > Brian
> > > 
> > > I'm not sure about the invalidation patch now. On a 1k block size
> > > filesystem, generic/127 fails with:
> > > 
> > > +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
> > > +collapse range: 1000 to 3000
> > > +do_collapse_range: fallocate: Device or resource busy
> > > 
> > > which indicates we had an invalidation failure. This is probably
> > > exposing some other bug, but I haven't had time to look into it yet
> > > so I don't know.
> > 
> > Yeah, I can reproduce this as well, thanks. I think you're referring
> > to the xfs_free_file_space() patch (5/5)..?
> 
> *nod*
> 
> > FWIW, I don't see the problem without that patch, so it appears that
> > the full pagecache truncate is still covering up a problem somewhere.
> > I'll try to dig into it...
> 
> It's likely that it is leaving a dirty buffer on the page beyond EOF
> as a result of the truncate zeroing the remainder of the page in
> memory.
> 
> If I get a chance I'll look at it this afternoon, as this patchset
> also seems to be causing a marked increase in the number of fsstress
> failures due to stray delalloc blocks in files even on 4k block size
> filesystems (e.g. at unmount).
> 

I think I have a bit of an idea of what's going on here. I'm not sure
that this one is post-eof delalloc related. For reference, here's the
command that reproduces it for me 100% of the time on a 1k block fs
(derived from the fsx trace):

	xfs_io -fc "pwrite 185332 55756" \
		-c "fcollapse 28672 40960" \
		-c "pwrite 133228 63394" \
		-c "fcollapse 0 4096" /mnt/file

The first write extends the file to 241088 bytes. The first collapse
targets a hole, but shrinks the file down to 200128 bytes. The last
page offset of the file is now 196608, and with 1k blocks, EOF falls
within the last block of this page (i.e., within bh offsets
199680-200704). The second write falls just into the first block of
this page.

What I see on the final collapse is that the
invalidate_inode_pages2_range() call in the collapse op fails because
it finds a dirty page at offset 196608. We don't have a launder_page()
callback, so this results in -EBUSY.

Given the above, it seems like the preceding
filemap_write_and_wait_range() call should have handled this. Digging
further, what I see is one or two writepage()->xfs_cluster_write()
sequences along the range of the previous write. xfs_convert_page()
eventually attempts to handle page offset 196608 and breaks out at
offset 197632 (the second bh in the page) near the top of the bh loop:

	...
	if (!(PageUptodate(page) || buffer_uptodate(bh))) {
		done = 1;
		break;
	}
	...

I suspect the page/buffer is in this particular state because of the
previous invalidation (collapse 1), which would have removed the page
from the mapping. That seems reasonable to me, since we've only
written into the first bytes of the page since that invalidation.
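As an aside, to spell out where the -EBUSY itself comes from: as I read
the mm/truncate.c invalidation path (paraphrased from memory, so don't
trust the details verbatim), a dirty page can only be cleaned at this
point via the ->launder_page() callback, and since XFS doesn't provide
one, invalidate_complete_page2() subsequently refuses the still-dirty
page:

	/*
	 * Rough paraphrase of the page loop in
	 * invalidate_inode_pages2_range(), not the verbatim code.
	 * do_launder_page() is a no-op without a ->launder_page()
	 * callback, and invalidate_complete_page2() fails on a page
	 * that is still dirty:
	 */
	ret2 = do_launder_page(mapping, page);
	if (ret2 == 0) {
		if (!invalidate_complete_page2(mapping, page))
			ret2 = -EBUSY;
	}

So any dirty buffer left on the page at invalidation time surfaces as
the "Device or resource busy" failure in the generic/127 output above.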
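To keep the offsets straight, here's how that last page of the
post-collapse file is laid out (4k page, 1k blocks), assuming I have
the arithmetic right:

	page at offset 196608:
		bh 0: 196608-197631	<- second write ends at 196622
		bh 1: 197632-198655	<- xfs_convert_page() breaks out here
		bh 2: 198656-199679
		bh 3: 199680-200703	<- EOF (isize == 200128) lands here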
The problem is that the page somehow has the following buffer_head
state at the point of the EBUSY inval failure:

invalidate_inode_pages2_range(648): state 0x29 block 84 size 1024
invalidate_inode_pages2_range(648): state 0x0 block 18446744073709551615 size 1024
invalidate_inode_pages2_range(648): state 0x0 block 18446744073709551615 size 1024
invalidate_inode_pages2_range(648): state 0x2b block 87 size 1024

The first block is BH_Uptodate|BH_Req|BH_Mapped. I read the middle two
(state 0x0, block == (sector_t)-1) as simply not uptodate and not
mapped..? The last block is the same as the first plus BH_Dirty. That
last block covers offset 199680, while the previous write only reaches
196622, so I'm not sure how it ends up dirty. I think the write path
to this range of the file might be the spot to dig next...

Brian

P.S. As a datapoint from experimentation, this problem doesn't occur if
I ensure that writepage() handles this particular page rather than
xfs_convert_page(). I can do that either by jumping out of
xfs_convert_page() sooner, before the page is set for writeback, or by
hacking xfs_start_page_writeback() to use __test_set_page_writeback()
and keep the write tag such that write_cache_pages() can still find
the page.

This calls out an interesting bit of behavior in xfs_convert_page()
since commit a49935f2: if we handle part of a page, then break out and
mark the page for writeback without handling the rest, we still clear
the PAGECACHE_TAG_TOWRITE tag, so writeback won't finish off the page
during the current iteration (for WB_SYNC_ALL). It's not clear to me
whether this contributes to the problem in this particular case, but
it seems like an independent bug at the very least. Thoughts?

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx