Re: [LSF/MM TOPIC] COWing writeback pages

Sage Weil <sage@xxxxxxxxxxxx> · Fri, 10 Feb 2012 12:49:50 -0800 (PST)

On Fri, 10 Feb 2012, Josef Bacik wrote:
> On Fri, Feb 10, 2012 at 11:25:27AM -0800, Sage Weil wrote:
> > Hi everyone,
> > 
> > The takeaway from the 'stable pages' discussions in the last few workshops 
> > was that pages under writeback should remain locked so that subsequent 
> > writers don't touch them while they are en route to the disk.  This 
> > prevents bad checksums and DIF/DIX type failures (whereas previously we 
> > didn't really care whether old or new data reached the disk).
> > 
> > The fear is/was that anyone subsequently modifying the page will have to 
> > wait for writeback io to complete before continuing.  I seem to remember 
> > somebody (Martin?) saying that in practice, under "real" workloads, that 
> > doesn't actually happen, so don't worry about it.  (Does anyone remember 
> > the details of what testing led to that conclusion?)
> > 
> > Anyway, we are seeing what looks like an analogous problem with btrfs, 
> > where operations sometimes block waiting for writeback of the btree pages.  
> > Although the 'keep rewriting the same page' pattern may not be prevalent 
> > in normal file workloads, it does seem to happen with the btrfs btree.
> > 
> > The obvious solution seems to be to COW the page if it is under writeback 
> > and we want to remodify it.  Presumably that can be done just in btrfs, to 
> > address the btrfs-specific symptoms we're hitting, but I'm interested in 
> > hearing from other folks about whether it's more generally useful VM 
> > functionality for other filesystems and other workloads.
> > 
> > Unfortunately, we haven't been able to pinpoint the exact scenarios under 
> > which this triggers under btrfs.  We regularly see long stalls for 
> > metadata operations (create() and similar metadata-only operations) that 
> > block after btrfs_commit_transaction has "finished" the previous 
> > transaction and is doing
> > 
> > 		return filemap_write_and_wait(btree_inode->i_mapping);
> > 
> > What we're less clear about is when btrfs will modify the in-memory page 
> > in place (and thus wait) versus COWing the page... still digging into this 
> > now.
> > 
> 
> Heh, so I'm working on this now, specifically for the heavy create() workload, and
> I've just about got it nailed down.  A lot of this problem comes from relying on the
> normal pagecache for our metadata, so I'm copying xfs and creating our own
> caching.
> 
> The thing is, since we have an inode hanging out with normal pagecache pages, we
> can have multiple people trying to write out dirty pages in our inode at the
> same time.  Since that goes through our normal write path, we end up in this
> case where we're waiting on writeback for pages we won't actually end up writing
> out.  My code will fix this, if we're talking about the same problem ;).

Oh, I hadn't thought of that... that sounds like a similar but slightly 
different problem, since it probably wouldn't correlate with the 
filemap_write_and_wait().  As long as we don't have a btree update waiting 
on btree writeback, though, both problems should be addressed.
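
To make the COW-on-writeback idea discussed above concrete, here is a
minimal userspace sketch.  It is a toy model, not btrfs or VM code:
toy_page, toy_slot, toy_get_writable() and toy_end_writeback() are all
hypothetical names, and it ignores locking, page references and
remapping, which a real implementation would have to handle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical stand-in for a page-cache page (not the kernel's struct page). */
struct toy_page {
	bool under_writeback;		/* I/O in flight: contents must stay stable */
	unsigned char data[TOY_PAGE_SIZE];
};

/* Hypothetical stand-in for one slot in a mapping. */
struct toy_slot {
	struct toy_page *live;		/* page that new writers see */
	struct toy_page *in_flight;	/* old copy still owned by the I/O, or NULL */
};

/*
 * Return a page that is safe to modify.  If the live page is under
 * writeback, copy it and point the slot at the copy instead of waiting.
 * Returns NULL if allocation fails or if an older copy is still in
 * flight (this toy model tracks only one; the caller would have to wait).
 */
struct toy_page *toy_get_writable(struct toy_slot *slot)
{
	struct toy_page *old = slot->live;
	struct toy_page *copy;

	assert(old != NULL);
	if (!old->under_writeback)
		return old;		/* no I/O in flight: modify in place */
	if (slot->in_flight)
		return NULL;

	copy = malloc(sizeof(*copy));
	if (!copy)
		return NULL;
	memcpy(copy->data, old->data, TOY_PAGE_SIZE);
	copy->under_writeback = false;

	slot->in_flight = old;		/* the I/O keeps the stable old copy */
	slot->live = copy;		/* writers continue on the new copy */
	return copy;
}

/* Called when the outstanding I/O for this slot completes. */
void toy_end_writeback(struct toy_slot *slot)
{
	if (slot->in_flight) {
		free(slot->in_flight);	/* old copy is no longer needed */
		slot->in_flight = NULL;
	} else {
		slot->live->under_writeback = false;
	}
}
```

The writer never blocks in the common case; the cost is one extra page
and a copy for each page that is remodified while under writeback.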

In any case, we're definitely interested in checking out the code when 
it's ready to share!

sage