Re: Oops while rebalancing, now unmountable.

Chris Mason <chris.mason@xxxxxxxxxx> · Mon, 15 Nov 2010 13:46:02 -0500

Excerpts from Christoph Hellwig's message of 2010-11-15 13:23:14 -0500:
> On Sun, Nov 14, 2010 at 11:12:22PM +0100, Andrea Arcangeli wrote:
> > I just wrote above that it can happen upstream without THP. It's not
> > THP related at all. THP is the consumer, this is a problem in migrate
> > that will trigger as well with migrate_pages or all other possible
> > migration APIs.
> > 
> > If more people would be using hugetlbfs they would have noticed
> > without THP.
> 
> Okay, it seems THP is really just the messenger for bad VM practices
> here.
> 
> > +static int btree_migratepage(struct address_space *mapping,
> > +                       struct page *newpage, struct page *page)
> > +{
> > +       /*
> > +        * we can't safely write a btree page from here,
> > +        * we haven't done the locking hook
> > +        */
> > +       if (PageDirty(page))
> > +               return -EAGAIN;
> > 
> > fallback_migrate_page would call writeout() which is apparently not
> > ok in btrfs for locking issues leading to corruption.
> 
> Hmm, it seems the issue for that particular problem is indeedin btrfs.
> If it needs external locking for writing out data it should not
> implement ->writepage to start with.  Chris, can you explain what's
> going on with the btree code? It's pretty funny both in the
> btree_writepage which goes directly into extent_write_full_page
> if PF_MEMALLOC is not set, but otherwise does much more complicated
> work, and also in btree_writepages which skips various WB_SYNC_NONE,
> including the very weird check for for_kupdate.

So, I had THP + a patched btrfs running all weekend and I can safely say
I've fixed this one now. 

> 
> What's the story behing all this and the corruption that Andrea found?

For the metadata blocks, btrfs gets into a problematic lock inversion
where it needs to record that a block has been written so that it will
be properly recowed when someone tries to change it again.

Basically the rule for btree_writepage:

1) lock the extent buffer (different from the page)
2) mark the metadata block as written
3) lock the page
4) call writepage

Btrfs does this correctly everywhere it uses writepage, and everyone
else either uses writepages or is PF_MEMALLOC, except for the page
migration code, which just jumps to step 4.

So, my current fix adds a migrate page hook and adds a warning into the
code to make sure we protest loudly when the block isn't marked as
written.  Since this shakedown worked well, I'm changing the warning to
a BUG().

The check for kupdate in btree_writepages is different.  Once we write
something, we have to do a good amount of work in order to modify it
again.  The btrfs log commits make sure that we write metadata from time
to time, so we don't really need help from the flusher threads unless.

We also don't want to waste time writing metadata from
balance_dirty_pages.  It'll just make more allocations later as we
wander around and recow things, and it is much more likely to be seeky
than the file IO.  So we setup a threshold where we don't bother doing
metadata IO unless there is a good amount pending.

I'm fine with removing the metadata writepage entirely, it didn't use to
have this many rules and it seems like a better idea to have it not
there at all.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html