Re: Oops while rebalancing, now unmountable.

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Mon, 15 Nov 2010 20:29:14 +0100

On Mon, Nov 15, 2010 at 02:12:04PM -0500, Christoph Hellwig wrote:
> I didn't even notice that, but the WB_SYNC_NONE does indeed seem
> buggy to me.  If we set the sync_mode to WB_SYNC_NONE, filesystems
> can and frequently do use trylock operations and might just skip
> writing it out completely.

Scary stuff. So WB_SYNC_NONE isn't guaranteed to submit the dirty part
of the page for I/O in a way that leaves it all clean once
wait_on_page_writeback returns? (Unless of course the dirty bit was
set again in the meantime.)

> So we definitely do need to change writeout to do a WB_SYNC_ALL
> writeback.  In addition to that we'll also need the
> wait_on_page_writeback call to make sure we actually wait for I/O
> to finish.

Ok, that's fine... I misread it, sorry. But the writeback must actually
be started, by WB_SYNC_NONE (or _ALL), for wait_on_page_writeback to be
effective.
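
To make the failure mode concrete, here is a toy userspace model of the
write-then-wait sequence. It is not kernel code: every toy_* name is
made up for illustration, and the assumption that a WB_SYNC_ALL style
write blocks on the contended lock while a WB_SYNC_NONE style write
skips the page is simply the behaviour described above, reduced to a
flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, NOT kernel code. All toy_* names are made up. */
enum toy_sync_mode { TOY_SYNC_NONE, TOY_SYNC_ALL };

struct toy_page {
	bool dirty;        /* page has data not yet submitted for I/O */
	bool writeback;    /* I/O is in flight */
	bool fs_lock_busy; /* a filesystem-internal lock is contended */
};

/*
 * Models ->writepage. With TOY_SYNC_NONE a contended lock makes the
 * filesystem skip the page and still report success; with TOY_SYNC_ALL
 * it is assumed to block on the lock and submit the I/O.
 */
static int toy_writepage(struct toy_page *page, enum toy_sync_mode mode)
{
	assert(page);
	if (page->fs_lock_busy && mode == TOY_SYNC_NONE)
		return 0;          /* skipped: page stays dirty, no I/O */
	page->dirty = false;
	page->writeback = true;    /* I/O submitted */
	return 0;
}

/* Models wait_on_page_writeback: returns once no I/O is in flight. */
static void toy_wait_on_page_writeback(struct toy_page *page)
{
	assert(page);
	page->writeback = false;   /* pretend the I/O completed */
}

/* Write the page out and wait; true means the page ended up clean. */
static bool toy_writeout_and_wait(struct toy_page *page,
				  enum toy_sync_mode mode)
{
	if (toy_writepage(page, mode) < 0)
		return false;
	toy_wait_on_page_writeback(page);
	return !page->dirty;
}
```

In this model, a dirty page with fs_lock_busy set is still dirty after
toy_writeout_and_wait(..., TOY_SYNC_NONE), because the wait has nothing
to wait for, while the same page ends up clean with TOY_SYNC_ALL.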

Migration will abort if ->writepage returns an error; that's safe
though. It will go on to call wait_on_page_writeback and retry only if
->writepage returns 0.

> Also what protects us from updating the page while we write it out?
> PG_writeback on many filesystems doesn't protect writes from modifying
> the in-flight buffer, and just locking the page after ->writepage
> is racy without a check that nothing changed.

Migration has already established migration ptes, so nobody can write
to the page through pagetables. The only thing left is O_DIRECT, which
is also taken care of by the page count check in
migrate_page_move_mapping, before the migrate_page called by
fallback_migrate_page can succeed. So nothing can be modifying the page
if we go ahead with migrate_page (and no pte dirty bit can be set
either). The page is also locked for the whole migration, so all write
syscalls should be stopped.
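
As a rough illustration of why the page count check covers O_DIRECT,
here is a toy userspace model. It is not the code in
migrate_page_move_mapping: the toy_* names are invented, and the
baseline number of references migration expects is passed in as a
parameter (base_refs) rather than claimed as the kernel's value.

```c
#include <assert.h>
#include <errno.h>

/* Toy userspace model, NOT kernel code. All toy_* names are made up. */
struct toy_page {
	int count;       /* total references currently held on the page */
	int has_private; /* 1 if filesystem private data holds a reference */
};

/*
 * Models the reference count check: migration may proceed only if
 * every reference is one it can account for. base_refs is whatever
 * the migration path itself expects to see (an assumption of this
 * sketch, passed in rather than hard-coded).
 */
static int toy_move_mapping(const struct toy_page *page, int base_refs)
{
	int expected;

	assert(page && base_refs > 0);
	expected = base_refs + page->has_private;
	if (page->count != expected)
		return -EAGAIN;  /* someone else, e.g. O_DIRECT, pins it */
	return 0;
}

/* Models an O_DIRECT user pinning the page for in-flight I/O. */
static void toy_pin_for_direct_io(struct toy_page *page)
{
	assert(page);
	page->count++;
}
```

With base_refs of 2 and a page holding 3 references, one of them from
private data, toy_move_mapping returns 0. After toy_pin_for_direct_io
the count no longer matches and it returns -EAGAIN, so the migration
backs off instead of copying a page that may still be written to.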

> kswapd is fine.  Other task allocation memory are direct reclaimers.
> Direct reclaim through the filesystem delalloc conversion and the I/O
> stack guarantees you stack overflows, that's why filesystems refuse
> to do anything in ->writepage for this case.  btrfs and XFS have
> explicit checks for PF_MEMALLOC (with a carve out for kswapd in XFS),
> and ext4 only writes already allocated blocks in ->writepage but never
> does delalloc conversions.

I didn't realize the stack overflow issue was specific to delalloc. I
think it's ok here to skip ->writepage for delalloc: it's not
mandatory, and memory compaction isn't supposed to do much I/O anyway,
it's supposed to copy ram instead. Sure, it'd be more reliable to
submit I/O, but it's going to work pretty well, plus compaction will be
retried later by khugepaged once every 10 sec. With THP, kswapd
actually will not do anything, because THP allocations are run with
__GFP_NO_KSWAPD. That avoids kswapd wasting cpu by trying hard in the
background to create hugepages when, say, 90% of ram goes to anonymous
memory (and there are background anon allocations that would wake up
kswapd) but only 80% can be allocated as 2M contiguous because 20% was
at some point allocated in slab caches.

In short, with THP it's khugepaged that is supposed to run the
->writepage in migrate.c, and it will run it once every 10 sec even
when it fails (and not in a 100% cpu wasting loop like kswapd), so if
you did something magic for kswapd in XFS you should do the same for
khugepaged too.
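
To show the shape of what I mean, here is a toy userspace sketch of a
->writepage gate with carve-outs. It is not the XFS code: the flag
values are arbitrary, and TOY_PF_KHUGEPAGED in particular is purely
hypothetical, standing in for whatever way a filesystem could
recognise khugepaged.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy userspace model, NOT kernel code. The flag values are arbitrary,
 * and TOY_PF_KHUGEPAGED is purely hypothetical.
 */
#define TOY_PF_MEMALLOC   0x1u /* task is allocating in reclaim context */
#define TOY_PF_KSWAPD     0x2u /* task is kswapd */
#define TOY_PF_KHUGEPAGED 0x4u /* task is khugepaged (hypothetical) */
#define TOY_PF_ALL (TOY_PF_MEMALLOC | TOY_PF_KSWAPD | TOY_PF_KHUGEPAGED)

/* Decide whether a writepage-like path should do real work. */
static bool toy_writepage_allowed(unsigned int task_flags)
{
	assert((task_flags & ~TOY_PF_ALL) == 0);

	if (!(task_flags & TOY_PF_MEMALLOC))
		return true;  /* ordinary writeback context */
	/* reclaim context: only the dedicated daemons may do the write */
	return (task_flags & (TOY_PF_KSWAPD | TOY_PF_KHUGEPAGED)) != 0;
}
```

A plain direct reclaimer (TOY_PF_MEMALLOC alone) is refused, while the
same flag combined with either daemon flag is let through. The idea is
that the daemons run on their own stack rather than deep inside some
other task's allocation path, which is the assumption behind treating
them differently.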