On Fri, Jul 12, 2019 at 01:20:56PM +0200, Jan Kara wrote:
> On Fri 12-07-19 11:10:43, Mel Gorman wrote:
> > On Fri, Jul 12, 2019 at 11:17:46AM +0200, Jan Kara wrote:
> > > On Thu 11-07-19 17:04:55, Andrew Morton wrote:
> > > > On Thu, 11 Jul 2019 14:58:38 +0200 Jan Kara <jack@xxxxxxx> wrote:
> > > > 
> > > > > buffer_migrate_page_norefs() can race with bh users in the following way:
> > > > > 
> > > > > CPU1                                        CPU2
> > > > > buffer_migrate_page_norefs()
> > > > >   buffer_migrate_lock_buffers()
> > > > >   checks bh refs
> > > > >   spin_unlock(&mapping->private_lock)
> > > > >                                             __find_get_block()
> > > > >                                               spin_lock(&mapping->private_lock)
> > > > >                                               grab bh ref
> > > > >                                               spin_unlock(&mapping->private_lock)
> > > > >   move page                                  do bh work
> > > > > 
> > > > > This can result in various issues like lost updates to buffers (i.e.
> > > > > metadata corruption) or use-after-free issues for the old page.
> > > > > 
> > > > > Closing this race window is relatively difficult. We could hold
> > > > > mapping->private_lock in buffer_migrate_page_norefs() until we are
> > > > > finished with migrating the page, but the lock hold times would be
> > > > > rather long. So let's revert to a more careful variant of page migration
> > > > > requiring eviction of buffers on the migrated page. This is effectively
> > > > > fallback_migrate_page() that additionally invalidates bh LRUs in case
> > > > > try_to_free_buffers() failed.
> > > > 
> > > > Is this premature optimization? Holding ->private_lock while messing
> > > > with the buffers would be the standard way of addressing this. The
> > > > longer hold times *might* be an issue, but we don't know this, do we?
> > > > If there are indeed such problems then they could be improved by, say,
> > > > doing more of the newpage preparation prior to taking ->private_lock.
> > > 
> > > I didn't check how long the private_lock hold times would actually be; it
> > > just seems there's a lot of work done before the page is fully migrated
> > > and we could release the lock. And since the lock blocks bh lookup,
> > > set_page_dirty(), etc. for the whole device, it just seemed like a bad
> > > idea. I don't think much of the newpage setup can be moved outside of
> > > private_lock - in particular page cache replacement, page copying and
> > > page state migration all need to be there so that bh code doesn't get
> > > confused.
> > > 
> > > But I guess it's fair to measure at least ballpark numbers of what the
> > > lock hold times would be to get an idea of whether the contention concern
> > > is substantiated or not.
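
[ Side note for anyone reading this in the archive without the source to
  hand: the reason the lock blocks bh lookups for the whole device is that
  the lookup side for any block on the bdev takes the same per-device lock
  before it pins a buffer. Roughly, as a simplified sketch of the
  __find_get_block() slow path rather than the literal fs/buffer.c code:

        /* bd_mapping is the block device's address_space */
        page = find_get_page_flags(bd_mapping, index, FGP_ACCESSED);
        ...
        spin_lock(&bd_mapping->private_lock);
        if (page_has_buffers(page)) {
                bh = head = page_buffers(page);
                do {
                        if (buffer_mapped(bh) && bh->b_blocknr == block) {
                                get_bh(bh);     /* ref taken under private_lock */
                                break;
                        }
                        bh = bh->b_this_page;
                } while (bh != head);
        }
        spin_unlock(&bd_mapping->private_lock);

  So if migration keeps holding private_lock from the b_count recheck until
  the buffers have been moved over to the new page, a racing lookup either
  waits on the lock or finds the buffers already detached from the old
  page; it can no longer pin a bh in the middle of the copy. ]
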
> > I think it would be tricky to measure and quantify how much the contention
> > is an issue. While it would be possible to construct a microbenchmark that
> > should illustrate the problem, it would tell us relatively little about
> > how much of a problem it is generally. It would be relatively difficult
> > to detect the contention, and stalls in block lookups due to migration
> > would be tricky to spot. Careful use of lock_stat might help but
> > enabling that has consequences of its own.
> > 
> > However, a rise in allocation failures due to dirty pages not being
> > migrated is relatively easy to detect and the consequences are relatively
> > benign -- a failed high-order allocation that is usually ok versus a stall
> > on block lookups that could have a wider general impact.
> > 
> > On that basis, I think the patch you proposed is the more appropriate as
> > a fix for the race which has the potential for data corruption. So;
> > 
> > Acked-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> 
> Thanks. And I agree with you that detecting failed migrations is generally
> easier than detecting private_lock contention. Anyway, out of curiosity, I
> did run the thpfioscale workload in mmtests with some additional metadata
> workload on the system to increase the proportion of bdev page cache, and
> added tracepoints to see how long the relevant part of
> __buffer_migrate_page() lasts (patch attached). The longest duration of
> the critical section was 311 us, which is significant, but that was an
> outlier by far. Most of the time the critical section lasted a couple of
> us. The full histogram (bucket boundaries in seconds) is here:
> 
> [min - 0.000006]:        2907   93.202950%
> (0.000006 - 0.000011]:    105    3.366464%
> (0.000011 - 0.000016]:     36    1.154216%
> (0.000016 - 0.000021]:     45    1.442770%
> (0.000021 - 0.000026]:     13    0.416800%
> (0.000026 - 0.000031]:      4    0.128246%
> (0.000031 - 0.000036]:      2    0.064123%
> (0.000036 - 0.000041]:      2    0.064123%
> (0.000041 - 0.000046]:      1    0.032062%
> (0.000046 - 0.000050]:      1    0.032062%
> (0.000050 - 0.000055]:      0    0.000000%
> (0.000055 - 0.000060]:      0    0.000000%
> (0.000060 - 0.000065]:      0    0.000000%
> (0.000065 - 0.000070]:      0    0.000000%
> (0.000070 - 0.000075]:      2    0.064123%
> (0.000075 - 0.000080]:      0    0.000000%
> (0.000080 - 0.000085]:      0    0.000000%
> (0.000085 - 0.000090]:      0    0.000000%
> (0.000090 - 0.000095]:      0    0.000000%
> (0.000095 - max]:           1    0.032062%
> 
> So although I still think that just failing the migration if we cannot
> invalidate buffer heads is a safer choice, just extending the private_lock
> protected section does not seem as bad as I feared.
> 

That does not seem too bad and your revised patch looks functionally fine.
I'd leave out the tracepoints though, because a perf probe would have got
roughly the same data and the tracepoint may be too specific to track
another class of problem. Whether or not the tracepoint survives, and with
a changelog added;

Acked-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>

Andrew, which version do you want to go with, the original version or this
one that holds private_lock for slightly longer during migration?
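
For anyone skimming the thread, this is roughly what the check_refs path of
__buffer_migrate_page() ends up looking like with the revised approach
applied. It is condensed by hand, with the tracepoints and the unchanged
tail of the function trimmed, so treat it as a sketch rather than the exact
resulting code:

recheck_buffers:
        busy = false;
        spin_lock(&mapping->private_lock);
        bh = head;
        do {
                if (atomic_read(&bh->b_count)) {
                        busy = true;
                        break;
                }
                bh = bh->b_this_page;
        } while (bh != head);
        if (busy) {
                if (invalidated) {
                        rc = -EAGAIN;
                        goto unlock_buffers;    /* lock dropped below */
                }
                spin_unlock(&mapping->private_lock);
                invalidate_bh_lrus();
                invalidated = true;
                goto recheck_buffers;
        }

        /*
         * private_lock is still held here, so __find_get_block() cannot
         * grab a new bh reference while the mapping is replaced, the
         * buffers are moved to the new page and the data is copied.
         */
        rc = migrate_page_move_mapping(mapping, newpage, page, mode, 0);
        if (rc != MIGRATEPAGE_SUCCESS)
                goto unlock_buffers;
        /* ... move buffers to newpage, copy data and page state ... */
        rc = MIGRATEPAGE_SUCCESS;
unlock_buffers:
        if (check_refs)
                spin_unlock(&mapping->private_lock);
        /* ... unlock_buffer()/put_bh() each bh as before ... */

In other words, essentially the only functional change relative to the
current code is that private_lock is held across the page replacement and
copy instead of being dropped straight after the b_count recheck.
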
> From fd584fb6fa6d48e1fae1077d2ab0021ae9c98edf Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@xxxxxxx>
> Date: Wed, 10 Jul 2019 11:31:01 +0200
> Subject: [PATCH] mm: migrate: Fix race with __find_get_block()
> 
> Signed-off-by: Jan Kara <jack@xxxxxxx>
> ---
>  include/trace/events/migrate.h | 42 ++++++++++++++++++++++++++++++++++++++++++
>  mm/migrate.c                   |  8 +++++++-
>  2 files changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> index 705b33d1e395..15473a508216 100644
> --- a/include/trace/events/migrate.h
> +++ b/include/trace/events/migrate.h
> @@ -70,6 +70,48 @@ TRACE_EVENT(mm_migrate_pages,
>  		__print_symbolic(__entry->mode, MIGRATE_MODE),
>  		__print_symbolic(__entry->reason, MIGRATE_REASON))
>  );
> +
> +TRACE_EVENT(mm_migrate_buffers_begin,
> +
> +	TP_PROTO(struct address_space *mapping, unsigned long index),
> +
> +	TP_ARGS(mapping, index),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, mapping)
> +		__field(unsigned long, index)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mapping = (unsigned long)mapping;
> +		__entry->index = index;
> +	),
> +
> +	TP_printk("mapping=%lx index=%lu", __entry->mapping, __entry->index)
> +);
> +
> +TRACE_EVENT(mm_migrate_buffers_end,
> +
> +	TP_PROTO(struct address_space *mapping, unsigned long index, int rc),
> +
> +	TP_ARGS(mapping, index, rc),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, mapping)
> +		__field(unsigned long, index)
> +		__field(int, rc)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mapping = (unsigned long)mapping;
> +		__entry->index = index;
> +		__entry->rc = rc;
> +	),
> +
> +	TP_printk("mapping=%lx index=%lu rc=%d", __entry->mapping, __entry->index, __entry->rc)
> +);
> +
> +
>  #endif /* _TRACE_MIGRATE_H */
> 
>  /* This part must be outside protection */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e9594bc0d406..bce0f1b03a60 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -763,6 +763,7 @@ static int __buffer_migrate_page(struct address_space *mapping,
>  recheck_buffers:
>  	busy = false;
>  	spin_lock(&mapping->private_lock);
> +	trace_mm_migrate_buffers_begin(mapping, page->index);
>  	bh = head;
>  	do {
>  		if (atomic_read(&bh->b_count)) {
> @@ -771,12 +772,13 @@ static int __buffer_migrate_page(struct address_space *mapping,
>  		}
>  		bh = bh->b_this_page;
>  	} while (bh != head);
> -	spin_unlock(&mapping->private_lock);
>  	if (busy) {
>  		if (invalidated) {
>  			rc = -EAGAIN;
>  			goto unlock_buffers;
>  		}
> +		trace_mm_migrate_buffers_end(mapping, page->index, -EAGAIN);
> +		spin_unlock(&mapping->private_lock);
>  		invalidate_bh_lrus();
>  		invalidated = true;
>  		goto recheck_buffers;
>  	}
> @@ -809,6 +811,10 @@ static int __buffer_migrate_page(struct address_space *mapping,
> 
>  	rc = MIGRATEPAGE_SUCCESS;
>  unlock_buffers:
> +	if (check_refs) {
> +		trace_mm_migrate_buffers_end(mapping, page->index, rc);
> +		spin_unlock(&mapping->private_lock);
> +	}
>  	bh = head;
>  	do {
>  		unlock_buffer(bh);
> -- 
> 2.16.4
> 

-- 
Mel Gorman
SUSE Labs