On Sun, 24 May 2020, Konstantin Khlebnikov wrote: > On 24/05/2020 04.01, Hugh Dickins wrote: > > On Wed, 13 May 2020, Konstantin Khlebnikov wrote: > > > > > Function isolate_migratepages_block() runs some checks out of lru_lock > > > when choose pages for migration. After checking PageLRU() it checks extra > > > page references by comparing page_count() and page_mapcount(). Between > > > these two checks page could be removed from lru, freed and taken by slab. > > > > > > As a result this race triggers VM_BUG_ON(PageSlab()) in page_mapcount(). > > > Race window is tiny. For certain workload this happens around once a > > > year. > > > > Around once a year, that was my guess too. I have no record of us ever > > hitting this, but yes it could happen when you have CONFIG_DEBUG_VM=y > > (which I too like to run with, but would not recommend for users). > > Yep, but for large cluster and pinpointed workload this happens surprisingly > frequently =) I've believed into this race only after seeing statistics for > count of compactions and how it correlates with incidents. > > Probably the key component is a slab allocation from network irq/bh context > which interrupts compaction exactly at this spot. Yes, I bet you're right. > > > > > > > > > > > > page:ffffea0105ca9380 count:1 mapcount:0 mapping:ffff88ff7712c180 > > > index:0x0 compound_mapcount: 0 > > > flags: 0x500000000008100(slab|head) > > > raw: 0500000000008100 dead000000000100 dead000000000200 > > > ffff88ff7712c180 > > > raw: 0000000000000000 0000000080200020 00000001ffffffff > > > 0000000000000000 > > > page dumped because: VM_BUG_ON_PAGE(PageSlab(page)) > > > ------------[ cut here ]------------ > > > kernel BUG at ./include/linux/mm.h:628! > > > invalid opcode: 0000 [#1] SMP NOPTI > > > CPU: 77 PID: 504 Comm: kcompactd1 Tainted: G W > > > 4.19.109-27 #1 > > > Hardware name: Yandex T175-N41-Y3N/MY81-EX0-Y3N, BIOS R05 06/20/2019 > > > RIP: 0010:isolate_migratepages_block+0x986/0x9b0 > > > > > > > > > To fix just opencode page_mapcount() in racy check for 0-order case and > > > recheck carefully under lru_lock when page cannot escape from lru. > > > > > > Also add checking extra references for file pages and swap cache. > > > > > > Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> > > > Fixes: 119d6d59dcc0 ("mm, compaction: avoid isolating pinned pages") > > > > Not really, that commit was correct at the time it went in. > > > > > Fixes: 1d148e218a0d ("mm: add VM_BUG_ON_PAGE() to page_mapcount()") > > > > Exactly, that commit was well-intentioned, but did not allow for this > > (admittedly very exceptional) usage. How many developers actually > > make the mistake of applying page_mapcount() to their slab pages? > > None, I expect. That VM_BUG_ON_PAGE() is there for documentation, > > and could just be replaced by a comment - and Linus would be happy > > with that. > > Ok, I'll redo the fix in this way. Thanks. > > > > > > --- > > > mm/compaction.c | 17 +++++++++++++---- > > > 1 file changed, 13 insertions(+), 4 deletions(-) > > > > > > diff --git a/mm/compaction.c b/mm/compaction.c > > > index 46f0fcc93081..91bb87fd9420 100644 > > > --- a/mm/compaction.c > > > +++ b/mm/compaction.c > > > @@ -935,12 +935,16 @@ isolate_migratepages_block(struct compact_control > > > *cc, unsigned long low_pfn, > > > } > > > /* > > > - * Migration will fail if an anonymous page is pinned in > > > memory, > > > + * Migration will fail if an page is pinned in memory, > > > * so avoid taking lru_lock and isolating it unnecessarily in > > > an > > > - * admittedly racy check. > > > + * admittedly racy check simplest case for 0-order pages. > > > + * > > > + * Open code page_mapcount() to avoid > > > VM_BUG_ON(PageSlab(page)). > > > > But open coding page_mapcount() is not all that you did. You have > > (understandably) chosen to avoid calling page_mapping(page), but... > > > > > + * Page could have extra reference from mapping or swap > > > cache. > > > */ > > > - if (!page_mapping(page) && > > > - page_count(page) > page_mapcount(page)) > > > + if (!PageCompound(page) && > > > + page_count(page) > atomic_read(&page->_mapcount) + 1 + > > > + (!PageAnon(page) || PageSwapCache(page))) > > > goto isolate_fail; > > > > Isn't that test going to send all the file cache pages with buffer heads > > in page->private, off to isolate_fail when they're actually great > > candidates for migration? > > Yes. What a shame. Adding page_has_private() could fix that? > > Kind of > > page_count(page) > page_mapcount(page) + > (PageAnon(page) ? PageSwapCache(page) : (1 + page_has_private(page))) Certainly it was fixable, but I'm too lazy to want to think through the correct answer; and though I'm often out of sympathy with helper functions (why do people want an inline bool function for every simple flag test?!?!), here is a place that cries out for a helper, if you complicate it beyond page_count > page_mapcount (especially when driven into that detail of adding 1 to _mapcount). > > or probably something like this: > > page_count(page) > page_mapcount(page) + > (PageAnon(page) ? PageSwapCache(page) : GUP_PIN_COUNTING_BIAS) > > I.e. skip only file pages pinned by dma or something slower. > I see some movements in this direction in recent changes. > > of course that's independent matter. Yes, once the gup/pin conversion is widespread, I expect that it will allow a better implementation of this compaction test, one not limited to the anonymous pages. (We do internally use a patch extending the current test to file pages, which in practice has saved a lot of time wasted on failing compactions: but, last I looked anyway, it gets some cases wrong - cases we happen not to care about ourselves, but would be unacceptable upstream. So I hope the distinction of pinned pages will work out well here later.) Hugh