> On 03.09.2020 at 21:31, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, 3 Sep 2020 19:36:26 +0200 David Hildenbrand <david@xxxxxxxxxx> wrote:
>
>> (still on vacation, back next week on Tuesday)
>>
>> I didn't look into discussions in v1, but to me this looks like we are
>> trying to hide an actual bug by implementing hacks in the caller
>> (repeated calls to drain_all_pages()). What about alloc_contig_range()
>> users - you get more allocation errors just because PCP code doesn't
>> play along.
>>
>> There *is* strong synchronization with the page allocator - however,
>> there seems to be one corner case race where we allow to allocate pages
>> from isolated pageblocks.
>>
>> I want that fixed instead if possible, otherwise this is just an ugly
>> hack to make the obvious symptoms (offlining looping forever) disappear.
>>
>> If that is not possible easily, I'd much rather want to see all
>> drain_all_pages() calls being moved to the caller and have the expected
>> behavior documented instead of specifying "there is no strong
>> synchronization with the page allocator" - which is wrong in all but PCP
>> cases (and there only in one possible race?).
>>
>
> It's a two-line hack which fixes a bug in -stable kernels, so I'm
> inclined to proceed with it anyway. We can undo it later on as part of
> a better fix, OK?

Agreed as a stable fix, but I really want to see a proper fix (e.g.,
disabling PCP while isolated pageblocks exist) on top.

>
> Unless you think there's some new misbehaviour which we might see as a
> result of this approach?
>

We basically disable PCP by continuously flushing it. But performance
shouldn't matter.