On Thu, 3 Sep 2020 19:36:26 +0200 David Hildenbrand <david@xxxxxxxxxx> wrote:

> (still on vacation, back next week on Tuesday)
>
> I didn't look into discussions in v1, but to me this looks like we are
> trying to hide an actual bug by implementing hacks in the caller
> (repeated calls to drain_all_pages()). What about alloc_contig_range()
> users - you get more allocation errors just because PCP code doesn't
> play along.
>
> There *is* strong synchronization with the page allocator - however,
> there seems to be one corner case race where we allow to allocate pages
> from isolated pageblocks.
>
> I want that fixed instead if possible, otherwise this is just an ugly
> hack to make the obvious symptoms (offlining looping forever) disappear.
>
> If that is not possible easily, I'd much rather want to see all
> drain_all_pages() calls being moved to the caller and have the expected
> behavior documented instead of specifying "there is no strong
> synchronization with the page allocator" - which is wrong in all but PCP
> cases (and there only in one possible race?).
>

It's a two-line hack which fixes a bug in -stable kernels, so I'm
inclined to proceed with it anyway. We can undo it later on as part of
a better fix, OK?

Unless you think there's some new misbehaviour which we might see as a
result of this approach?