On Tue, Aug 10, 2021 at 01:50:57PM -0700, Dave Hansen wrote: > On 8/10/21 11:13 AM, Dave Hansen wrote: > >> @@ -1001,6 +1004,9 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone, > >> if (page_reported(page)) > >> __ClearPageReported(page); > >> > >> + if (PageOffline(page)) > >> + clear_page_offline(page, order); > >> + > >> list_del(&page->lru); > >> __ClearPageBuddy(page); > >> set_page_private(page, 0); > > So, this is right in the fast path of the page allocator. It's a > > one-time thing per 2M page, so it's not permanent. > > > > *But* there's both a global spinlock and a firmware call hidden in > > clear_page_offline(). That's *GOT* to hurt if you were, for instance, > > running a benchmark while this code path is being tickled. Not just to > > > > That could be just downright catastrophic for scalability, albeit > > temporarily. > > One more thing... > > How long are these calls? You have to make at least 512 calls into the > SEAM module. Assuming they're syscall-ish, so ~1,000 cycles each, > that's ~500,000 cycles, even if we ignore the actual time it takes to > zero that 2MB worth of memory and all other overhead within the SEAM module. I hope to get away with 2 calls per 2M: one MapGPA and one TDACCEPTPAGE (or 3 for MAXORDER -- 4M -- pages). I don't have any numbers yet. > So, we're sitting on one CPU with interrupts off, blocking all the other > CPUs from doing page allocation in this zone. I agree that's not good. Let's see if it's going to be okay with accepting in 2M chunks. > Then, we're holding a global lock which prevents any other NUMA nodes > from accepting pages. Looking at this again, the global lock is aviodable: the caller owns the pfn range so nobody can touch these bits in the bitmap. We can replace bitmap_clear() with atomic clear_bit() loop and drop the lock completely. > If the other node happens to *try* to do an > accept, it will sit with its zone lock held waiting for this one. > Maybe nobody will ever notice. But, it seems like an awfully big risk > to me. I'd at least *try* do these calls outside of the zone lock. > Then the collateral damage will at least be limited to things doing > accepts rather than all zone->lock users. > > Couldn't we delay the acceptance to, say the place where we've dropped > the zone->lock and do the __GFP_ZERO memset() like at prep_new_page()? > Or is there some concern that the page has been split at that point? It *will* be split by the point. Like if you ask for order-0 page and you don't any left page allocator will try higher orders until finds anything. On order-9 it would hit unaccepted. At that point the page going to split and put on the free lists accordingly. That's all happens under zone lock. __rmqueue_smallest -> del_page_from_free_list() expand() > I guess that makes it more complicated because you might have a 4k page > but you need to go accept a 2M page. You might end up having to check > the bitmap 511 more times because you might see 511 more PageOffline() > pages come through. > > You shouldn't even need the bitmap lock to read since it's a one-way > trip from unaccepted->accepted. Yeah. Unless we don't want to flip it back on making the range share. I think we do. Otherwise it will cause problems for kexec. -- Kirill A. Shutemov