On Tue, Aug 18, 2020 at 09:49:24AM +0200, David Hildenbrand wrote:
> On 18.08.20 01:34, Minchan Kim wrote:
> > On Mon, Aug 17, 2020 at 06:44:50PM +0200, David Hildenbrand wrote:
> >> On 17.08.20 18:30, Minchan Kim wrote:
> >>> On Mon, Aug 17, 2020 at 05:45:59PM +0200, David Hildenbrand wrote:
> >>>> On 17.08.20 17:27, Minchan Kim wrote:
> >>>>> On Sun, Aug 16, 2020 at 02:31:22PM +0200, David Hildenbrand wrote:
> >>>>>> On 14.08.20 19:31, Minchan Kim wrote:
> >>>>>>> There is a need for special HW that requires bulk allocation of
> >>>>>>> high-order pages. For example, 4800 * order-4 pages.
> >>>>>>>
> >>>>>>> To meet the requirement, one option is using a CMA area, because
> >>>>>>> the page allocator with compaction under memory pressure easily
> >>>>>>> fails to meet the requirement and is too slow for 4800 calls.
> >>>>>>> However, CMA also has the following drawbacks:
> >>>>>>>
> >>>>>>> * 4800 order-4 cma_alloc calls are too slow
> >>>>>>>
> >>>>>>> To avoid the slowness, we could try to allocate 300M of contiguous
> >>>>>>> memory once and then split it into order-4 chunks.
> >>>>>>> The problem with this approach is that the CMA allocation fails if
> >>>>>>> one of the pages in that range couldn't migrate out, which happens
> >>>>>>> easily with fs writes under memory pressure.
> >>>>>>
> >>>>>> Why not choose a value in between? Like trying to allocate
> >>>>>> MAX_ORDER - 1 chunks and splitting them. That would already heavily
> >>>>>> reduce the call frequency.
> >>>>>
> >>>>> I think you meant this:
> >>>>>
> >>>>>     alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER - 1)
> >>>>>
> >>>>> It would work if the system has lots of non-fragmented free memory.
> >>>>> However, once it is fragmented, it doesn't work. That's why we have
> >>>>> easily seen even order-4 allocation failures in the field, and
> >>>>> that's why CMA was there.
> >>>>>
> >>>>> CMA has more logic to isolate the memory during allocation/freeing,
> >>>>> as well as fragmentation avoidance, so that it has less chance of
> >>>>> being stolen by others and a higher success ratio. That's why I want
> >>>>> this API to be used with CMA or the movable zone.
> >>>>
> >>>> I was talking about doing MAX_ORDER - 1 CMA allocations instead of one
> >>>> big 300M allocation. As you correctly note, memory placed into CMA
> >>>> should be movable, except for (short/long) term pinnings. In these
> >>>> cases, doing allocations smaller than 300M and splitting them up should
> >>>> be good enough to reduce the call frequency, no?
> >>>
> >>> I should have written that. The 300M I mentioned is really the minimum
> >>> size. In some scenarios, we need way bigger than 300M, up to several
> >>> GB. Furthermore, the demand will increase in the near future.
> >>
> >> And what will the driver do with that data besides providing it to the
> >> device? Can it be mapped to user space? I think we really need more
> >> information / the actual user.
> >>
> >>>>
> >>>>>
> >>>>> A usecase is a device that sets an exclusive CMA area up when the
> >>>>> system boots. When the device needs 4800 * order-4 pages, it could
> >>>>> call this bulk API against the area so that it is effectively
> >>>>> guaranteed to allocate enough pages fast.
> >>>>
> >>>> Just wondering
> >>>>
> >>>> a) Why does it have to be fast?
> >>>
> >>> That's because it's related to application latency, which ends up
> >>> making the user feel bad.
> >>
> >> Okay, but in theory, your device-needs are very similar to
> >> application-needs, besides you requiring order-4 pages, correct? Similar
> >> to an application that starts up and pins 300M (or more), just with
> >> order-4 pages.
> >
> > Yes.
> >
> >>
> >> I don't get quite yet why you need a range allocator for that. Because
> >> you intend to use CMA?
> >
> > Yes, with CMA it could be better guaranteed and fast enough with a
> > little tweaking. Currently, CMA is too slow due to the IPI overheads
> > below:
> >
> > 1. set_migratetype_isolate does drain_all_pages for every pageblock.
> > 2. __alloc_contig_migrate_range does migrate_prep
> > 3. alloc_contig_range does lru_add_drain_all.
> >
> > Thus, if we increase the call frequency as you suggest, the setup
> > overhead also scales up with the size. Such overhead makes sense if
> > the caller requests a big contiguous chunk of memory, but it's too
> > much for normal high-order allocations.
> >
> > Maybe we could optimize those call sites to reduce or remove those IPI
> > calls in a smarter way, but that would have to deal with the
> > success-ratio vs. fastness trade-off in the end.
> >
> > Another concern with using the existing cma API is that it tries to
> > make the allocation succeed at the cost of latency. For example,
> > waiting for page writeback.
> >
> > That's where this new API semantic comes from, as a compromise, since
> > I believe we need some way to separate the original CMA alloc (biased
> > to be guaranteed but slower) from this new API (biased to be fast but
> > less guaranteed).
> >
> > Is there any idea that works without tweaking the existing cma API?
>
> Let me try to summarize:
>
> 1. Your driver needs a lot of order-4 pages. And it needs them fast,
> because of observable lag/delay in an application. The pages will be
> unmovable by the driver.
>
> 2. Your idea is to use CMA, as that avoids unmovable allocations,
> theoretically allowing you to allocate all memory. But you don't
> actually want a large contiguous memory area.
>
> 3. Doing a whole bunch of order-4 cma allocations is slow.
>
> 4. Doing a single large cma allocation and splitting it manually in the
> caller can fail easily due to temporary page pinnings.
>
>
> Regarding 4., [1] comes to mind, which has the same issues with
> temporary page pinnings and solves it by simply retrying. Yeah, there
> will be some lag, but maybe it's overall faster than doing separate
> order-4 cma allocations?

Thanks for the pointer. However, temporary pinning is not the only
reason CMA allocations fail. Historically, there have been various
potential problems that turn "temporary" into "non-temporary", like
page writeback and indirect dependencies between objects.

>
> In general, proactive compaction [2] comes to mind, does that help?

I think it makes sense if such high-order allocations are dominant in
the system workload, because the TLB benefit would outweigh the cost of
the frequent migration overhead. However, that's not our usecase.

>
> [1]
> https://lore.kernel.org/r/1596682582-29139-2-git-send-email-cgoldswo@xxxxxxxxxxxxxx/
> [2] https://nitingupta.dev/post/proactive-compaction/
>

I understand the pfn stuff in the API is not pretty, but the concept
makes sense to me in that it goes through the *migratable area* and
tries hard to get pages of the requested order. It looks like a
GFP_NORETRY version of kmem_cache_alloc_bulk.

How about this?

    int cma_alloc(struct cma *cma, int order, unsigned int nr_elem,
                  struct page **pages);
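
To make the semantics I have in mind concrete, here is a rough sketch,
illustration only: alignment handling is omitted, cma->base_pfn and
cma->count are the area bounds as in mm/cma.h, and a real version would
batch the isolation/drain setup once per scan instead of paying the
per-call IPI cost for every chunk:

    int cma_alloc(struct cma *cma, int order, unsigned int nr_elem,
                  struct page **pages)
    {
            unsigned long end = cma->base_pfn + cma->count;
            unsigned long pfn = cma->base_pfn;
            unsigned int nr = 0;

            /*
             * Walk the area in (1 << order) steps and take whatever
             * chunks can be grabbed right now; skip busy spots instead
             * of retrying or waiting (e.g. on page writeback).
             */
            while (pfn + (1 << order) <= end && nr < nr_elem) {
                    if (!alloc_contig_range(pfn, pfn + (1 << order),
                                            MIGRATE_CMA,
                                            GFP_KERNEL | __GFP_NORETRY))
                            pages[nr++] = pfn_to_page(pfn);
                    pfn += 1 << order;
            }

            /* Like kmem_cache_alloc_bulk: report how many we got. */
            return nr;
    }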
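
From the driver side, usage could then look roughly like this, assuming
the return value is the number of elements actually allocated (dev_cma
is just a placeholder for the device's reserved area):

    struct page **pages;
    int i, nr;

    pages = kvcalloc(4800, sizeof(*pages), GFP_KERNEL);
    if (!pages)
            return -ENOMEM;

    /* Fast path: try to grab all 4800 order-4 chunks in one call. */
    nr = cma_alloc(dev_cma, 4, 4800, pages);
    if (nr < 4800) {
            /*
             * Less guaranteed by design: give back what we got and
             * fall back to a slower, more reliable path.
             */
            for (i = 0; i < nr; i++)
                    cma_release(dev_cma, pages[i], 1 << 4);
            kvfree(pages);
            return -EAGAIN;
    }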