On Tue, Aug 18, 2020 at 09:49:24AM +0200, David Hildenbrand wrote:
> On 18.08.20 01:34, Minchan Kim wrote:
> > On Mon, Aug 17, 2020 at 06:44:50PM +0200, David Hildenbrand wrote:
> >> On 17.08.20 18:30, Minchan Kim wrote:
> >>> On Mon, Aug 17, 2020 at 05:45:59PM +0200, David Hildenbrand wrote:
> >>>> On 17.08.20 17:27, Minchan Kim wrote:
> >>>>> On Sun, Aug 16, 2020 at 02:31:22PM +0200, David Hildenbrand wrote:
> >>>>>> On 14.08.20 19:31, Minchan Kim wrote:
> >>>>>>> There is a need for special HW that requires bulk allocation of
> >>>>>>> high-order pages. For example, 4800 * order-4 pages.
> >>>>>>>
> >>>>>>> To meet the requirement, one option is using a CMA area, because
> >>>>>>> the page allocator with compaction under memory pressure easily
> >>>>>>> fails to meet the requirement and is too slow for 4800 calls.
> >>>>>>> However, CMA also has the following drawbacks:
> >>>>>>>
> >>>>>>> * 4800 order-4 cma_alloc calls are too slow
> >>>>>>>
> >>>>>>> To avoid the slowness, we could try to allocate 300M of contiguous
> >>>>>>> memory once and then split it into order-4 chunks.
> >>>>>>> The problem with this approach is that the CMA allocation fails if
> >>>>>>> one of the pages in that range couldn't migrate out, which happens
> >>>>>>> easily with fs writes under memory pressure.
> >>>>>>
> >>>>>> Why not choose a value in between? Like trying to allocate
> >>>>>> MAX_ORDER - 1 chunks and splitting them. That would already heavily
> >>>>>> reduce the call frequency.
> >>>>>
> >>>>> I think you meant this:
> >>>>>
> >>>>>     alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER - 1)
> >>>>>
> >>>>> It would work if the system has lots of non-fragmented free memory.
> >>>>> However, once it is fragmented, it doesn't work. That's why we have
> >>>>> easily seen even order-4 allocation failures in the field, and
> >>>>> that's why CMA was there.
> >>>>>
> >>>>> CMA has more logic to isolate the memory during allocation/freeing,
> >>>>> as well as fragmentation avoidance, so that it has less chance of
> >>>>> being stolen by others and a higher success ratio. That's why I want
> >>>>> this API to be used with CMA or the movable zone.
> >>>>
> >>>> I was talking about doing MAX_ORDER - 1 CMA allocations instead of one
> >>>> big 300M allocation. As you correctly note, memory placed into CMA
> >>>> should be movable, except for (short/long) term pinnings. In these
> >>>> cases, doing allocations smaller than 300M and splitting them up should
> >>>> be good enough to reduce the call frequency, no?
> >>>
> >>> I should have written that. The 300M I mentioned is really the minimum
> >>> size. In some scenarios, we need way bigger than 300M, up to several
> >>> GB. Furthermore, the demand will increase in the near future.
> >>
> >> And what will the driver do with that data besides providing it to the
> >> device? Can it be mapped to user space? I think we really need more
> >> information / the actual user.
> >>
> >>>>
> >>>>>
> >>>>> A usecase is a device that sets an exclusive CMA area up when the
> >>>>> system boots. When the device needs 4800 * order-4 pages, it could
> >>>>> call this bulk API against the area so that it is effectively
> >>>>> guaranteed to allocate enough pages fast.
> >>>>
> >>>> Just wondering
> >>>>
> >>>> a) Why does it have to be fast?
> >>>
> >>> That's because it's related to application latency, which ends up
> >>> making the user feel bad.
> >>
> >> Okay, but in theory, your device-needs are very similar to
> >> application-needs, besides you requiring order-4 pages, correct? Similar
> >> to an application that starts up and pins 300M (or more), just with
> >> order-4 pages.
> >
> > Yes.
> >
> >>
> >> I don't get quite yet why you need a range allocator for that. Because
> >> you intend to use CMA?
> >
> > Yes, with CMA it could be better guaranteed and fast enough with a
> > little tweaking. Currently, CMA is too slow due to the IPI overheads
> > below:
> >
> > 1. set_migratetype_isolate does drain_all_pages for every pageblock.
> > 2. __alloc_contig_migrate_range does migrate_prep
> > 3. alloc_contig_range does lru_add_drain_all.
> >
> > Thus, if we increase the call frequency as you suggest, the setup
> > overhead also scales up with the size. Such overhead makes sense if
> > the caller requests a big contiguous chunk of memory, but it's too
> > much for normal high-order allocations.
> >
> > Maybe we could optimize those call sites to reduce or remove those IPI
> > calls in a smarter way, but that would have to deal with the
> > success-ratio vs. fastness trade-off in the end.
> >
> > Another concern with using the existing cma API is that it tries to
> > make the allocation succeed at the cost of latency. For example,
> > waiting for page writeback.
> >
> > That's where this new API semantic comes from, as a compromise, since
> > I believe we need some way to separate the original CMA alloc (biased
> > to be guaranteed but slower) from this new API (biased to be fast but
> > less guaranteed).
> >
> > Is there any idea that works without tweaking the existing cma API?
>
> Let me try to summarize:
>
> 1. Your driver needs a lot of order-4 pages. And it needs them fast,
> because of observable lag/delay in an application. The pages will be
> unmovable by the driver.
>
> 2. Your idea is to use CMA, as that avoids unmovable allocations,
> theoretically allowing you to allocate all memory. But you don't
> actually want a large contiguous memory area.
>
> 3. Doing a whole bunch of order-4 cma allocations is slow.
>
> 4. Doing a single large cma allocation and splitting it manually in the
> caller can fail easily due to temporary page pinnings.
>
>
> Regarding 4., [1] comes to mind, which has the same issues with
> temporary page pinnings and solves it by simply retrying. Yeah, there
> will be some lag, but maybe it's overall faster than doing separate
> order-4 cma allocations?

Thanks for the pointer. However, temporary pinning is not the only
reason CMA allocations fail. Historically, there have been various
potential problems that turn "temporary" into "non-temporary", like
page writeback and indirect dependencies between objects.

>
> In general, proactive compaction [2] comes to mind, does that help?

I think it makes sense if such high-order allocations are dominant in
the system workload, because the TLB benefit would outweigh the cost of
the frequent migration overhead. However, that's not our usecase.

>
> [1]
> https://lore.kernel.org/r/1596682582-29139-2-git-send-email-cgoldswo@xxxxxxxxxxxxxx/
> [2] https://nitingupta.dev/post/proactive-compaction/
>

I understand the pfn stuff in the API is not pretty, but the concept
makes sense to me in that it goes through the *migratable area* and
tries hard to get pages of the requested order. It looks like a
GFP_NORETRY version of kmem_cache_alloc_bulk.

How about this?

    int cma_alloc(struct cma *cma, int order, unsigned int nr_elem,
                  struct page **pages);
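
To make the semantics I have in mind concrete, here is a rough sketch,
illustration only: alignment handling is omitted, cma->base_pfn and
cma->count are the area bounds as in mm/cma.h, and a real version would
batch the isolation/drain setup once per scan instead of paying the
per-call IPI cost for every chunk:

    int cma_alloc(struct cma *cma, int order, unsigned int nr_elem,
                  struct page **pages)
    {
            unsigned long end = cma->base_pfn + cma->count;
            unsigned long pfn = cma->base_pfn;
            unsigned int nr = 0;

            /*
             * Walk the area in (1 << order) steps and take whatever
             * chunks can be grabbed right now; skip busy spots instead
             * of retrying or waiting (e.g. on page writeback).
             */
            while (pfn + (1 << order) <= end && nr < nr_elem) {
                    if (!alloc_contig_range(pfn, pfn + (1 << order),
                                            MIGRATE_CMA,
                                            GFP_KERNEL | __GFP_NORETRY))
                            pages[nr++] = pfn_to_page(pfn);
                    pfn += 1 << order;
            }

            /* Like kmem_cache_alloc_bulk: report how many we got. */
            return nr;
    }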
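
From the driver side, usage could then look roughly like this, assuming
the return value is the number of elements actually allocated (dev_cma
is just a placeholder for the device's reserved area):

    struct page **pages;
    int i, nr;

    pages = kvcalloc(4800, sizeof(*pages), GFP_KERNEL);
    if (!pages)
            return -ENOMEM;

    /* Fast path: try to grab all 4800 order-4 chunks in one call. */
    nr = cma_alloc(dev_cma, 4, 4800, pages);
    if (nr < 4800) {
            /*
             * Less guaranteed by design: give back what we got and
             * fall back to a slower, more reliable path.
             */
            for (i = 0; i < nr; i++)
                    cma_release(dev_cma, pages[i], 1 << 4);
            kvfree(pages);
            return -EAGAIN;
    }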