On Mon, Jun 03, 2019 at 10:53:34PM +0200, Uladzislau Rezki wrote:
> On Mon, Jun 03, 2019 at 07:53:12PM +0200, Uladzislau Rezki wrote:
> > Hello, Roman!
> >
> > On Wed, May 29, 2019 at 04:34:40PM +0000, Roman Gushchin wrote:
> > > On Wed, May 29, 2019 at 04:27:15PM +0200, Uladzislau Rezki wrote:
> > > > Hello, Roman!
> > > >
> > > > > On Mon, May 27, 2019 at 11:38:40AM +0200, Uladzislau Rezki (Sony) wrote:
> > > > > > Refactor the NE_FIT_TYPE split case when it comes to an
> > > > > > allocation of one extra object. We need it in order to
> > > > > > build the remaining space.
> > > > > >
> > > > > > Introduce the ne_fit_preload()/ne_fit_preload_end() functions
> > > > > > for preloading one extra vmap_area object to ensure that
> > > > > > we have it available when the fit type is NE_FIT_TYPE.
> > > > > >
> > > > > > The preload is done per CPU in non-atomic context, thus with
> > > > > > the GFP_KERNEL allocation mask. More permissive parameters can
> > > > > > be beneficial for systems which suffer from high memory
> > > > > > pressure or a low memory condition.
> > > > > >
> > > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
> > > > > > ---
> > > > > >  mm/vmalloc.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> > > > > >  1 file changed, 76 insertions(+), 3 deletions(-)
> > > > >
> > > > > Hi Uladzislau!
> > > > >
> > > > > This patch generally looks good to me (see some nits below),
> > > > > but it would be really great to add some motivation, e.g. numbers.
> > > > >
> > > > The main goal of this patch is to get rid of using GFP_NOWAIT, since
> > > > it is more restrictive due to the allocation from atomic context.
> > > > IMHO, if we can avoid using it, that is the right way to go.
> > > >
> > > > On the other hand, as I mentioned before, I have not seen any issues
> > > > with it on any of my test systems during the big rework. But it could
> > > > be beneficial for tiny systems which do not have any swap and are
> > > > limited in memory size.
> > >
> > > Ok, that makes sense to me. Is it possible to emulate such a tiny system
> > > on kvm and measure the benefits? Again, not a strong opinion here,
> > > but it will be easier to justify adding a good chunk of code.
> > >
> > It seems it is not as straightforward as it looks. I tried it before,
> > but usually the system panics due to being out of memory, or it just
> > invokes the OOM killer.
> >
> > I will upload a new version of it, where I embed the "preloading" logic
> > directly into the alloc_vmap_area() function.
> >
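> > The idea, in short, is as follows. This is only a sketch to illustrate
> > the approach: ne_fit_preload()/ne_fit_preload_end() are named in the
> > patch, but the per-CPU variable name and the other details below are
> > illustrative and may differ from the final code. One spare vmap_area
> > object is kept per CPU; it is refilled from non-atomic context with
> > GFP_KERNEL and consumed later from the atomic NE_FIT_TYPE split path:
> >
> > /* Somewhere in mm/vmalloc.c. */
> > static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
> >
> > /*
> >  * Preload one object for this CPU. Called from non-atomic context
> >  * before taking the vmap_area_lock, so GFP_KERNEL can be used.
> >  * Returns with preemption disabled.
> >  */
> > static void ne_fit_preload(int nid)
> > {
> >         preempt_disable();
> >
> >         if (!__this_cpu_read(ne_fit_preload_node)) {
> >                 struct vmap_area *node;
> >
> >                 /* GFP_KERNEL may sleep, so re-enable preemption. */
> >                 preempt_enable();
> >                 node = kmem_cache_alloc_node(vmap_area_cachep,
> >                                 GFP_KERNEL, nid);
> >
> >                 preempt_disable();
> >                 /*
> >                  * We could have migrated to another CPU that has
> >                  * already been preloaded meanwhile; in that case
> >                  * just free the extra object.
> >                  */
> >                 if (node && __this_cpu_cmpxchg(ne_fit_preload_node,
> >                                 NULL, node))
> >                         kmem_cache_free(vmap_area_cachep, node);
> >         }
> > }
> >
> > static void ne_fit_preload_end(void)
> > {
> >         preempt_enable();
> > }
> >
> > The NE_FIT_TYPE split case then takes the preloaded object first and
> > falls back to an atomic allocation only if it is not there:
> >
> >         lva = __this_cpu_xchg(ne_fit_preload_node, NULL);
> >         if (!lva)
> >                 lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
> >
> > The point is that GFP_KERNEL is allowed to sleep and enter direct
> > reclaim while GFP_NOWAIT is not, so doing the allocation ahead of time,
> > from non-atomic context, makes it much more likely to succeed under
> > high memory pressure.
> >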
> I just managed to simulate the faulty behavior of the GFP_NOWAIT
> restriction, resulting in a failure of the vmalloc allocation. Under
> heavy load, in a low memory condition and without swap, I can trigger
> the warning below on my KVM machine:
>
> <snip>
> [ 366.910037] Out of memory: Killed process 470 (bash) total-vm:21012kB, anon-rss:1700kB, file-rss:264kB, shmem-rss:0kB
> [ 366.910692] oom_reaper: reaped process 470 (bash), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [ 367.913199] stress-ng-fork: page allocation failure: order:0, mode:0x40800(GFP_NOWAIT|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
> [ 367.913206] CPU: 3 PID: 19951 Comm: stress-ng-fork Not tainted 5.2.0-rc3+ #999
> [ 367.913207] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [ 367.913208] Call Trace:
> [ 367.913215]  dump_stack+0x5c/0x7b
> [ 367.913219]  warn_alloc+0x108/0x190
> [ 367.913222]  __alloc_pages_slowpath+0xdc7/0xdf0
> [ 367.913226]  __alloc_pages_nodemask+0x2de/0x330
> [ 367.913230]  cache_grow_begin+0x77/0x420
> [ 367.913232]  fallback_alloc+0x161/0x200
> [ 367.913235]  kmem_cache_alloc+0x1c9/0x570
> [ 367.913237]  alloc_vmap_area+0x98b/0xa20
> [ 367.913240]  __get_vm_area_node+0xb0/0x170
> [ 367.913243]  __vmalloc_node_range+0x6d/0x230
> [ 367.913246]  ? _do_fork+0xce/0x3d0
> [ 367.913248]  copy_process.part.46+0x850/0x1b90
> [ 367.913250]  ? _do_fork+0xce/0x3d0
> [ 367.913254]  _do_fork+0xce/0x3d0
> [ 367.913257]  ? __do_page_fault+0x2bf/0x4e0
> [ 367.913260]  do_syscall_64+0x55/0x130
> [ 367.913263]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 367.913265] RIP: 0033:0x7f2a8248d38b
> [ 367.913268] Code: db 45 85 f6 0f 85 95 01 00 00 64 4c 8b 04 25 10 00 00 00 31 d2 4d 8d 90 d0 02 00 00 31 f6 bf 11 00 20 01 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 de 00 00 00 85 c0 41 89 c5 0f 85 e5 00 00
> [ 367.913269] RSP: 002b:00007fff1b058c30 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
> [ 367.913271] RAX: ffffffffffffffda RBX: 00007fff1b058c30 RCX: 00007f2a8248d38b
> [ 367.913272] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
> [ 367.913273] RBP: 00007fff1b058c80 R08: 00007f2a83d34300 R09: 00007fff1b1890a0
> [ 367.913274] R10: 00007f2a83d345d0 R11: 0000000000000246 R12: 0000000000000000
> [ 367.913275] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000000000000
> [ 367.913278] Mem-Info:
> [ 367.913282] active_anon:45795 inactive_anon:80706 isolated_anon:0
>  active_file:394 inactive_file:359 isolated_file:210
>  unevictable:2 dirty:0 writeback:0 unstable:0
>  slab_reclaimable:2691 slab_unreclaimable:21864
>  mapped:80835 shmem:80740 pagetables:50422 bounce:0
>  free:12185 free_pcp:776 free_cma:0
> [ 367.913286] Node 0 active_anon:183180kB inactive_anon:322824kB active_file:1576kB inactive_file:1436kB unevictable:8kB isolated(anon):0kB isolated(file):840kB mapped:323340kB dirty:0kB writeback:0kB shmem:322960kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [ 367.913287] Node 0 DMA free:4516kB min:724kB low:904kB high:1084kB active_anon:2384kB inactive_anon:0kB active_file:48kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:1256kB pagetables:4516kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> [ 367.913292] lowmem_reserve[]: 0 948 948 948
> [ 367.913294] Node 0 DMA32 free:44224kB min:44328kB low:55408kB high:66488kB active_anon:180252kB inactive_anon:322824kB active_file:992kB inactive_file:1332kB unevictable:8kB writepending:252kB present:1032064kB managed:995428kB mlocked:8kB kernel_stack:43260kB pagetables:197172kB bounce:0kB free_pcp:3252kB local_pcp:480kB free_cma:0kB
> [ 367.913299] lowmem_reserve[]: 0 0 0 0
> [ 367.913301] Node 0 DMA: 46*4kB (UM) 45*8kB (UM) 12*16kB (UM) 9*32kB (UM) 2*64kB (M) 2*128kB (UM) 2*256kB (M) 3*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 4480kB
> [ 367.913310] Node 0 DMA32: 966*4kB (UE) 552*8kB (UME) 648*16kB (UME) 265*32kB (UME) 75*64kB (UME) 12*128kB (ME) 1*256kB (U) 1*512kB (E) 1*1024kB (U) 2*2048kB (UM) 1*4096kB (M) = 43448kB
> [ 367.913322] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [ 367.913323] 81750 total pagecache pages
> [ 367.913324] 0 pages in swap cache
> [ 367.913325] Swap cache stats: add 0, delete 0, find 0/0
> [ 367.913325] Free swap = 0kB
> [ 367.913326] Total swap = 0kB
> [ 367.913327] 262014 pages RAM
> [ 367.913327] 0 pages HighMem/MovableOnly
> [ 367.913328] 9180 pages reserved
> [ 367.913329] 0 pages hwpoisoned
> [ 372.338733] systemd-journald[195]: /dev/kmsg buffer overrun, some messages lost.
> <snip>
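>
> The kmem_cache_alloc() that fails in the trace above is the NE_FIT_TYPE
> split case, where the extra vmap_area object has to be allocated in
> atomic context, under the vmap_area_lock. Roughly (a simplified excerpt,
> surrounding logic omitted):
>
>         } else if (type == NE_FIT_TYPE) {
>                 /*
>                  * Split no edge of fit VA. One extra object is
>                  * needed to build the remaining space.
>                  *
>                  * GFP_NOWAIT can neither sleep nor enter direct
>                  * reclaim, therefore under memory pressure this
>                  * is the allocation that fails.
>                  */
>                 lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
>                 if (!lva)
>                         return -1;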
>
> Whereas with the "preload" logic in place, I see only the OOM-killer-related
> messages:
>
> <snip>
> [ 136.787266] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=systemd-journal,pid=196,uid=0
> [ 136.787276] Out of memory: Killed process 196 (systemd-journal) total-vm:56832kB, anon-rss:512kB, file-rss:336kB, shmem-rss:820kB
> [ 136.790481] oom_reaper: reaped process 196 (systemd-journal), now anon-rss:0kB, file-rss:0kB, shmem-rss:820kB
> <snip>
>
> i.e. vmalloc is still able to allocate.
>
> Probably I need to update the commit message with this simulation and finding.

Ah, perfect! Then it makes total sense to me. Thanks!