On Fri 08-03-19 12:30:47, Daniel Borkmann wrote: > On 03/08/2019 11:55 AM, Michal Hocko wrote: > > On Fri 08-03-19 11:33:00, Daniel Borkmann wrote: > >> On 03/08/2019 09:44 AM, Michal Hocko wrote: > >>> On Fri 08-03-19 09:08:57, Martynas Pumputis wrote: > >> > >> Martynas, for the patch, please also Cc netdev in the submission so > >> that it lands properly in patchwork. Setup where patches only Cc'ed > >> to bpf@xxxxxxxxxxxxxxx would land in our delegate is not yet completed > >> by ozlabs folks, just fyi. > >> > >>>> It has been observed that sometimes memory allocation for BPF maps > >>>> fails when there is no obvious memory pressure in a system. > >>>> > >>>> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288) > >>>> could not be created due to due to vmalloc unable to allocate 75497472B, > >>>> when the system's memory consumption (in MB) was the following: > >>>> > >>>> Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727 > >>> > >>> Hmm 75MB is quite large and much larger than the slab/page allocator > >>> cann provide so this is not really a fragmentation issue. Vmalloc does > >> > >> Agree. > >> > >>> respect noretry but considering that there shouldn't be a large memory > >>> pressure I wonder how NORETRY managed to fail the allocation. Do you > >>> happen to have the allocation failure report? > >> > >> I'll defer to Martynas here. > >> > >>> Btw. is there any real reason to opencode and duplicate kvmalloc logic > >>> here? In other words why not simply make bpf_map_area_alloc use > >>> kvmalloc_node with GFP_KERNEL? > >> > >> Mostly historical reasons from d407bd25a204 ("bpf: don't trigger OOM killer > >> under pressure with map alloc"). I remember back then we had a discussion > >> that __GFP_NORETRY is not fully supported and should only be seen as a hint > >> in our case (since it's not propagated all the way through in vmalloc, if > >> I recall correctly). > > > > Yes, that is still the case and there is no way to really have nooom > > semantic for vmalloc. Even with your opencoded version btw. > > Okay, so similar situation applies to __GFP_RETRY_MAYFAIL reclaim modifier > then if I understand you correctly? In dcda9b0471, its mentioned "this > means that all the reclaim opportunities have been exhausted except the > most disruptive one (the OOM killer) and a user defined fallback behavior > is more sensible than keep retrying in the page allocator." In the api > comment in kvmalloc_node(), it says "__GFP_RETRY_MAYFAIL is supported, > and it should be used only if kmalloc is preferable to the vmalloc fallback, > due to visible performance drawbacks". But if you say above that for > vmalloc, there is no nooom semantic, then presumably __GFP_RETRY_MAYFAIL > should also be taken as a hint wrt OOM, not a guarantee for the given > allocation request (if kvmalloc_node selects the vmalloc based allocator), > is that correct? Yes, kvmalloc* doesn't have the full MAYFAIL/NORETRY semantic because this is not possible to implement without reworking the whole page table allocation code or making both behave like scoped NOFS/NOIO. But I believe you shouldn't really have to care. Does the same problem really happen with a plain kvmalloc(GFP_KERNEL)? -- Michal Hocko SUSE Labs