Re: [PATCH] bpf: Try harder when allocating memory for maps

Daniel Borkmann <daniel@xxxxxxxxxxxxx> · Fri, 8 Mar 2019 12:30:47 +0100

On 03/08/2019 11:55 AM, Michal Hocko wrote:
> On Fri 08-03-19 11:33:00, Daniel Borkmann wrote:
>> On 03/08/2019 09:44 AM, Michal Hocko wrote:
>>> On Fri 08-03-19 09:08:57, Martynas Pumputis wrote:
>>
>> Martynas, for the patch, please also Cc netdev in the submission so
>> that it lands properly in patchwork. Setup where patches only Cc'ed
>> to bpf@xxxxxxxxxxxxxxx would land in our delegate is not yet completed
>> by ozlabs folks, just fyi.
>>
>>>> It has been observed that sometimes memory allocation for BPF maps
>>>> fails when there is no obvious memory pressure in a system.
>>>>
>>>> E.g. the map (BPF_MAP_TYPE_LRU_HASH, key=38, value=56, max_elems=524288)
>>>> could not be created due to due to vmalloc unable to allocate 75497472B,
>>>> when the system's memory consumption (in MB) was the following:
>>>>
>>>>     Total: 3942 Used: 837 (21.24%) Free: 138 Buffers: 239 Cached: 2727
>>>
>>> Hmm 75MB is quite large and much larger than the slab/page allocator
>>> cann provide so this is not really a fragmentation issue. Vmalloc does
>>
>> Agree.
>>
>>> respect noretry but considering that there shouldn't be a large memory
>>> pressure I wonder how NORETRY managed to fail the allocation. Do you
>>> happen to have the allocation failure report?
>>
>> I'll defer to Martynas here.
>>
>>> Btw. is there any real reason to opencode and duplicate kvmalloc logic
>>> here? In other words why not simply make bpf_map_area_alloc use
>>> kvmalloc_node with GFP_KERNEL?
>>
>> Mostly historical reasons from d407bd25a204 ("bpf: don't trigger OOM killer
>> under pressure with map alloc"). I remember back then we had a discussion
>> that __GFP_NORETRY is not fully supported and should only be seen as a hint
>> in our case (since it's not propagated all the way through in vmalloc, if
>> I recall correctly).
> 
> Yes, that is still the case and there is no way to really have nooom
> semantic for vmalloc. Even with your opencoded version btw.

Okay, so similar situation applies to __GFP_RETRY_MAYFAIL reclaim modifier
then if I understand you correctly? In dcda9b0471, its mentioned "this
means that all the reclaim opportunities have been exhausted except the
most disruptive one (the OOM killer) and a user defined fallback behavior
is more sensible than keep retrying in the page allocator." In the api
comment in kvmalloc_node(), it says "__GFP_RETRY_MAYFAIL is supported,
and it should be used only if kmalloc is preferable to the vmalloc fallback,
due to visible performance drawbacks". But if you say above that for
vmalloc, there is no nooom semantic, then presumably __GFP_RETRY_MAYFAIL
should also be taken as a hint wrt OOM, not a guarantee for the given
allocation request (if kvmalloc_node selects the vmalloc based allocator),
is that correct?

>> And looking at kvmalloc_node(), __GFP_NORETRY is only
>> really set in case of kmalloc attempts. Given these alloc requests for maps
>> can often be large in size, what we really want is something that ideally under
>> *no* circumstances oom killer would trigger as that is way too disruptive.
> 
> That is not really possible. Even if you do not trigger the OOM killer
> directly, som concurrent allocation might do that because your
> particular one has eaten the remaining memory.

Yes, in that situation there's not much we can do with either modifier.

>> So
>> instead, allocation should just fail and bpf loader or whatnot can deal with
>> it. Looks like __GFP_RETRY_MAYFAIL would be better suited wrt OOM for both
>> allocators and would allow to reuse kvmalloc though it would try much harder
>> than __GFP_NORETRY.
> 
> Yes.
> 
>> Ideally something like GFP_KERNEL | __GFP_NOWARN |
>> __GFP_NOOOM | __GFP_ZERO would be nice to have, semantics of __GFP_RETRY_MAYFAIL
>> kind of gets closer to it from looking at dcda9b0471.
> 
> NOOOM semantic is simply impossible to make it sensible as explained
> above.
> 
>>>> Considering dcda9b0471 ("mm, tree wide: replace __GFP_REPEAT by
>>>> __GFP_RETRY_MAYFAIL with more useful semantic") we can replace
>>>> __GFP_NORETRY with __GFP_RETRY_MAYFAIL, as it won't invoke OOM killer
>>>> and will try harder to fulfil allocation requests.
>>>>
>>>> The change has been tested with the workloads mentioned above and by
>>>> observing oom_kill value from /proc/vmstat.
>>>>
>>>> Signed-off-by: Martynas Pumputis <m@xxxxxxxxx>
>>>> ---
>>>>  kernel/bpf/syscall.c | 8 ++++----
>>>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>> index 62f6bced3a3c..eb5cefe44af3 100644
>>>> --- a/kernel/bpf/syscall.c
>>>> +++ b/kernel/bpf/syscall.c
>>>> @@ -136,11 +136,11 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
>>>>  
>>>>  void *bpf_map_area_alloc(size_t size, int numa_node)
>>>>  {
>>>> -	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
>>>> -	 * trigger under memory pressure as we really just want to
>>>> -	 * fail instead.
>>>> +	/* We definitely need __GFP_NORETRY or __GFP_RETRY_MAYFAIL, so
>>>> +	 * OOM killer doesn't trigger under memory pressure as we really
>>>> +	 * just want to fail instead.
>>>>  	 */
>>>> -	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
>>>> +	const gfp_t flags = __GFP_NOWARN | __GFP_RETRY_MAYFAIL | __GFP_ZERO;
>>>>  	void *area;
>>>>  
>>>>  	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>>>> -- 
>>>> 2.21.0
>>>>
>>>
>