Re: [PATCH 2/4] mm/vmalloc: add support for __GFP_NOFAIL

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed 27-10-21 19:55:50, Uladzislau Rezki wrote:
> On Tue, Oct 26, 2021 at 09:33:15PM +0200, Uladzislau Rezki wrote:
> > On Tue, Oct 26, 2021 at 06:28:52PM +0200, Michal Hocko wrote:
> > > On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote:
> > > > > From: Michal Hocko <mhocko@xxxxxxxx>
> > > > >
> > > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > > cannot fail and they do not fit into a single page.
> > > > >
> > > > > The larg part of the vmalloc implementation already complies with the
> > > > > given gfp flags so there is no work for those to be done. The area
> > > > > and page table allocations are an exception to that. Implement a retry
> > > > > loop for those.
> > > > >
> > > > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > > > a change to the vmalloc space change if the failure was caused by
> > > > > the space fragmentation or depletion. But there are multiple different
> > > > > reasons to retry and this could become much more complex. Keep the retry
> > > > > simple for now and just sleep to prevent from hogging CPUs.
> > > > >
> > > > > Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
> > > > > ---
> > > > >  mm/vmalloc.c | 10 +++++++++-
> > > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index c6cc77d2f366..602649919a9d 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > > > >         else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> > > > >                 flags = memalloc_noio_save();
> > > > >
> > > > > -       ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > > > +       do {
> > > > > +               ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > > >                         page_shift);
> > > > > +               if (ret < 0)
> > > > > +                       schedule_timeout_uninterruptible(1);
> > > > > +       } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
> > > > >
> > > > 
> > > > 1.
> > > > After that change a below code:
> > > > 
> > > > <snip>
> > > > if (ret < 0) {
> > > >     warn_alloc(orig_gfp_mask, NULL,
> > > >         "vmalloc error: size %lu, failed to map pages",
> > > >         area->nr_pages * PAGE_SIZE);
> > > >     goto fail;
> > > > }
> > > > <snip>
> > > > 
> > > > does not make any sense anymore.
> > > 
> > > Why? Allocations without __GFP_NOFAIL can still fail, no?
> > > 
> > Right. I meant one thing but wrote slightly differently. In case of
> > vmap_pages_range() fails(if __GFP_NOFAIL is set) should we emit any
> > warning message? Because either we can recover on a future iteration
> > or it stuck there infinitely so a user does not understand what happened.
> > From the other hand this is how __GFP_NOFAIL works, hm..
> > 
> > Another thing, i see that schedule_timeout_uninterruptible(1) is invoked
> > for all cases even when __GFP_NOFAIL is not set, in that scenario we do
> > not want to wait, instead we should return back to a caller asap. Or am
> > i missing something here?
> > 
> > > > 2.
> > > > Can we combine two places where we handle __GFP_NOFAIL into one place?
> > > > That would look like as more sorted out.
> > > 
> > > I have to admit I am not really fluent at vmalloc code so I wanted to
> > > make the code as simple as possible. How would I unwind all the allocated
> > > memory (already allocated as GFP_NOFAIL) before retrying at
> > > __vmalloc_node_range (if that is what you suggest). And isn't that a
> > > bit wasteful?
> > > 
> > > Or did you have anything else in mind?
> > >
> > It depends on how often all this can fail. But let me double check if
> > such combining is easy.
> > 
> I mean something like below. The idea is to not spread the __GFP_NOFAIL
> across the vmalloc file keeping it in one solid place:
> 
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d77830ff604c..f4b7927e217e 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2889,8 +2889,14 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	unsigned long array_size;
>  	unsigned int nr_small_pages = size >> PAGE_SHIFT;
>  	unsigned int page_order;
> +	unsigned long flags;
> +	int ret;
>  
>  	array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
> +
> +	/*
> +	 * This is i do not understand why we do not want to see warning messages.
> +	 */
>  	gfp_mask |= __GFP_NOWARN;

I suspect this is becauser vmalloc wants to have its own failure
reporting.

[...]
> @@ -3010,16 +3037,22 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  	area = __get_vm_area_node(real_size, align, shift, VM_ALLOC |
>  				  VM_UNINITIALIZED | vm_flags, start, end, node,
>  				  gfp_mask, caller);
> -	if (!area) {
> -		warn_alloc(gfp_mask, NULL,
> -			"vmalloc error: size %lu, vm_struct allocation failed",
> -			real_size);
> -		goto fail;
> -	}
> +	if (area)
> +		addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
> +
> +	if (!area || !addr) {
> +		if (gfp_mask & __GFP_NOFAIL) {
> +			schedule_timeout_uninterruptible(1);
> +			goto again;
> +		}
> +
> +		if (!area)
> +			warn_alloc(gfp_mask, NULL,
> +				"vmalloc error: size %lu, vm_struct allocation failed",
> +				real_size);
>  
> -	addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
> -	if (!addr)
>  		goto fail;
> +	}
>  
>  	/*
>  	 * In this function, newly allocated vm_struct has VM_UNINITIALIZED
> <snip>

OK, this looks easier from the code reading but isn't it quite wasteful
to throw all the pages backing the area (all of them allocated as
__GFP_NOFAIL) just to then fail to allocate few page tables pages and
drop all of that on the floor (this will happen in __vunmap AFAICS).

I mean I do not care all that strongly but it seems to me that more
changes would need to be done here and optimizations can be done on top.

Is this something you feel strongly about?
-- 
Michal Hocko
SUSE Labs




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux