On Fri 20-11-20 17:37:09, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 5:28 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > On Fri 20-11-20 16:51:59, Muchun Song wrote:
> > > On Fri, Nov 20, 2020 at 4:11 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > > >
> > > > On Fri 20-11-20 14:43:15, Muchun Song wrote:
> > > > [...]
> > > > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > > > > index eda7e3a0b67c..361c4174e222 100644
> > > > > --- a/mm/hugetlb_vmemmap.c
> > > > > +++ b/mm/hugetlb_vmemmap.c
> > > > > @@ -117,6 +117,8 @@
> > > > >  #define RESERVE_VMEMMAP_NR 2U
> > > > >  #define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> > > > >  #define TAIL_PAGE_REUSE -1
> > > > > +#define GFP_VMEMMAP_PAGE \
> > > > > + (GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
> > > >
> > > > This is really dangerous! __GFP_MEMALLOC would allow a complete memory
> > > > depletion. I am not even sure triggering the OOM killer is a reasonable
> > > > behavior. It is just unexpected that shrinking a hugetlb pool can have
> > > > destructive side effects. I believe it would be more reasonable to
> > > > simply refuse to shrink the pool if we cannot free those pages up. This
> > > > sucks as well but it isn't destructive at least.
> > >
> > > I find the instructions of __GFP_MEMALLOC from the kernel doc.
> > >
> > > %__GFP_MEMALLOC allows access to all memory. This should only be used when
> > > the caller guarantees the allocation will allow more memory to be freed
> > > very shortly.
> > >
> > > Our situation is in line with the description above. We will free a HugeTLB page
> > > to the buddy allocator which is much larger than that we allocated shortly.
> >
> > Yes that is a part of the description. But read it in its full entirety.
> > * %__GFP_MEMALLOC allows access to all memory. This should only be used when
> > * the caller guarantees the allocation will allow more memory to be freed
> > * very shortly e.g. process exiting or swapping. Users either should
> > * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
> > * Users of this flag have to be extremely careful to not deplete the reserve
> > * completely and implement a throttling mechanism which controls the
> > * consumption of the reserve based on the amount of freed memory.
> > * Usage of a pre-allocated pool (e.g. mempool) should be always considered
> > * before using this flag.
> >
> > GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_HIGH
>
> We want to free the HugeTLB page to the buddy allocator, but before that,
> we need to allocate some pages as vmemmap pages, so here we cannot
> handle allocation failures.

Why cannot you simply refuse to shrink the pool size?

> I think that we should replace the
> __GFP_RETRY_MAYFAIL to __GFP_NOFAIL.
>
> GFP_KERNEL | __GFP_NOFAIL | __GFP_HIGH
>
> This meets our needs here. Thanks.

Please read again my concern about the disruptive behavior or explain
why it is desirable.
-- 
Michal Hocko
SUSE Labs