On 3/23/21 12:57 AM, Michal Hocko wrote:
> On Mon 22-03-21 16:28:07, Mike Kravetz wrote:
>> On 3/22/21 7:31 AM, Michal Hocko wrote:
>>> On Fri 19-03-21 15:42:06, Mike Kravetz wrote:
>>> [...]
>>>> @@ -2090,9 +2084,15 @@ static void return_unused_surplus_pages(struct hstate *h,
>>>>  	while (nr_pages--) {
>>>>  		h->resv_huge_pages--;
>>>>  		unused_resv_pages--;
>>>> -		if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
>>>> +		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
>>>> +		if (!page)
>>>>  			goto out;
>>>> -		cond_resched_lock(&hugetlb_lock);
>>>> +
>>>> +		/* Drop lock and free page to buddy as it could sleep */
>>>> +		spin_unlock(&hugetlb_lock);
>>>> +		update_and_free_page(h, page);
>>>> +		cond_resched();
>>>> +		spin_lock(&hugetlb_lock);
>>>>  	}
>>>> 
>>>>  out:
>>>
>>> This is likely a matter of taste but the repeated pattern of unlock,
>>> update_and_free_page, cond_resched and lock seems rather clumsy.
>>> Would it be slightly better/nicer to remove_pool_huge_page into a
>>> list_head under a single lock invocation and then free up the whole lot
>>> after the lock is dropped?
>>
>> Yes, we can certainly do that.
>> One downside I see is that the list can contain a bunch of pages not
>> accounted for in hugetlb and not free in buddy (or cma).  Ideally, we
>> would want to keep those in sync if possible.  Also, the commit that
>> added the cond_resched talked about freeing up 12 TB worth of huge pages
>> and holding the lock for 150 seconds while doing so.  The new code does
>> not hold the lock while freeing to buddy, but I wonder how long it would
>> take to remove 12 TB worth of huge pages and add them to a separate list?
> 
> Well, remove_pool_huge_page is just the accounting part and that
> should be pretty invisible even when the number of pages is large.  The
> lockless nature (from the hugetlb POV) of the final page release is the
> heavyweight operation, and whether you do it in chunks or in a single go
> (with cond_resched) should not be visible either.  We already do the same
> thing when uncharging memcg pages (mem_cgroup_uncharge_list).
> 
> So I would agree with you that this would be a much bigger problem if
> both the hugetlb and freeing paths were equally heavyweight and the
> delay between the first pages being unaccounted and freed would be
> noticeable.
> 
> But I do not want to push for this.  I just hated the hugetlb_lock
> dances as this is an ugly and repetitive pattern.

As you may have seen in my reply to patch 3, I am going to use this
batching approach for all places where we remove/free hugetlb pages.

Since you brought up cgroups ... what is your opinion on lock hold time
in hugetlb_cgroup_css_offline?  We could potentially be calling
hugetlb_cgroup_move_parent for every hugetlb page while holding the
lock with interrupts disabled.
--
Mike Kravetz
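
[Editor's sketch for concreteness: the batching Michal suggests could look
roughly like the code below in return_unused_surplus_pages().  This is
illustrative only, not the code from the series; it reuses the helper names
from the patch (remove_pool_huge_page, update_and_free_page) and assumes,
as the existing function does, that the caller holds hugetlb_lock.]

/*
 * Sketch: unaccount surplus pages and collect them on a local list
 * under a single hugetlb_lock hold, then free the whole batch to
 * buddy after the lock is dropped.
 */
static void return_unused_surplus_pages(struct hstate *h,
					unsigned long unused_resv_pages)
{
	unsigned long nr_pages;
	struct page *page, *next;
	LIST_HEAD(page_list);

	/* ... gigantic-page and reservation checks elided ... */

	nr_pages = min(unused_resv_pages, h->surplus_huge_pages);

	while (nr_pages--) {
		h->resv_huge_pages--;
		unused_resv_pages--;
		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
		if (!page)
			break;
		/* No longer accounted in the pool; park it on the list. */
		list_add(&page->lru, &page_list);
	}

	/*
	 * Free the batch without hugetlb_lock held: update_and_free_page()
	 * may sleep, and cond_resched() keeps a large batch from
	 * monopolizing the CPU.
	 */
	spin_unlock(&hugetlb_lock);
	list_for_each_entry_safe(page, next, &page_list, lru) {
		list_del(&page->lru);
		update_and_free_page(h, page);
		cond_resched();
	}
	spin_lock(&hugetlb_lock);
}

[This also makes the downside Mike raises visible: while a page sits on
page_list it is accounted neither in hugetlb nor in buddy, so the window
between unaccounting and freeing grows with the batch size.]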