On 12/16/19 10:27 AM, Waiman Long wrote:
> The following lockdep splat was observed when a certain hugetlbfs test
> was run:
<snip>
> This patch implements the deferred freeing by adding a
> free_hpage_workfn() work function to do the actual freeing. The
> free_huge_page() call in a non-task context saves the page to be freed
> in the hpage_freelist linked list in a lockless manner.
>
> The generic workqueue is used to process the work, but a dedicated
> workqueue can be used instead if it is desirable to have the huge page
> freed ASAP.
>
<snip>
>
> +/*
> + * As free_huge_page() can be called from a non-task context, we have
> + * to defer the actual freeing in a workqueue to prevent potential
> + * hugetlb_lock deadlock.
> + *
> + * free_hpage_workfn() locklessly retrieves the linked list of pages to
> + * be freed and frees them one-by-one. As the page->mapping pointer is
> + * going to be cleared in __free_huge_page() anyway, it is reused as the
> + * next pointer of a singly linked list of huge pages to be freed.
> + */
> +#define NEXT_PENDING	((struct page *)-1)
> +static struct page *hpage_freelist;
> +
> +static void free_hpage_workfn(struct work_struct *work)
> +{
> +	struct page *curr, *next;
> +	int cnt = 0;
> +
> +	do {
> +		curr = xchg(&hpage_freelist, NULL);
> +		if (!curr)
> +			break;
> +
> +		while (curr) {
> +			next = (struct page *)READ_ONCE(curr->mapping);
> +			if (next == NEXT_PENDING) {
> +				cpu_relax();
> +				continue;
> +			}
> +			__free_huge_page(curr);
> +			curr = next;
> +			cnt++;
> +		}
> +	} while (!READ_ONCE(hpage_freelist));
> +
> +	if (!cnt)
> +		return;
> +	pr_debug("HugeTLB: free_hpage_workfn() frees %d huge page(s)\n", cnt);
> +}
> +static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> +
> +void free_huge_page(struct page *page)
> +{
> +	/*
> +	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> +	 */
> +	if (!in_task()) {
> +		struct page *next;
> +
> +		page->mapping = (struct address_space *)NEXT_PENDING;
> +		next = xchg(&hpage_freelist, page);
> +		WRITE_ONCE(page->mapping, (struct address_space *)next);
> +		schedule_work(&free_hpage_work);
> +		return;
> +	}

As Andrew mentioned, the design for the lockless queueing could use more
explanation.  I had to draw some diagrams before I felt relatively
confident in the design; a sketch of what I drew is at the end of this
mail for other reviewers.

> +
> +	/*
> +	 * Racing may prevent some deferred huge pages in hpage_freelist
> +	 * from being freed. Check here and call schedule_work() if that
> +	 * is the case.
> +	 */
> +	if (unlikely(hpage_freelist && !work_pending(&free_hpage_work)))
> +		schedule_work(&free_hpage_work);

Can you describe the race that would leave deferred huge pages sitting
on hpage_freelist?  I am having a hard time determining how that can
happen.  And if it can indeed happen, what becomes of a page that is
'stuck' there if free_huge_page() never gets called again?  Do we need
to take that case into account?

Overall, I like the design and hope this will work.  I have been testing
a 'modified' version of the patch that always does the deferred freeing.
The modification exists purely to stress the code (a sketch of it is
also at the end of this mail).  So far, I have not found any issues in
any of my testing.
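As promised, here is roughly the picture I ended up drawing.  It is my
own sketch of the scheme, not something taken from the patch itself.
The producer publishes a page in two steps, and NEXT_PENDING is the
marker that lets the worker detect the window between those steps:

    free_huge_page()                      free_hpage_workfn()
    (non-task context)                    (workqueue)
    ------------------                    -------------------
    page->mapping = NEXT_PENDING
    next = xchg(&hpage_freelist, page)
      /* page is now the list head and
         visible to the worker, but its
         link is not yet valid */
                                          curr = xchg(&hpage_freelist, NULL)
                                          READ_ONCE(curr->mapping)
                                            == NEXT_PENDING
                                            -> cpu_relax(), re-read
    WRITE_ONCE(page->mapping, next)
      /* real link now published */
                                          READ_ONCE(curr->mapping) == next
                                            -> __free_huge_page(curr)
                                            -> curr = next

Because page->mapping doubles as the next pointer, the list is a simple
LIFO stack, and the batch grabbed by the worker's xchg() is
self-contained once every producer has published its link.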
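FWIW, the stress modification is nothing more exotic than forcing the
deferred path unconditionally, something along these lines:

-	if (!in_task()) {
+	if (1) {	/* stress test: always take the deferred path */

so that every free_huge_page() call exercises the lockless queueing and
the workqueue, not just the rare non-task-context frees.
--
Mike Kravetz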