On 3/12/21 12:15 AM, Michal Hocko wrote:
> On Thu 11-03-21 14:53:08, Mike Kravetz wrote:
>> On 3/11/21 9:59 AM, Mike Kravetz wrote:
>>> On 3/11/21 4:17 AM, Michal Hocko wrote:
>>>>> Yeah per cpu preempt counting shouldn't be noticeable but I have to
>>>>> confess I haven't benchmarked it.
>>>>
>>>> But all this seems moot now
>>>> http://lkml.kernel.org/r/YEoA08n60+jzsnAl@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>>>
>>> The proper fix for free_huge_page, independent of this series, would
>>> involve:
>>>
>>> - Make hugetlb_lock and the subpool lock irq safe
>>> - Hand off freeing to a workqueue if the freeing could sleep
>>>
>>> Today, the only time we can sleep in free_huge_page is for gigantic
>>> pages allocated via cma.  I 'think' the concern about undesirable
>>> user visible side effects in this case is minimal, as freeing/allocating
>>> 1G pages is not something that is going to happen at a high frequency.
>>> My thinking could be wrong?
>>>
>>> Of more concern is the introduction of this series.  If this feature
>>> is enabled, then ALL free_huge_page requests must be sent to a
>>> workqueue.  Any ideas on how to address this?
>>
>> Thinking about this more ...
>>
>> A call to free_huge_page has two distinct outcomes:
>> 1) Page is freed back to the original allocator: buddy or cma
>> 2) Page is put on the hugetlb free list
>>
>> We can only possibly sleep in the first case (1).  In addition, freeing
>> a page back to the original allocator involves these steps:
>> 1) Removing the page from hugetlb lists
>> 2) Updating hugetlb counts: nr_hugepages, surplus
>> 3) Updating page fields
>> 4) Allocating vmemmap pages if needed, as in this series
>> 5) Calling the free routine of the original allocator
>>
>> If hugetlb_lock is irq safe, we can perform the first 3 steps under that
>> lock without issue.  We would then use a workqueue to perform the last
>> two steps.  Since we are updating hugetlb user visible data under the
>> lock, there should be no delays.  Of course, giving those pages back to
>> the original allocator could still be delayed, and a user may notice
>> that.  Not sure if that would be acceptable?
>
> Well, having many in-flight huge pages can certainly be visible.  Say you
> are freeing hundreds of huge pages and your echo n > nr_hugepages will
> return, just for you to find out that the memory hasn't been freed and
> therefore cannot be reused for another use - recently there was somebody
> mentioning their usecase of freeing up huge pages to prevent OOM, for
> example.  I do expect more people doing something like that.
>
> Now, nr_hugepages can be handled by blocking on the same WQ until all
> pre-existing items are processed.  Maybe we will need a more generic API
> to achieve the same for in-kernel users, but let's wait for those
> requests.
>
>> I think Muchun had a similar setup just for vmemmap allocation in an
>> early version of this series.
>>
>> This would also require changes to where accounting is done in
>> dissolve_free_huge_page and update_and_free_page, as mentioned
>> elsewhere.
>
> Normalizing dissolve_free_huge_page is definitely a good idea.  It is
> really tricky how it sticks out and does half of the job of
> update_and_free_page.
>
> That being said, if it is possible to have a fully consistent h state
> before handing over to the WQ for the sleeping operation, then we should
> be all fine.  I am slightly worried about potential tricky situations
> where the sleeping operation fails, because that would require that page
> to be added back to the pool again.
> As said above we would need some sort of sync with in-flight operations
> before returning to the userspace.

Those sysfs interfaces to allocate/free huge pages will need to be
reworked.  One thing that is totally unacceptable with hugetlb_lock being
irq safe is the calls to cond_resched_lock(&hugetlb_lock).  We will need
to significantly reduce lock hold time in these situations.  I have some
ideas on how this might work, but it is going to require a good deal of
code restructuring and will take some time.
--
Mike Kravetz
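
To make the split described above concrete, here is a minimal sketch:
steps 1-3 performed under an irq-safe hugetlb_lock, steps 4-5 deferred to
a workqueue.  This is only a sketch of the shape, not working mm/hugetlb.c
code: remove_hugetlb_page() and update_and_free_page_slow() are placeholder
names for the bookkeeping and for the sleeping part of the free, and the
case where the page just goes back on a hugetlb free list is omitted.

#include <linux/spinlock.h>
#include <linux/workqueue.h>
#include <linux/llist.h>
#include <linux/mm.h>

/* Sketch only: placeholder declarations for the real hugetlb helpers. */
void remove_hugetlb_page(struct page *page);        /* steps 1-3 body */
void update_and_free_page_slow(struct page *page);  /* steps 4-5, may sleep */

static DEFINE_SPINLOCK(hugetlb_lock);   /* would become irq safe */
static LLIST_HEAD(hpage_freelist);      /* pages awaiting the sleeping free */

static void free_hpage_workfn(struct work_struct *work)
{
	struct llist_node *node = llist_del_all(&hpage_freelist);

	while (node) {
		struct page *page;

		page = container_of((struct address_space **)node,
				    struct page, mapping);
		node = node->next;
		page->mapping = NULL;

		/* Steps 4-5: vmemmap allocation and return to buddy/cma. */
		update_and_free_page_slow(page);
	}
}
static DECLARE_WORK(free_hpage_work, free_hpage_workfn);

void free_huge_page(struct page *page)
{
	unsigned long flags;

	/*
	 * Steps 1-3: remove from hugetlb lists, update nr_hugepages and
	 * surplus counts, fix up page fields.  Nothing here sleeps, so it
	 * can run under an irq-safe lock.
	 */
	spin_lock_irqsave(&hugetlb_lock, flags);
	remove_hugetlb_page(page);
	spin_unlock_irqrestore(&hugetlb_lock, flags);

	/* Defer the part that can sleep; reuse page->mapping as llist node. */
	llist_add((struct llist_node *)&page->mapping, &hpage_freelist);
	schedule_work(&free_hpage_work);
}

With something of this shape, the nr_hugepages visibility concern raised
above could be handled by having the sysfs store path flush_work() on
free_hpage_work before returning, so userspace does not observe pages
still in flight.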