On Mon, Jan 20, 2025 at 9:01 PM <jane.chu@xxxxxxxxxx> wrote:
>
>
> On 1/20/2025 5:21 PM, Jiaqi Yan wrote:
> > On Mon, Jan 20, 2025 at 2:59 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >> On 19.01.25 19:06, Jiaqi Yan wrote:
> >>> While I was working on userspace MFR via memfd [1], I spent some time
> >>> understanding what the current kernel does when a HugeTLB-backed memfd
> >>> is truncated. My expectation is, if there is a HWPoison HugeTLB folio
> >>> mapped via the memfd to userspace, it will be unmapped right away but
> >>> still be kept in the page cache [2]; however, when the memfd is
> >>> truncated to zero or after the memfd is closed, the kernel should
> >>> dissolve the HWPoison folio in the page cache, and free only the clean
> >>> raw pages to the buddy allocator, excluding the poisoned raw page.
> >>>
> >>> So I wrote a hugetlb-mfr-base.c selftest and expect:
> >>> 0. Say nr_hugepages is initially 64 as the system configuration.
> >>> 1. After MADV_HWPOISON, nr_hugepages should still be 64 as we keep even
> >>>    the HWPoison huge folio in the page cache. free_hugepages should be
> >>>    nr_hugepages minus whatever amount is in use.
> >>> 2. After truncating the memfd to zero, nr_hugepages should be reduced
> >>>    to 63 as the kernel dissolved and freed the HWPoison huge folio.
> >>>    free_hugepages should also be 63.
> >>>
> >>> However, when testing at the head of mm-stable commit 2877a83e4a0a
> >>> ("mm/hugetlb: use folio->lru int demote_free_hugetlb_folios()"), I
> >>> found that although free_hugepages is reduced to 63, nr_hugepages is
> >>> not reduced and stays at 64.
> >>>
> >>> Is my expectation outdated? Or is this some kind of bug?
> >>>
> >>> I assumed this is a bug and then dug a little bit more. It seems there
> >>> are two issues, or two things I don't really understand.
> >>>
> >>> 1. During try_memory_failure_hugetlb, we increase the target in-use
> >>>    folio's refcount via get_hwpoison_hugetlb_folio. However, by the
> >>>    end of try_memory_failure_hugetlb, this refcount is not put. I can
> >>>    make sense of this given we keep the in-use huge folio in the page
> >>>    cache.
> >> Isn't the general rule that hwpoisoned folios have a raised refcount
> >> such that they won't get freed + reused? At least that's how the buddy
> >> deals with them, and I suspect also hugetlb?
> >
> > Thanks, David.
> >
> > I see, so it is expected that the _entire_ huge folio will always have
> > at least a refcount of 1, even when the folio can become "free".
> >
> > For a *free* huge folio, try_memory_failure_hugetlb dissolves it and
> > frees the clean pages (a lot) to the buddy allocator. This made me
> > think the same thing would happen for an *in-use* huge folio
> > _eventually_ (i.e. somehow the refcount due to HWPoison can be put). I
> > feel this is a little bit unfortunate for the clean pages, but if it is
> > what it is, that's fair as it is not a bug.
>
> Agreed with David. For *in use* hugetlb pages, including unused shmget
> pages, hugetlb shouldn't dissolve the page, not until an explicit
> freeing action is taken, like RMID and echo 0 > nr_hugepages.

To clarify myself, I am not asking memory-failure.c to dissolve the
hugepage at the time it is in use, but rather when it becomes free
(truncated or process exited).
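
For reference, the flow I am describing looks roughly like the sketch
below. This is only an illustration, not the actual hugetlb-mfr-base.c:
it assumes the default 2MB hugepage size (1G pages would need e.g.
MFD_HUGE_1GB), requires privileges for MADV_HWPOISON, and omits most
error handling.

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define HPAGE_SIZE (2UL << 20)	/* assumption: 2MB default hugepages */

	int main(void)
	{
		int fd = memfd_create("hugetlb-mfr", MFD_HUGETLB);
		char *map;

		if (fd < 0)
			return 1;
		if (ftruncate(fd, HPAGE_SIZE))
			return 1;

		map = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		if (map == MAP_FAILED)
			return 1;

		memset(map, 0xab, HPAGE_SIZE);	/* fault in the huge folio */

		/* Step 1: inject poison into one raw page. The folio is
		 * unmapped but stays in the page cache, so nr_hugepages is
		 * expected to stay unchanged. */
		if (madvise(map, getpagesize(), MADV_HWPOISON))
			perror("madvise(MADV_HWPOISON)");

		/* Step 2: truncate to zero. The expectation under discussion
		 * is that the HWPoison folio is dissolved here, so both
		 * nr_hugepages and free_hugepages should drop by one. */
		if (ftruncate(fd, 0))
			perror("ftruncate");

		close(fd);
		return 0;
	}
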
>
> -jane
>
>
> >>> [ 1069.320976] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2780000
> >>> [ 1069.320978] head: order:18 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> >>> [ 1069.320980] flags: 0x400000000100044(referenced|head|hwpoison|node=0|zone=1)
> >>> [ 1069.320982] page_type: f4(hugetlb)
> >>> [ 1069.320984] raw: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
> >>> [ 1069.320985] raw: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
> >>> [ 1069.320987] head: 0400000000100044 ffffffff8760bbc8 ffffffff8760bbc8 0000000000000000
> >>> [ 1069.320988] head: 0000000000000000 0000000000000000 00000001f4000000 0000000000000000
> >>> [ 1069.320990] head: 0400000000000012 ffffdd53de000001 ffffffffffffffff 0000000000000000
> >>> [ 1069.320991] head: 0000000000040000 0000000000000000 00000000ffffffff 0000000000000000
> >>> [ 1069.320992] page dumped because: track hwpoison folio's ref
> >>>
> >>> 2. Even if the folio's refcount does drop to zero and we get into
> >>>    free_huge_folio, it is not clear to me which part of free_huge_folio
> >>>    handles the case where the folio is HWPoison. In my test, what I
> >>>    observed is that eventually the folio is enqueue_hugetlb_folio()-ed.
> >> How would we get a refcount of 0 if we assume the raised refcount on a
> >> hwpoisoned hugetlb folio?
> >>
> >> I'm probably missing something: are you saying that you can trigger a
> >> hwpoisoned hugetlb folio to get reallocated again, in upstream code?
> >
> > No, I think it is just my misunderstanding. From what you said, the
> > expectation for a HWPoison hugetlb folio is just that it won't get
> > reallocated again, which is true.
> >
> > My (wrong) expectation is that, in addition to the "won't get
> > reallocated again" part, some (large) portion of the huge folio will be
> > freed to the buddy allocator. On the other hand, is it something worth
> > having / improving? (1G - some_single_digit * 4KB) seems to be valuable
> > to the system, though they are all 4K pages. #1 and #2 above are then
> > what needs to be done if the improvement is worth chasing.
> >
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >>
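
As a footnote for anyone reproducing this: the nr_hugepages /
free_hugepages numbers quoted above can be read from sysfs. A minimal
reader, assuming the default 2MB hugepage size (for 1G pages the
directory is hugepages-1048576kB), could look like the following sketch.

	#include <stdio.h>

	static long read_counter(const char *path)
	{
		long val = -1;
		FILE *f = fopen(path, "r");

		if (f) {
			if (fscanf(f, "%ld", &val) != 1)
				val = -1;
			fclose(f);
		}
		return val;
	}

	int main(void)
	{
		const char *base = "/sys/kernel/mm/hugepages/hugepages-2048kB/";
		char path[128];

		snprintf(path, sizeof(path), "%snr_hugepages", base);
		printf("nr_hugepages:   %ld\n", read_counter(path));

		snprintf(path, sizeof(path), "%sfree_hugepages", base);
		printf("free_hugepages: %ld\n", read_counter(path));

		return 0;
	}
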