On Thu, Jan 11, 2024 at 12:48 AM Muhammad Usama Anjum
<usama.anjum@xxxxxxxxxxxxx> wrote:
>
> On 1/11/24 7:32 AM, Sidhartha Kumar wrote:
> > On 1/10/24 2:15 AM, Muhammad Usama Anjum wrote:
> >> On 1/10/24 11:49 AM, Muhammad Usama Anjum wrote:
> >>> On 1/6/24 2:13 AM, Jiaqi Yan wrote:
> >>>> On Thu, Jan 4, 2024 at 10:27 PM Muhammad Usama Anjum
> >>>> <usama.anjum@xxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I'm trying to convert this test to TAP as I think the failures
> >>>>> sometimes go unnoticed on CI systems if we only depend on the
> >>>>> return value of the application. I've enabled the following
> >>>>> configurations which aren't already present in
> >>>>> tools/testing/selftests/mm/config:
> >>>>> CONFIG_MEMORY_FAILURE=y
> >>>>> CONFIG_HWPOISON_INJECT=m
> >>>>>
> >>>>> I'll send a patch to add these configs later. Right now I'm trying to
> >>>>> investigate the failure when we are trying to inject the poison page by
> >>>>> madvise(MADV_HWPOISON). I'm getting device busy every single time. The
> >>>>> test fails as it doesn't expect any business for the hugetlb memory.
> >>>>> I'm not sure if the poison handling code has issues or test isn't
> >>>>> robust enough.
> >>>>>
> >>>>> ./hugetlb-read-hwpoison
> >>>>> Write/read chunk size=0x800
> >>>>> ... HugeTLB read regression test...
> >>>>> ... ... expect to read 0x200000 bytes of data in total
> >>>>> ... ... actually read 0x200000 bytes of data in total
> >>>>> ... HugeTLB read regression test...TEST_PASSED
> >>>>> ... HugeTLB read HWPOISON test...
> >>>>> [ 9.280854] Injecting memory failure for pfn 0x102f01 at process virtual address 0x7f28ec101000
> >>>>> [ 9.282029] Memory failure: 0x102f01: huge page still referenced by 511 users
> >>>>> [ 9.282987] Memory failure: 0x102f01: recovery action for huge page: Failed
> >>>>> ... !!! MADV_HWPOISON failed: Device or resource busy
> >>>>> ... HugeTLB read HWPOISON test...TEST_FAILED
> >>>>>
> >>>>> I'm testing on v6.7-rc8. Not sure if this was working previously or not.
> >>>>
> >>>> Thanks for reporting this, Usama!
> >>>>
> >>>> I am also able to repro MADV_HWPOISON failure at "501a06fe8e4c
> >>>> (akpm/mm-stable, mm-stable) zswap: memcontrol: implement zswap
> >>>> writeback disabling."
> >>>>
> >>>> Then I checked out the earliest commit "ba91e7e5d15a (HEAD -> Base)
> >>>> selftests/mm: add tests for HWPOISON hugetlbfs read". The
> >>>> MADV_HWPOISON injection works and the test passes:
> >>>>
> >>>> ... HugeTLB read HWPOISON test...
> >>>> ... ... expect to read 0x101000 bytes of data in total
> >>>> ... !!! read failed: Input/output error
> >>>> ... ... actually read 0x101000 bytes of data in total
> >>>> ... HugeTLB read HWPOISON test...TEST_PASSED
> >>>> ... HugeTLB seek then read HWPOISON test...
> >>>> ... ... init val=4 with offset=0x102000
> >>>> ... ... expect to read 0xfe000 bytes of data in total
> >>>> ... ... actually read 0xfe000 bytes of data in total
> >>>> ... HugeTLB seek then read HWPOISON test...TEST_PASSED
> >>>> ...
> >>>>
> >>>> [ 2109.209225] Injecting memory failure for pfn 0x3190d01 at process virtual address 0x7f75e3101000
> >>>> [ 2109.209438] Memory failure: 0x3190d01: recovery action for huge page: Recovered
> >>>> ...
> >>>>
> >>>> I think something in between broke MADV_HWPOISON on hugetlbfs, and we
> >>>> should be able to figure it out via bisection (and of course by
> >>>> reading delta commits between them, probably related to page
> >>>> refcount).
> >>> Thank you for this information.
> >>>
> >>>>
> >>>> That being said, I will be on vacation from tomorrow until the end of
> >>>> next week. So I will get back to this after next weekend. Meanwhile if
> >>>> you want to go ahead and bisect the problematic commit, that will be
> >>>> very much appreciated.
> >>> I'll try to bisect and post here if I find something.
> >> Found the culprit commit by bisection:
> >>
> >> a08c7193e4f18dc8508f2d07d0de2c5b94cb39a3
> >> mm/filemap: remove hugetlb special casing in filemap.c

Thanks Usama!

> >>
> >> hugetlb-read-hwpoison started failing from this patch. I've added the
> >> author of this patch to this bug report.
> >>
> > Hi Usama,
> >
> > Thanks for pointing this out. After debugging, the below diff seems to fix
> > the issue and allows the tests to pass again. Could you test it on your
> > configuration as well just to confirm.
> >
> > Thanks,
> > Sidhartha
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index 36132c9125f9..3a248e4f7e93 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -340,7 +340,7 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >                 } else {
> >                         folio_unlock(folio);
> >
> > -                       if (!folio_test_has_hwpoisoned(folio))
> > +                       if (!folio_test_hwpoison(folio))

Sidhartha, just curious why this change is needed? Does
PageHasHWPoisoned change after commit
"a08c7193e4f18dc8508f2d07d0de2c5b94cb39a3"?

> >                                 want = nr;
> >                         else {
> >                                 /*
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index d8c853b35dbb..87f6bf7d8bc1 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -973,7 +973,7 @@ struct page_state {
> >  static bool has_extra_refcount(struct page_state *ps, struct page *p,
> >                                 bool extra_pins)
> >  {
> > -       int count = page_count(p) - 1;
> > +       int count = page_count(p) - folio_nr_pages(page_folio(p));
> >
> >         if (extra_pins)
> >                 count -= 1;
> >
> Tested the patch, it fixes the test. Please send this patch.
>
> Tested-by: Muhammad Usama Anjum <usama.anjum@xxxxxxxxxxxxx>
>
> --
> BR,
> Muhammad Usama Anjum