On 1/10/24 3:15 PM, Muhammad Usama Anjum wrote:
> On 1/10/24 11:49 AM, Muhammad Usama Anjum wrote:
>> On 1/6/24 2:13 AM, Jiaqi Yan wrote:
>>> On Thu, Jan 4, 2024 at 10:27 PM Muhammad Usama Anjum
>>> <usama.anjum@xxxxxxxxxxxxx> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to convert this test to TAP as I think the failures sometimes go
>>>> unnoticed on CI systems if we only depend on the return value of the
>>>> application. I've enabled the following configurations which aren't already
>>>> present in tools/testing/selftests/mm/config:
>>>> CONFIG_MEMORY_FAILURE=y
>>>> CONFIG_HWPOISON_INJECT=m
>>>>
>>>> I'll send a patch to add these configs later. Right now I'm trying to
>>>> investigate the failure when we are trying to inject the poison page by
>>>> madvise(MADV_HWPOISON). I'm getting device busy every single time. The test
>>>> fails as it doesn't expect any business for the hugetlb memory. I'm not
>>>> sure if the poison handling code has issues or test isn't robust enough.
>>>>
>>>> ./hugetlb-read-hwpoison
>>>> Write/read chunk size=0x800
>>>> ... HugeTLB read regression test...
>>>> ... ... expect to read 0x200000 bytes of data in total
>>>> ... ... actually read 0x200000 bytes of data in total
>>>> ... HugeTLB read regression test...TEST_PASSED
>>>> ... HugeTLB read HWPOISON test...
>>>> [ 9.280854] Injecting memory failure for pfn 0x102f01 at process virtual
>>>> address 0x7f28ec101000
>>>> [ 9.282029] Memory failure: 0x102f01: huge page still referenced by 511
>>>> users
>>>> [ 9.282987] Memory failure: 0x102f01: recovery action for huge page: Failed
>>>> ... !!! MADV_HWPOISON failed: Device or resource busy
>>>> ... HugeTLB read HWPOISON test...TEST_FAILED
>>>>
>>>> I'm testing on v6.7-rc8. Not sure if this was working previously or not.
>>>
>>> Thanks for reporting this, Usama!
>>>
>>> I am also able to repro MADV_HWPOISON failure at "501a06fe8e4c
>>> (akpm/mm-stable, mm-stable) zswap: memcontrol: implement zswap
>>> writeback disabling."
>>>
>>> Then I checked out the earliest commit "ba91e7e5d15a (HEAD -> Base)
>>> selftests/mm: add tests for HWPOISON hugetlbfs read". The
>>> MADV_HWPOISON injection works and the test passes:
>>>
>>> ... HugeTLB read HWPOISON test...
>>> ... ... expect to read 0x101000 bytes of data in total
>>> ... !!! read failed: Input/output error
>>> ... ... actually read 0x101000 bytes of data in total
>>> ... HugeTLB read HWPOISON test...TEST_PASSED
>>> ... HugeTLB seek then read HWPOISON test...
>>> ... ... init val=4 with offset=0x102000
>>> ... ... expect to read 0xfe000 bytes of data in total
>>> ... ... actually read 0xfe000 bytes of data in total
>>> ... HugeTLB seek then read HWPOISON test...TEST_PASSED
>>> ...
>>>
>>> [ 2109.209225] Injecting memory failure for pfn 0x3190d01 at process
>>> virtual address 0x7f75e3101000
>>> [ 2109.209438] Memory failure: 0x3190d01: recovery action for huge
>>> page: Recovered
>>> ...
>>>
>>> I think something in between broke MADV_HWPOISON on hugetlbfs, and we
>>> should be able to figure it out via bisection (and of course by
>>> reading delta commits between them, probably related to page
>>> refcount).
>> Thank you for this information.
>>
>>>
>>> That being said, I will be on vacation from tomorrow until the end of
>>> next week. So I will get back to this after next weekend. Meanwhile if
>>> you want to go ahead and bisect the problematic commit, that will be
>>> very much appreciated.
>> I'll try to bisect and post here if I find something.
> Found the culprit commit by bisection:
>
> a08c7193e4f18dc8508f2d07d0de2c5b94cb39a3
> mm/filemap: remove hugetlb special casing in filemap.c

#regzbot title: hugetlbfs hwpoison handling
#regzbot introduced: a08c7193e4f1
#regzbot monitor: https://lore.kernel.org/all/20240111191655.295530-1-sidhartha.kumar@xxxxxxxxxx

>
> hugetlb-read-hwpoison started failing from this patch. I've added the
> author of this patch to this bug report.
>
>>
>>>
>>> Thanks,
>>> Jiaqi
>>>
>>>
>>>>
>>>> Regards,
>>>> Usama
>>>>
>

--
BR,
Muhammad Usama Anjum