Re: [PATCH v3 0/7] hugetlbfs memory HW error fixes

David Hildenbrand <david@xxxxxxxxxx> · Tue, 3 Dec 2024 15:08:00 +0100

On 03.12.24 01:15, William Roche wrote:
On 12/2/24 17:00, David Hildenbrand wrote:
On 02.12.24 16:41, William Roche wrote:
Hello David,

Hi,

sorry for reviewing yet, I was rather sick the last 1.5 weeks.

I hope you get well soon!

Getting there, thanks! :)

I've finally tested many page mapping possibilities and tried to
identify the error injection reaction on these pages to see if mmap()
can be used to recover the impacted area.
I'm using the latest upstream kernel I have for that:
6.12.0-rc7.master.20241117.ol9.x86_64
But I also got similar results with a kernel not supporting
MADV_DONTNEED, for example: 5.15.0-301.163.5.2.el9uek.x86_64

Let's start with mapping a file without modifying the mapped area:
In this case we should have a clean page cache mapped in the process.
If an error is injected on this page, the kernel doesn't even inform the
process about the error as the page is replaced (no matter if the
mapping was shared of not).

The kernel indicates this situation with the following messages:

[10759.371701] Injecting memory failure at pfn 0x10d88e
[10759.374922] Memory failure: 0x10d88e: corrupted page was clean:
dropped without side effects
[10759.377525] Memory failure: 0x10d88e: recovery action for clean LRU
page: Recovered

Right. The reason here is that we can simply allocate a new page and
load data from disk. No corruption.

Now when the page content is modified, in the case of standard page
size, we need to consider a MAP_PRIVATE or MAP_SHARED
- in the case of a MAP_PRIVATE page, this page is corrupted and the
modified data are lost, the kernel will use the SIGBUS mechanism to
inform this process if needed.
     But remapping the area sweeps away the poisoned page, and allows the
process to use the area.

- In the case of a MAP_SHARED page, if the content hasn't been sync'ed
with the file backend, we also loose the modified data, and the kernel
can also raise SIGBUS.
     Remapping the area recreates a page cache from the "on disk" file
content, clearing the error.

In a mmap(MAP_SHARED, fd) region that will also require fallocate IIUC.

I would have expected the same thing, but what I noticed is that in the
case of !hugetlb, even poisoned shared memory seem to be recovered with:
mmap(location, size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, fd, 0)

Let me take a look at your tool below if I can find an explanation of 
what is happening, because it's weird :)

[...]

At the end of this email, I included the source code of a simplistic
test case that shows that the page is replaced in the case of standard
page size.

The idea of this test is simple:

1/ Create a local FILE with:
# dd if=/dev/zero of=./FILE bs=4k count=2
2+0 records in
2+0 records out
8192 bytes (8.2 kB, 8.0 KiB) copied, 0.000337674 s, 24.3 MB/s

2/ As root run:
# ./poisonedShared4k
Mapping 8192 bytes from file FILE
Reading and writing the first 2 pages content:
Read: Read: Wrote: Initial mem page 0
Wrote: Initial mem page 1
Data pages at 0x7f71a19d6000  physically 0x124fb0000
Data pages at 0x7f71a19d7000  physically 0x128ce4000
Poisoning 4k at 0x7f71a19d6000
Signal 7 received
	code 4		Signal code
	addr 0x7f71a19d6000	Memory location
	si_addr_lsb 12
siglongjmp used
Remapping the poisoned page
Reading and writing the first 2 pages content:
Read: Read: Initial mem page 1
Wrote: Rewrite mem page 0
Wrote: Rewrite mem page 1
Data pages at 0x7f71a19d6000  physically 0x10c367000
Data pages at 0x7f71a19d7000  physically 0x128ce4000

   ---

As we can see, this process:
- maps the FILE,
- tries to read and write the beginning of the first 2 pages
- gives their physical addresses
- poison the first page with a madvise(MADV_HWPOISON) call
- shows the SIGBUS signal received and recovers from it
- simply remaps the same page from the file
- tries again to read and write the beginning of the first 2 pages
- gives their physical addresses

Turns out the code will try to truncate the pagecache page using 
mapping->a_ops->error_remove_folio().

That, however, is only implemented on *some* filesystems.

Most prominently, it is not implemented on shmem as well.

So if you run your test with shmem (e.g., /tmp/FILE), it doesn't work.

Using fallocate+MADV_DONTNEED seems to work on shmem.

--
Cheers,

David / dhildenb