On 10/11/2024 12:44 PM, Luck, Tony wrote:
Something like this: by way of userfaultfd, the kernel provides a
new/clean hugetlb page, copies over the good data from the clean
subpages, and then presents the clean hugetlb page to the user process
with an indication that subpage x is a substitute for the poisoned old
subpage x, hence its data might need a refill? I am not sure exactly how
to pull this through, as the event is not a page fault, but I am just
wondering whether something like this is possible.
This requires serious levels of sophistication from the application.
If some thread still accesses the "lost" data, there's no signal that
anything went wrong. It just reads whatever data the kernel filled the
poisoned area with. For some applications there might be some
data pattern that would help track this down. But no general answer.
Is it possible to rely on mf_mutex to hold off subsequent threads
accessing the poisoned spot until the first poison event has been
handled and the page replaced by a joint effort of the application and
the kernel? I mean, until the poisoned page is removed from the page
table, other threads accessing it would hit an MCE, right?
On the plus side, the amount of "lost" data need not be a page.
On Intel the poison unit is a cache line (64 bytes). So more of the
original data can potentially be preserved. This might be useful
for applications using regular pages as well as those using huge pages.
That requires the kernel to provide a finer-grained SIGBUS payload, such
as an untrimmed vaddr and si_addr_lsb=6.
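
To make the payload concrete, here is a rough userspace sketch (Linux,
glibc) of how an application might turn si_addr and si_addr_lsb into the
byte range it has to treat as lost. The names poison_range_from(),
sigbus_handler() and install_sigbus_handler() are made up for
illustration. The handler only records the range; a real application
would still have to avoid re-executing the faulting access, which is not
shown here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the byte range reported as lost. */
struct poison_range {
    uintptr_t start;
    size_t len;
};

/*
 * si_addr_lsb is the least significant valid bit of si_addr, so the
 * lost region is 2^lsb bytes, aligned to 2^lsb. lsb=6 gives one 64-byte
 * cache line, lsb=12 gives one 4K page.
 */
static struct poison_range poison_range_from(uintptr_t addr, unsigned lsb)
{
    struct poison_range r;

    assert(lsb < sizeof(uintptr_t) * 8);
    r.len = (size_t)1 << lsb;
    r.start = addr & ~((uintptr_t)r.len - 1);
    return r;
}

static volatile sig_atomic_t poison_seen;
static struct poison_range last_poison;

/* Record the range for memory-failure SIGBUS codes only. */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)ctx;
    if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
        last_poison = poison_range_from((uintptr_t)si->si_addr,
                                        (unsigned)si->si_addr_lsb);
        poison_seen = 1;
    }
}

/* Returns 0 on success, -1 on failure (same as sigaction()). */
static int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

With si_addr_lsb=6 the application would only need to refill 64 bytes
around the reported address; with a page-granular value it has to assume
the whole page is gone.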
When Linux first implemented recovery, we had hopes that applications
like databases would be able to implement their own recovery. Losing
a whole page turned out to be problematic, as in some implementations
the metadata for a database entry was stored at the start of the memory
block. So the SIGBUS would provide the virtual address, but it wasn't
of any practical use in determining which data structure(s) were
affected without some massive restructuring of the code to separate
metadata from data.
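
The metadata problem can be illustrated with a small sketch. The layout
below is purely hypothetical (BLOCK_SIZE, HDR_SIZE and classify_damage()
are made-up names, not taken from any real database): each block starts
with a header, and the question is whether a lost range takes out that
header or only payload bytes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 8K blocks, 128 bytes of metadata at the start. */
#define BLOCK_SIZE 8192u
#define HDR_SIZE   128u

enum damage { DAMAGE_NONE, DAMAGE_PAYLOAD, DAMAGE_HEADER };

/* True if [a, a+alen) and [b, b+blen) share at least one byte. */
static bool overlaps(uintptr_t a, size_t alen, uintptr_t b, size_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Classify what a lost range [lost, lost+lost_len) does to one block. */
static enum damage classify_damage(uintptr_t block, uintptr_t lost,
                                   size_t lost_len)
{
    if (!overlaps(block, BLOCK_SIZE, lost, lost_len))
        return DAMAGE_NONE;
    if (overlaps(block, HDR_SIZE, lost, lost_len))
        return DAMAGE_HEADER;   /* metadata gone, entry unusable */
    return DAMAGE_PAYLOAD;      /* metadata intact, data needs a refill */
}
```

Under this assumed layout, losing the first 4K page of a block always
destroys the header, while a 64-byte loss only does so when it lands in
the first 128 bytes.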
-Tony
-jane