Re: [v5 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly

On Mon, Nov 1, 2021 at 1:11 PM Yang Shi <shy828301@xxxxxxxxx> wrote:
On Mon, Nov 1, 2021 at 12:38 PM Jue Wang <juew@xxxxxxxxxx> wrote:
>
> A related bug, though its fix may belong in a separate series:
>
> split_huge_page fails when invoked concurrently on the same THP.
>
> Multiple memory errors on the same THP can easily be consumed by
> multiple threads, which then all end up in the split_huge_page path.

Yeah, I think this has been a known problem since the very beginning.
THP split requires the caller to pin the page; it then checks whether
the refcount matches the expected value and freezes the refcount if it
does. So if two concurrent paths try to split the same THP, one will
fail due to the pin held by the other path, but the other one will
succeed.
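
To make the race concrete, here is a minimal user-space sketch of the pin-then-freeze check described above. It is a toy model only: struct toy_page and the toy_* helpers are hypothetical names, not kernel APIs, and the interleaving is serialized by hand rather than run on real threads. The assumption is that the freeze succeeds only when the refcount equals the base references plus the caller's own pin.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a THP's refcount; NOT the kernel's struct page. */
struct toy_page {
    atomic_int refcount;
};

static void toy_get_page(struct toy_page *p) { atomic_fetch_add(&p->refcount, 1); }
static void toy_put_page(struct toy_page *p) { atomic_fetch_sub(&p->refcount, 1); }

/* Freeze: set the refcount to 0 only if it equals 'expected' exactly. */
static bool toy_ref_freeze(struct toy_page *p, int expected)
{
    int old = expected;
    return atomic_compare_exchange_strong(&p->refcount, &old, 0);
}

/*
 * Hypothetical split: the caller must already hold one pin.
 * Returns 0 on success, -1 (standing in for -EBUSY) if someone
 * else holds an extra reference.
 */
static int toy_split(struct toy_page *p, int base_refs)
{
    return toy_ref_freeze(p, base_refs + 1) ? 0 : -1;
}

/* Hand-serialized interleaving of two error-handling paths, A and B. */
int demo_concurrent_split(void)
{
    struct toy_page page;
    atomic_init(&page.refcount, 1);   /* one base reference */

    toy_get_page(&page);              /* path A pins the page */
    toy_get_page(&page);              /* path B pins it before A splits */

    int a = toy_split(&page, 1);      /* sees 3, expects 2: fails */
    toy_put_page(&page);              /* A gives up and drops its pin */
    int b = toy_split(&page, 1);      /* sees 2, expects 2: succeeds */

    return (a == -1 && b == 0) ? 0 : 1;
}
```

In this interleaving path A fails only because of B's pin, while B succeeds once A has dropped its own, which matches the one-fails, one-succeeds outcome described above.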

The failed thread gets -EBUSY from memory_failure, and SIGBUS is sent
to the process without context (address, BUS_MCEERR_AR).

This is undesirable for applications that intend to recover from memory
errors.

One possible fix is to recognize such cases and signal properly from
memory_failure.
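
For reference, a rough sketch of what a recovering application typically checks in its SIGBUS handler, and why the missing context matters. The classify_sigbus helper and the enum are hypothetical names for illustration; si_code, si_addr and BUS_MCEERR_AR are the standard Linux siginfo fields and constant.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical classification used by an application's SIGBUS handler. */
enum mce_action {
    MCE_RECOVER_AT_ADDR,  /* action-required error with a usable address */
    MCE_NO_CONTEXT        /* nothing to act on: recovery is not possible */
};

/*
 * Decide what to do from the siginfo delivered with SIGBUS.
 * Recovery (e.g. remapping or discarding the affected range) is only
 * possible when the kernel reports BUS_MCEERR_AR together with the
 * faulting address in si_addr.
 */
static enum mce_action classify_sigbus(const siginfo_t *si)
{
    if (si->si_code == BUS_MCEERR_AR && si->si_addr != NULL)
        return MCE_RECOVER_AT_ADDR;
    return MCE_NO_CONTEXT;
}
```

A SIGBUS that arrives without BUS_MCEERR_AR and an address falls into the second case, so an application written this way has no choice but to abort.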

Off the top of my head, I can't think of a better way to remediate it
than retrying from the very start. We can't simply check whether it is
still a THP, since THP split just moves the refcount pin to the
poisoned subpage, so the retry path would lose the refcount for its
poisoned subpage.

Did you run into this problem in any real production environment? Or
is it just an artificial test case? I'm wondering whether the extra
complexity is worth it.

This can be easily reproduced in artificial test cases.

I wouldn't be surprised if production environments hit this bug.

>
> Thanks,
> -Jue
