Re: [PATCH v1] mm, pagemap: expose hwpoison entry

David Hildenbrand <david@xxxxxxxxxx> · Wed, 27 Oct 2021 09:15:35 +0200

On 27.10.21 09:02, Peter Xu wrote:
> On Wed, Oct 27, 2021 at 03:45:13PM +0900, Naoya Horiguchi wrote:
>> On Wed, Oct 27, 2021 at 10:09:03AM +0800, Peter Xu wrote:
>>> On Wed, Oct 27, 2021 at 08:27:36AM +0900, Naoya Horiguchi wrote:
>>>> On Mon, Oct 04, 2021 at 11:32:28PM +0900, Naoya Horiguchi wrote:
>>>>> On Mon, Oct 04, 2021 at 01:55:30PM +0200, David Hildenbrand wrote:
>>>>>> On 04.10.21 13:50, Naoya Horiguchi wrote:
>>>> ...
>>>>>>>
>>>>>>> Hwpoison entry for hugepage is also exposed by this patch. The below
>>>>>>> example shows how pagemap is visible in the case where a memory error
>>>>>>> hit a hugepage mapped to a process.
>>>>>>>
>>>>>>>      $ ./page-types --no-summary --pid $PID --raw --list --addr 0x700000000+0x400
>>>>>>>      voffset offset  len     flags
>>>>>>>      700000000       12fa00  1       ___U_______Ma__H_G_________________f_______1
>>>>>>>      700000001       12fa01  1ff     ___________Ma___TG_________________f_______1
>>>>>>>      700000200       12f800  1       __________B________X_______________f______w_
>>>>>>>      700000201       12f801  1       ___________________X_______________f______w_   // memory failure hit this page
>>>>>>>      700000202       12f802  1fe     __________B________X_______________f______w_
>>>>>>>
>>>>>>> The entries with both of "X" flag (hwpoison flag) and "w" flag (swap
>>>>>>> flag) are considered as hwpoison entries.  So all pages in 2MB range
>>>>>>> are inaccessible from the process.  We can get actual error location
>>>>>>> by page-types in physical address mode.
>>>>>>>
>>>>>>>      $ ./page-types --no-summary --addr 0x12f800+0x200 --raw --list
>>>>>>>      offset  len     flags
>>>>>>>      12f800  1       __________B_________________________________
>>>>>>>      12f801  1       ___________________X________________________
>>>>>>>      12f802  1fe     __________B_________________________________
>>>>>>>
>>>>>>> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
>>>>>>> ---
>>>>>>>   fs/proc/task_mmu.c      | 41 ++++++++++++++++++++++++++++++++---------
>>>>>>>   include/linux/swapops.h | 13 +++++++++++++
>>>>>>>   tools/vm/page-types.c   |  7 ++++++-
>>>>>>>   3 files changed, 51 insertions(+), 10 deletions(-)
>>>>>>
>>>>>>
>>>>>> Please also update the documentation located at
>>>>>>
>>>>>> Documentation/admin-guide/mm/pagemap.rst
>>>>>
>>>>> I will do this in the next post.
>>>>
>>>> Reading the document, I found that swap type is already exported so we
>>>> could identify hwpoison entry with it (without new PM_HWPOISON bit).
>>>> One problem is that the format of swap types (like SWP_HWPOISON) depends
>>>> on a few config macros like CONFIG_DEVICE_PRIVATE and CONFIG_MIGRATION,
>>>> so we also need to export how the swap type field is interpreted.
>>>
>>> I had similar question before.. though it was more on the generic swap entries
>>> not the special ones yet.
>>>
>>> The thing is I don't know how the userspace could interpret normal swap device
>>> indexes out of reading pagemap, say if we have two swap devices with "swapon
>>> -s" then I've no idea how do we know which device has which swap type index
>>> allocated.  That seems to be a similar question asked above on special swap
>>> types - the interface seems to be incomplete, if not unused at all.
>>>
>>> AFAIU the information on "this page is swapped out to device X on offset Y" is
>>> not reliable too, because the pagein/pageout from kernel is transparent to the
>>> userspace and not under control of userspace at all.  IOW, if the user reads
>>> that swap entry, then reads data upon the disk of that offset out and put it
>>> somewhere else, then it means the data read could already be old if kernel
>>> paged in the page after userspace reading the pagemap but before it reading the
>>> disk, and I don't see any way to make it right unless the userspace could stop
>>> the kernel from page-in a swap entry.  That's why I really wonder whether we
>>> should expose normal swap entry at all, as I don't know how it could be helpful
>>> and used in the 100% right way.
>>
>> Thank you for the feedback.
>>
>> I think that a process interested in controlling swap-in/out behavior in its own
>> typically calls mincore() to get current status and madvise() to trigger swap-in/out.
>> That's not 100% solution for the same reason, but it mostly works well because
>> calling madvise(MADV_PAGEOUT) to already swapped out is not a big issue (although
>> some CPU/memory resource is wasted, but the amount of the waste is small if the
>> returned info is new enough). 
>> So my point is that the concern around information newness might be more generic
>> issue rather than just for pagemap.  If we need 100% accurate in-kernel info,
>> maybe it had better be done in kernel (or some cooler stuff like eBPF)?
> 
> I fully agree the solution you mentioned with mincore() and madvise(), that is
> very sane and working approach.  Though IMHO the major thing I wanted to point
> out is for generic swap devices we exposed (disk_index, disk_offset) tuple as
> the swap entry (besides "whether this page is swapped out or not"; that's
> PM_SWAP, and as you mentioned people'll need to rely on mincore() to make it
> right for shmem), though to use it we need to either record the index/offset or
> read/write data from it.  However none of them will make sense, IMHO..  So I
> think exposing PM_SWAP makes sense, not the swap entries on swap devices.
> 
>>
>>>
>>> Special swap entries seem a bit different - at least for is_pfn_swap_entry()
>>> typed swap entries we can still expose the PFN which might be helpful, which I
>>> can't tell.
>>
>> I'm one who think it helpful for testing, although I know testing might not be
>> considered as a real usecase.
> 
> I think testing is valid use case too.
> 
>>
>>>
>>> I used to send an email to Matt Mackall <mpm@xxxxxxxxxxx> and Dave Hansen
>>> <dave.hansen@xxxxxxxxxxxxxxx> asking about above but didn't get a reply. Ccing
>>> again this time with the list copied.
>>>
>>>>
>>>> I thought of adding new interfaces for example under /sys/kernel/mm/swap/type_format/,
>>>> which shows info like below (assuming that all CONFIG_{DEVICE_PRIVATE,MIGRATION,MEMORY_FAILURE}
>>>> is enabled):
>>>>
>>>>   $ ls /sys/kernel/mm/swap/type_format/
>>>>   hwpoison
>>>>   migration_read
>>>>   migration_write
>>>>   device_write
>>>>   device_read
>>>>   device_exclusive_write
>>>>   device_exclusive_read
>>>>   
>>>>   $ cat /sys/kernel/mm/swap/type_format/hwpoison
>>>>   25
>>>>   
>>>>   $ cat /sys/kernel/mm/swap/type_format/device_write
>>>>   28
>>>>
>>>> Does it make sense or any better approach?
>>>
>>> Then I'm wondering whether we care about the rest of the normal swap devices
>>> too with pagemap so do we need to expose some information there too (only if
>>> there's a real use case, though..)?  Or... should we just don't expose swap
>>> entries at all, at least generic swap entries?  We can still expose things like
>>> hwpoison via PM_* bits well defined in that case.
>>
>> I didn't think about normal swap devices for no reason. I'm OK to stop exposing
>> normal swap device part.  I don't have strong option yet about which approach
>> (in swaptype or PM_HWPOISON) I'll suggest next (so wait a little more for feedback).
> 
> No strong opinion here too. It's just that the new interface proposed reminded
> me that it's partially complete if considering we're also exposing swap entries
> on swap devices, so the types didn't cover those entries.  However it's more
> like a pure question because I never figured out how those entries will work
> anyway.  I'd be willing to know whether Dave Hanson would comment on this.
> 
> While the PM_HWPOISON approach looks always sane to me.

I consider that somehow cleaner, because how HWPOISON entries are
implemented ("fake swap entries") is somewhat an internal implementation
detail.

(I also agree that PM_SWAP makes sense, but maybe really only when we're
actually dealing with something that has been/is currently being swapped
out. Maybe we should just not expose fake swap entries via PM_SWAP and
instead use proper PM_ types for that. PM_MIGRATION, PM_HWPOISON, ...)

-- 
Thanks,

David / dhildenb