Re: [PATCH v1] mm, pagemap: expose hwpoison entry

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Oct 27, 2021 at 03:45:13PM +0900, Naoya Horiguchi wrote:
> On Wed, Oct 27, 2021 at 10:09:03AM +0800, Peter Xu wrote:
> > On Wed, Oct 27, 2021 at 08:27:36AM +0900, Naoya Horiguchi wrote:
> > > On Mon, Oct 04, 2021 at 11:32:28PM +0900, Naoya Horiguchi wrote:
> > > > On Mon, Oct 04, 2021 at 01:55:30PM +0200, David Hildenbrand wrote:
> > > > > On 04.10.21 13:50, Naoya Horiguchi wrote:
> > > ...
> > > > > >
> > > > > > Hwpoison entry for hugepage is also exposed by this patch. The below
> > > > > > example shows how pagemap is visible in the case where a memory error
> > > > > > hit a hugepage mapped to a process.
> > > > > >
> > > > > >      $ ./page-types --no-summary --pid $PID --raw --list --addr 0x700000000+0x400
> > > > > >      voffset offset  len     flags
> > > > > >      700000000       12fa00  1       ___U_______Ma__H_G_________________f_______1
> > > > > >      700000001       12fa01  1ff     ___________Ma___TG_________________f_______1
> > > > > >      700000200       12f800  1       __________B________X_______________f______w_
> > > > > >      700000201       12f801  1       ___________________X_______________f______w_   // memory failure hit this page
> > > > > >      700000202       12f802  1fe     __________B________X_______________f______w_
> > > > > >
> > > > > > The entries with both of "X" flag (hwpoison flag) and "w" flag (swap
> > > > > > flag) are considered as hwpoison entries.  So all pages in 2MB range
> > > > > > are inaccessible from the process.  We can get actual error location
> > > > > > by page-types in physical address mode.
> > > > > >
> > > > > >      $ ./page-types --no-summary --addr 0x12f800+0x200 --raw --list
> > > > > >      offset  len     flags
> > > > > >      12f800  1       __________B_________________________________
> > > > > >      12f801  1       ___________________X________________________
> > > > > >      12f802  1fe     __________B_________________________________
> > > > > >
> > > > > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
> > > > > > ---
> > > > > >   fs/proc/task_mmu.c      | 41 ++++++++++++++++++++++++++++++++---------
> > > > > >   include/linux/swapops.h | 13 +++++++++++++
> > > > > >   tools/vm/page-types.c   |  7 ++++++-
> > > > > >   3 files changed, 51 insertions(+), 10 deletions(-)
> > > > >
> > > > >
> > > > > Please also update the documentation located at
> > > > >
> > > > > Documentation/admin-guide/mm/pagemap.rst
> > > >
> > > > I will do this in the next post.
> > > 
> > > Reading the document, I found that swap type is already exported so we
> > > could identify hwpoison entry with it (without new PM_HWPOISON bit).
> > > One problem is that the format of swap types (like SWP_HWPOISON) depends
> > > on a few config macros like CONFIG_DEVICE_PRIVATE and CONFIG_MIGRATION,
> > > so we also need to export how the swap type field is interpreted.
> > 
> > I had similar question before.. though it was more on the generic swap entries
> > not the special ones yet.
> > 
> > The thing is I don't know how the userspace could interpret normal swap device
> > indexes out of reading pagemap, say if we have two swap devices with "swapon
> > -s" then I've no idea how do we know which device has which swap type index
> > allocated.  That seems to be a similar question asked above on special swap
> > types - the interface seems to be incomplete, if not unused at all.
> > 
> > AFAIU the information on "this page is swapped out to device X on offset Y" is
> > not reliable too, because the pagein/pageout from kernel is transparent to the
> > userspace and not under control of userspace at all.  IOW, if the user reads
> > that swap entry, then reads data upon the disk of that offset out and put it
> > somewhere else, then it means the data read could already be old if kernel
> > paged in the page after userspace reading the pagemap but before it reading the
> > disk, and I don't see any way to make it right unless the userspace could stop
> > the kernel from page-in a swap entry.  That's why I really wonder whether we
> > should expose normal swap entry at all, as I don't know how it could be helpful
> > and used in the 100% right way.
> 
> Thank you for the feedback.
> 
> I think that a process interested in controlling swap-in/out behavior in its own
> typically calls mincore() to get current status and madvise() to trigger swap-in/out.
> That's not 100% solution for the same reason, but it mostly works well because
> calling madvise(MADV_PAGEOUT) to already swapped out is not a big issue (although
> some CPU/memory resource is wasted, but the amount of the waste is small if the
> returned info is new enough). 
> So my point is that the concern around information newness might be more generic
> issue rather than just for pagemap.  If we need 100% accurate in-kernel info,
> maybe it had better be done in kernel (or some cooler stuff like eBPF)?

I fully agree the solution you mentioned with mincore() and madvise(), that is
very sane and working approach.  Though IMHO the major thing I wanted to point
out is for generic swap devices we exposed (disk_index, disk_offset) tuple as
the swap entry (besides "whether this page is swapped out or not"; that's
PM_SWAP, and as you mentioned people'll need to rely on mincore() to make it
right for shmem), though to use it we need to either record the index/offset or
read/write data from it.  However none of them will make sense, IMHO..  So I
think exposing PM_SWAP makes sense, not the swap entries on swap devices.

> 
> > 
> > Special swap entries seem a bit different - at least for is_pfn_swap_entry()
> > typed swap entries we can still expose the PFN which might be helpful, which I
> > can't tell.
> 
> I'm one who think it helpful for testing, although I know testing might not be
> considered as a real usecase.

I think testing is valid use case too.

> 
> > 
> > I used to send an email to Matt Mackall <mpm@xxxxxxxxxxx> and Dave Hansen
> > <dave.hansen@xxxxxxxxxxxxxxx> asking about above but didn't get a reply. Ccing
> > again this time with the list copied.
> > 
> > > 
> > > I thought of adding new interfaces for example under /sys/kernel/mm/swap/type_format/,
> > > which shows info like below (assuming that all CONFIG_{DEVICE_PRIVATE,MIGRATION,MEMORY_FAILURE}
> > > is enabled):
> > > 
> > >   $ ls /sys/kernel/mm/swap/type_format/
> > >   hwpoison
> > >   migration_read
> > >   migration_write
> > >   device_write
> > >   device_read
> > >   device_exclusive_write
> > >   device_exclusive_read
> > >   
> > >   $ cat /sys/kernel/mm/swap/type_format/hwpoison
> > >   25
> > >   
> > >   $ cat /sys/kernel/mm/swap/type_format/device_write
> > >   28
> > > 
> > > Does it make sense or any better approach?
> > 
> > Then I'm wondering whether we care about the rest of the normal swap devices
> > too with pagemap so do we need to expose some information there too (only if
> > there's a real use case, though..)?  Or... should we just don't expose swap
> > entries at all, at least generic swap entries?  We can still expose things like
> > hwpoison via PM_* bits well defined in that case.
> 
> I didn't think about normal swap devices for no reason. I'm OK to stop exposing
> normal swap device part.  I don't have strong option yet about which approach
> (in swaptype or PM_HWPOISON) I'll suggest next (so wait a little more for feedback).

No strong opinion here too. It's just that the new interface proposed reminded
me that it's partially complete if considering we're also exposing swap entries
on swap devices, so the types didn't cover those entries.  However it's more
like a pure question because I never figured out how those entries will work
anyway.  I'd be willing to know whether Dave Hanson would comment on this.

While the PM_HWPOISON approach looks always sane to me.

Thanks,

-- 
Peter Xu





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux