On 27.10.21 09:02, Peter Xu wrote: > On Wed, Oct 27, 2021 at 03:45:13PM +0900, Naoya Horiguchi wrote: >> On Wed, Oct 27, 2021 at 10:09:03AM +0800, Peter Xu wrote: >>> On Wed, Oct 27, 2021 at 08:27:36AM +0900, Naoya Horiguchi wrote: >>>> On Mon, Oct 04, 2021 at 11:32:28PM +0900, Naoya Horiguchi wrote: >>>>> On Mon, Oct 04, 2021 at 01:55:30PM +0200, David Hildenbrand wrote: >>>>>> On 04.10.21 13:50, Naoya Horiguchi wrote: >>>> ... >>>>>>> >>>>>>> Hwpoison entry for hugepage is also exposed by this patch. The below >>>>>>> example shows how pagemap is visible in the case where a memory error >>>>>>> hit a hugepage mapped to a process. >>>>>>> >>>>>>> $ ./page-types --no-summary --pid $PID --raw --list --addr 0x700000000+0x400 >>>>>>> voffset offset len flags >>>>>>> 700000000 12fa00 1 ___U_______Ma__H_G_________________f_______1 >>>>>>> 700000001 12fa01 1ff ___________Ma___TG_________________f_______1 >>>>>>> 700000200 12f800 1 __________B________X_______________f______w_ >>>>>>> 700000201 12f801 1 ___________________X_______________f______w_ // memory failure hit this page >>>>>>> 700000202 12f802 1fe __________B________X_______________f______w_ >>>>>>> >>>>>>> The entries with both of "X" flag (hwpoison flag) and "w" flag (swap >>>>>>> flag) are considered as hwpoison entries. So all pages in 2MB range >>>>>>> are inaccessible from the process. We can get actual error location >>>>>>> by page-types in physical address mode. >>>>>>> >>>>>>> $ ./page-types --no-summary --addr 0x12f800+0x200 --raw --list >>>>>>> offset len flags >>>>>>> 12f800 1 __________B_________________________________ >>>>>>> 12f801 1 ___________________X________________________ >>>>>>> 12f802 1fe __________B_________________________________ >>>>>>> >>>>>>> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx> >>>>>>> --- >>>>>>> fs/proc/task_mmu.c | 41 ++++++++++++++++++++++++++++++++--------- >>>>>>> include/linux/swapops.h | 13 +++++++++++++ >>>>>>> tools/vm/page-types.c | 7 ++++++- >>>>>>> 3 files changed, 51 insertions(+), 10 deletions(-) >>>>>> >>>>>> >>>>>> Please also update the documentation located at >>>>>> >>>>>> Documentation/admin-guide/mm/pagemap.rst >>>>> >>>>> I will do this in the next post. >>>> >>>> Reading the document, I found that swap type is already exported so we >>>> could identify hwpoison entry with it (without new PM_HWPOISON bit). >>>> One problem is that the format of swap types (like SWP_HWPOISON) depends >>>> on a few config macros like CONFIG_DEVICE_PRIVATE and CONFIG_MIGRATION, >>>> so we also need to export how the swap type field is interpreted. >>> >>> I had similar question before.. though it was more on the generic swap entries >>> not the special ones yet. >>> >>> The thing is I don't know how the userspace could interpret normal swap device >>> indexes out of reading pagemap, say if we have two swap devices with "swapon >>> -s" then I've no idea how do we know which device has which swap type index >>> allocated. That seems to be a similar question asked above on special swap >>> types - the interface seems to be incomplete, if not unused at all. >>> >>> AFAIU the information on "this page is swapped out to device X on offset Y" is >>> not reliable too, because the pagein/pageout from kernel is transparent to the >>> userspace and not under control of userspace at all. IOW, if the user reads >>> that swap entry, then reads data upon the disk of that offset out and put it >>> somewhere else, then it means the data read could already be old if kernel >>> paged in the page after userspace reading the pagemap but before it reading the >>> disk, and I don't see any way to make it right unless the userspace could stop >>> the kernel from page-in a swap entry. That's why I really wonder whether we >>> should expose normal swap entry at all, as I don't know how it could be helpful >>> and used in the 100% right way. >> >> Thank you for the feedback. >> >> I think that a process interested in controlling swap-in/out behavior in its own >> typically calls mincore() to get current status and madvise() to trigger swap-in/out. >> That's not 100% solution for the same reason, but it mostly works well because >> calling madvise(MADV_PAGEOUT) to already swapped out is not a big issue (although >> some CPU/memory resource is wasted, but the amount of the waste is small if the >> returned info is new enough). >> So my point is that the concern around information newness might be more generic >> issue rather than just for pagemap. If we need 100% accurate in-kernel info, >> maybe it had better be done in kernel (or some cooler stuff like eBPF)? > > I fully agree the solution you mentioned with mincore() and madvise(), that is > very sane and working approach. Though IMHO the major thing I wanted to point > out is for generic swap devices we exposed (disk_index, disk_offset) tuple as > the swap entry (besides "whether this page is swapped out or not"; that's > PM_SWAP, and as you mentioned people'll need to rely on mincore() to make it > right for shmem), though to use it we need to either record the index/offset or > read/write data from it. However none of them will make sense, IMHO.. So I > think exposing PM_SWAP makes sense, not the swap entries on swap devices. > >> >>> >>> Special swap entries seem a bit different - at least for is_pfn_swap_entry() >>> typed swap entries we can still expose the PFN which might be helpful, which I >>> can't tell. >> >> I'm one who think it helpful for testing, although I know testing might not be >> considered as a real usecase. > > I think testing is valid use case too. > >> >>> >>> I used to send an email to Matt Mackall <mpm@xxxxxxxxxxx> and Dave Hansen >>> <dave.hansen@xxxxxxxxxxxxxxx> asking about above but didn't get a reply. Ccing >>> again this time with the list copied. >>> >>>> >>>> I thought of adding new interfaces for example under /sys/kernel/mm/swap/type_format/, >>>> which shows info like below (assuming that all CONFIG_{DEVICE_PRIVATE,MIGRATION,MEMORY_FAILURE} >>>> is enabled): >>>> >>>> $ ls /sys/kernel/mm/swap/type_format/ >>>> hwpoison >>>> migration_read >>>> migration_write >>>> device_write >>>> device_read >>>> device_exclusive_write >>>> device_exclusive_read >>>> >>>> $ cat /sys/kernel/mm/swap/type_format/hwpoison >>>> 25 >>>> >>>> $ cat /sys/kernel/mm/swap/type_format/device_write >>>> 28 >>>> >>>> Does it make sense or any better approach? >>> >>> Then I'm wondering whether we care about the rest of the normal swap devices >>> too with pagemap so do we need to expose some information there too (only if >>> there's a real use case, though..)? Or... should we just don't expose swap >>> entries at all, at least generic swap entries? We can still expose things like >>> hwpoison via PM_* bits well defined in that case. >> >> I didn't think about normal swap devices for no reason. I'm OK to stop exposing >> normal swap device part. I don't have strong option yet about which approach >> (in swaptype or PM_HWPOISON) I'll suggest next (so wait a little more for feedback). > > No strong opinion here too. It's just that the new interface proposed reminded > me that it's partially complete if considering we're also exposing swap entries > on swap devices, so the types didn't cover those entries. However it's more > like a pure question because I never figured out how those entries will work > anyway. I'd be willing to know whether Dave Hanson would comment on this. > > While the PM_HWPOISON approach looks always sane to me. I consider that somehow cleaner, because how HWPOISON entries are implemented ("fake swap entries") is somewhat an internal implementation detail. (I also agree that PM_SWAP makes sense, but maybe really only when we're actually dealing with something that has been/is currently being swapped out. Maybe we should just not expose fake swap entries via PM_SWAP and instead use proper PM_ types for that. PM_MIGRATION, PM_HWPOISON, ...) -- Thanks, David / dhildenb