Hi Peter,

On Thu, Dec 03, 2020 at 09:30:51PM -0500, Peter Xu wrote:
> I'm just afraid there's no space left for a migration entry, because migration
> entries fills in the pfn information into swp offset field rather than a real
> offset (please refer to make_migration_entry())? I assume PFN can use any bit.
> Or did I miss anything?
>
> I went back to see the original proposal from Hugh:
>
> IIUC you only need a single value, no need to carve out another whole
> swp_type: could probably be swp_offset 0 of any swp_type other than 0.
>
> Hugh/Andrea, sorry if this is a stupid swap question: could you help explain
> why swp_offset=0 won't be used by any swap device? I believe it's correct,
> it's just that I failed to figure out the reason myself. :(
>

Hugh may want to review if I got it wrong, but there are basically three
ways.

swp_type would mean adding one more reserved value in addition to
SWP_MIGRATION_READ and SWP_MIGRATION_WRITE (kind of increasing
SWP_MIGRATION_NUM to 3).

swp_offset = 0 works in combination with SWP_MIGRATION_WRITE and
SWP_MIGRATION_READ if we enforce that pfn 0 is never used by the kernel
(I'd feel safer with pfn value -1UL truncated to the bits of the swp
offset, since the swp_entry format is common code).

The bit I was suggesting is just one more bit like _PAGE_SWP_UFFD_WP
from the pte, one that cannot ever be set in any swp entry today. I
assume it can't be _PAGE_SWP_UFFD_WP since that already can be set but
you may want to verify it...

It'd be set on the pte (not in the swap entry), then it doesn't matter
much what's inside the swp_entry anymore. The pte value would be
generated with this:

	pte_swp_uffd_wp_unmap(swp_entry_to_pte(swp_entry(SWP_MIGRATION_READ, 0)))

(maybe SWP_MIGRATION_READ could also be 0 and then it can be just
enough to set that single bit in the pte and nothing else, all other
bits zero)

We never store a raw swp entry in the pte (the raw swp entry is stored
in the xarray, it's the index of the swapcache).

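To make the pte-bit variant a bit more concrete, here is a rough user-space sketch of how such a marker pte could be built and recognized. It is only a toy model under stated assumptions: the bit layout (5 type bits, 50 offset bits, entry shifted up by 9, marker in bit 3), the TOY_* constants and the value used for SWP_MIGRATION_READ are made up for illustration, pte_swp_uffd_wp_unmap() is only the name proposed above, and pte_swp_is_uffd_wp_unmap() and make_uffd_wp_unmap_pte() are hypothetical helpers. None of this is existing kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy user-space model; every bit position below is an assumption. */
typedef struct { uint64_t val; } swp_entry_t;
typedef struct { uint64_t pte; } pte_t;

#define TOY_SWP_TYPE_BITS	5
#define TOY_SWP_OFFSET_BITS	50
#define TOY_PTE_SWP_SHIFT	9	/* entry sits above the low pte bits */

#define TOY_PAGE_PRESENT		(UINT64_C(1) << 0)
#define TOY_PAGE_SWP_UFFD_WP_UNMAP	(UINT64_C(1) << 3)	/* assumed free */

#define TOY_SWP_MIGRATION_READ	30u	/* placeholder, not the real value */

/* Pack type and offset into the arch-independent entry format. */
static inline swp_entry_t swp_entry(unsigned type, uint64_t offset)
{
	swp_entry_t e;

	assert(type < (1u << TOY_SWP_TYPE_BITS));
	assert(offset < (UINT64_C(1) << TOY_SWP_OFFSET_BITS));
	e.val = ((uint64_t)type << TOY_SWP_OFFSET_BITS) | offset;
	return e;
}

static inline unsigned swp_type(swp_entry_t e)
{
	return (unsigned)(e.val >> TOY_SWP_OFFSET_BITS);
}

static inline uint64_t swp_offset(swp_entry_t e)
{
	return e.val & ((UINT64_C(1) << TOY_SWP_OFFSET_BITS) - 1);
}

/* Arch-specific step: the entry is stored above the low pte bits. */
static inline pte_t swp_entry_to_pte(swp_entry_t e)
{
	pte_t p = { e.val << TOY_PTE_SWP_SHIFT };
	return p;
}

static inline swp_entry_t pte_to_swp_entry(pte_t p)
{
	swp_entry_t e = { p.pte >> TOY_PTE_SWP_SHIFT };
	return e;
}

/* Set the marker on the pte itself, not inside the swp entry. */
static inline pte_t pte_swp_uffd_wp_unmap(pte_t p)
{
	p.pte |= TOY_PAGE_SWP_UFFD_WP_UNMAP;
	return p;
}

/* Hypothetical check: non-present pte carrying the marker bit. */
static inline bool pte_swp_is_uffd_wp_unmap(pte_t p)
{
	return !(p.pte & TOY_PAGE_PRESENT) &&
	       (p.pte & TOY_PAGE_SWP_UFFD_WP_UNMAP);
}

/* The pte value from the expression quoted above. */
static inline pte_t make_uffd_wp_unmap_pte(void)
{
	return pte_swp_uffd_wp_unmap(
		swp_entry_to_pte(swp_entry(TOY_SWP_MIGRATION_READ, 0)));
}
```

In this model the marker lives below the shifted entry, so a regular migration pte never has it set and decoding the entry is unaffected by it. A fault handler would test pte_swp_is_uffd_wp_unmap() first and only decode the pte as a migration entry if the test fails.
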
To solve our unmap issue we only deal with pte storage (no xarray index
storage). This is why it can also be in the arch specific pte
representation of the swp entry, it doesn't need to be a special value
defined in the swp entry common code.

Since the swap entry to pte conversion is arch dependent, such a bit
needs to be defined by each arch (reserving an offset or type value in
the swp entry would solve it in the common code).

#define SWP_OFFSET_FIRST_BIT	(_PAGE_BIT_PROTNONE + 1)

All bits below PROTNONE are available for software use and we use bit 1
(soft dirty) and bit 2 (uffd_wp). The protnone bit 8 itself (global bit)
must not be set or it'll look protnone and pte_present will be true.
Bit 7 is PSE so it's also not available because pte_present checks that
too.

It appears you can pick between bits 3, 4, 5 and 6 at your own choice
and it doesn't look like we're running out of those yet (if we were
there would be a bigger incentive to encode it as part of the swp entry
format). Example:

#define _PAGE_SWP_UFFD_WP_UNMAP	_PAGE_PWT

If that bit is set and pte_present is false, then everything else in
that pte is meaningless and it means uffd wrprotected pte_none.

So in the migration-entry/swapin page fault path, you could go one step
back and check the pte for such a bit: if it's set, it's not a
migration entry.

If there's a read access it should fill the page mark with shmem_fault,
keep the pte wrprotected and then set _PAGE_UFFD_WP on the pte. If
there's a write access it should invoke handle_userfault.

If there's any reason why the swp_entry reservation is simpler that's
ok too, you'll see a lot more details once you try to implement it so
you'll be better able to judge later. I'm greatly simplifying
everything but this is not a simple feat...

Thanks,
Andrea