On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > it points to, check if it points to the PMEM that is being removed,
> > > grab the page it points to, map that to the relevant struct page,
> > > run collect_procs() on that page, then kill the user processes that
> > > map that page.
> > >
> > > So why can't we walk the ptes, check the physical pages that they
> > > map to, and if they map to a pmem page we go poison that page,
> > > and that kills any user process that maps it.
> > >
> > > i.e. I can't see how unexpected pmem device unplug is any different
> > > to an MCE delivering a hwpoison event to a DAX mapped page.
> >
> > I guess the tradeoff is walking a long list of inodes vs walking a
> > large array of pages.
>
> Not really. You're assuming all a filesystem has to do is invalidate
> everything if a device goes away, and that's not true. Finding if an
> inode has a mapping that spans a specific device in a multi-device
> filesystem can be a lot more complex than that. Just walking inodes
> is easy - determining which inodes need invalidation is the hard
> part.

That inode-to-device level of specificity is not needed for the same
reason that drop_caches does not need to be specific. If the wrong
page is unmapped, a re-fault will bring it back, and the re-fault
will fail for the pages that were successfully removed.

> That's where ->corrupt_range() comes in - the filesystem is already
> set up to do reverse mapping from physical range to inode(s)
> offsets...

Sure, but what is the need to get to that level of specificity with
the filesystem for something that should rarely happen in the course
of normal operation outside of a mistake?

> > There's likely always more pages than inodes, but perhaps it's more
> > efficient to walk the 'struct page' array than sb->s_inodes?
>
> I really don't see it - you seem to be telling us that invalidation
> is an either/or choice. There's more ways to convert physical block
> address -> inode file offset and mapping index than brute force
> inode cache walks....

Yes, but I was trying to map it to an existing mechanism, and the
internals of drop_pagecache_sb() are, in coarse terms, close to what
needs to happen here.
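To make that last point concrete, here is a rough sketch (untested,
and the function name is made up by me) of what such a walk could
look like, modeled directly on drop_pagecache_sb() in
fs/drop_caches.c but with invalidate_mapping_pages() swapped for
unmap_mapping_range() so that ptes get zapped instead of page cache
dropped:

/*
 * Sketch only: walk every inode on a super_block and zap its DAX
 * mappings, in the style of drop_pagecache_sb(). A subsequent
 * re-fault either re-establishes the pte or fails because the pmem
 * backing it is gone.
 */
static void zap_dax_mappings_sb(struct super_block *sb, void *unused)
{
        struct inode *inode, *toput_inode = NULL;

        spin_lock(&sb->s_inode_list_lock);
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                spin_lock(&inode->i_lock);
                /* skip inodes being torn down, or that are not DAX */
                if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) ||
                    !IS_DAX(inode)) {
                        spin_unlock(&inode->i_lock);
                        continue;
                }
                __iget(inode);
                spin_unlock(&inode->i_lock);
                spin_unlock(&sb->s_inode_list_lock);

                /* zap all ptes for this inode: offset 0 to EOF, even cows */
                unmap_mapping_range(inode->i_mapping, 0, 0, 1);

                iput(toput_inode);
                toput_inode = inode;
                cond_resched();
                spin_lock(&sb->s_inode_list_lock);
        }
        spin_unlock(&sb->s_inode_list_lock);
        iput(toput_inode);
}

Something shaped like that would be driven from the removal
notification via iterate_supers(), and the re-fault path then sorts
out which pages come back and which are really gone.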
> .....
>
> > > IOWs, what needs to happen at this point is very filesystem
> > > specific. Assuming that "device unplug == filesystem dead" is not
> > > correct, nor is specifying a generic action that assumes the
> > > filesystem is dead because a device it is using went away.
> >
> > Ok, I think I set this discussion in the wrong direction by implying
> > any mapping of this action to a "filesystem dead" event. It's just a
> > "zap all ptes" event and upper layers recover from there.
>
> Yes, that's exactly what ->corrupt_range() is intended for. It
> allows the filesystem to lock out access to the bad range
> and then recover the data. Or metadata, if that's where the bad
> range lands. If that recovery fails, it can then report a data
> loss/filesystem shutdown event to userspace and kill user procs that
> span the bad range...
>
> FWIW, is this notification going to occur before or after the device
> has been physically unplugged?

Before. These will be operations that happen in the pmem driver
->remove() callback.

> i.e. what do we do about the
> time-of-unplug-to-time-of-invalidation window where userspace can
> still attempt to access the missing pmem through the
> not-yet-invalidated ptes? It may not be likely that people just yank
> pmem nvdimms out of machines, but with NVMe persistent memory
> spaces, there's every chance that someone pulls the wrong device...

The physical removal aspect is only theoretical today; while the pmem
driver has a ->remove() path, it is currently a purely software
unbind operation. That said, the vulnerability window today is if a
process acquires a dax mapping, the pmem device hosting that
filesystem goes through an unbind / bind cycle, and then a new
filesystem is created / mounted. That old pte may be able to access
data that is outside its intended protection domain.

Going forward, for buses like CXL, there will be a managed physical
remove operation via PCIe native hotplug. The flow there is that the
PCIe hotplug driver will notify the OS of a pending removal, trigger
->remove() on the pmem driver, and then notify the technician (slot
status LED) that the card is safe to pull.
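To sketch where that notification sits in the unbind path (again a
sketch, not a patch - nd_pmem_remove() is the driver's existing
->remove() callback in drivers/nvdimm/pmem.c, but dax_zap_mappings()
is a made-up placeholder for whatever ends up doing the invalidation,
e.g. the inode walk above):

/*
 * Sketch of the unbind ordering: zap ptes first, tear down the
 * device second, so no live pte is left pointing at missing pmem.
 */
static int nd_pmem_remove(struct device *dev)
{
        struct pmem_device *pmem = dev_get_drvdata(dev);

        /* 1. hypothetical hook: zap all DAX ptes spanning this device */
        dax_zap_mappings(pmem->dax_dev);

        /* 2. only now unwind the block device and its address range */
        del_gendisk(pmem->disk);
        put_disk(pmem->disk);

        return 0;
}

The ordering is the whole point: the invalidation completes while the
driver is still bound, before the device's address range is torn
down, which is what closes that window for the managed-removal case.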