On Thu, Mar 18, 2021 at 7:18 PM ruansy.fnst@xxxxxxxxxxx <ruansy.fnst@xxxxxxxxxxx> wrote: > > > > > -----Original Message----- > > From: ruansy.fnst@xxxxxxxxxxx <ruansy.fnst@xxxxxxxxxxx> > > Subject: RE: [PATCH v3 01/11] pagemap: Introduce ->memory_failure() > > > > > > > > > > > > > > After the conversation with Dave I don't see the point of this. > > > > > > > If there is a memory_failure() on a page, why not just call > > > > > > > memory_failure()? That already knows how to find the inode and > > > > > > > the filesystem can be notified from there. > > > > > > > > > > > > We want memory_failure() supports reflinked files. In this > > > > > > case, we are not able to track multiple files from a page(this > > > > > > broken > > > > > > page) because > > > > > > page->mapping,page->index can only track one file. Thus, I > > > > > > page->introduce this > > > > > > ->memory_failure() implemented in pmem driver, to call > > > > > > ->->corrupted_range() > > > > > > upper level to upper level, and finally find out files who are > > > > > > using(mmapping) this page. > > > > > > > > > > > > > > > > I know the motivation, but this implementation seems backwards. > > > > > It's already the case that memory_failure() looks up the > > > > > address_space associated with a mapping. From there I would expect > > > > > a new 'struct address_space_operations' op to let the fs handle > > > > > the case when there are multiple address_spaces associated with a given > > file. > > > > > > > > > > > > > Let me think about it. In this way, we > > > > 1. associate file mapping with dax page in dax page fault; > > > > > > I think this needs to be a new type of association that proxies the > > > representation of the reflink across all involved address_spaces. > > > > > > > 2. iterate files reflinked to notify `kill processes signal` by the > > > > new address_space_operation; > > > > 3. re-associate to another reflinked file mapping when unmmaping > > > > (rmap qeury in filesystem to get the another file). > > > > > > Perhaps the proxy object is reference counted per-ref-link. It seems > > > error prone to keep changing the association of the pfn while the reflink is > > in-tact. > > Hi, Dan > > > > I think my early rfc patchset was implemented in this way: > > - Create a per-page 'dax-rmap tree' to store each reflinked file's (mapping, > > offset) when causing dax page fault. > > - Mount this tree on page->zone_device_data which is not used in fsdax, so > > that we can iterate reflinked file mappings in memory_failure() easily. > > In my understanding, the dax-rmap tree is the proxy object you mentioned. If > > so, I have to say, this method was rejected. Because this will cause huge > > overhead in some case that every dax page have one dax-rmap tree. > > > > Hi, Dan > > How do you think about this? I am still confused. Could you give me some advice? So I think the primary driver of this functionality is dax-devices and the architectural model for memory failure where several architectures and error handlers know how to route pfn failure to the memory_failure() frontend. Compare that to block-devices where sector failure has no similar framework, and despite some initial interest about reusing 'struct badblocks' for this type of scenario there has been no real uptake to expand 'struct badblocks' outside of the pmem driver. I think the work you have done for ->corrupted_range() just needs to be repurposed away from a block-device operation to dax-device infrastructure. Christoph's pushback on extending block_device_operations makes sense to me because there is likely no other user of this facility than the pmem driver, and the pmem driver only needs it for the vestigial reason that filesystems mount on block-devices and not dax-devices. Recently Dave drove home the point that a filesystem can't do anything with pfns, it needs LBAs. A dax-device does not have LBA's, but it does operate on the concept of device-relative offsets. The filesystem is allowed to assume that dax-device:PFN[device_byte_offset >> PAGE_SHIFT] aliases the same data as the associated block-device:LBA[device_byte_offset >> SECTOR_SHIFT]. He also reiterated that this interface should be range based, which you already had, but I did not include in my attempt to communicate the mass failure of an entire surprise-removed device. So I think the path forward is: - teach memory_failure() to allow for ranged failures - let interested drivers register for memory failure events via a blocking_notifier_head - teach memory_failure() to optionally let the notifier chain claim the event vs its current default of walking page->mapping - teach the pmem driver to register for memory_failure() events and filter the ones that apply to pfns that the driver owns - drop the nfit driver's usage of the mce notifier chain since memory_failure() is a superset of what the mce notifier communicates - augment the pmem driver's view of badblocks that it gets from address range scrub with one's it gets from memory_failure() events - when pmem handles a memory_failure() event or an address range scrub event fire a new event on a new per-dax-device blocking_notifier_head indicating the dax-relative offset ranges of the translated PFNs. This notification can optionally indicate failure, offline (for removal), and online (for repaired ranges). - teach dm to receive dax-device notifier events from its leaf devices and then translate them into dax-device notifications relative to the dm-device offset. This would seem to be a straightforward conversion of what you have done with ->corrupted_range() - teach filesystems to register for dax-device notifiers With all of that in place an interested filesystem can take ownership of a memory failure that impacts a range of pfns it is responsible for via a dax-device, but it also allows a not interested filesystem to default to standard single-pfn-at-a-time error handling and assumptions about page->mapping only referring to a single address space. This obviously does not solve Dave's desire to get this type of error reporting on block_devices, but I think there's nothing stopping a parallel notifier chain from being created for block-devices, but that's orthogonal to requirements and capabilities provided by dax-devices.