[ add Mauro and Tony for RAS discussion ]

On Wed, Apr 6, 2022 at 1:39 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> On Tue, Apr 05, 2022 at 06:22:48PM -0700, Dan Williams wrote:
> > On Tue, Apr 5, 2022 at 5:55 PM Jane Chu <jane.chu@xxxxxxxxxx> wrote:
> > >
> > > On 3/30/2022 9:18 AM, Darrick J. Wong wrote:
> > > > On Wed, Mar 30, 2022 at 08:49:29AM -0700, Christoph Hellwig wrote:
> > > >> On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
> > > >>> As the code I pasted before, pmem driver will subtract its ->data_offset,
> > > >>> which is byte-based. And the filesystem who implements ->notify_failure()
> > > >>> will calculate the offset in unit of byte again.
> > > >>>
> > > >>> So, leave its function signature byte-based, to avoid repeated conversions.
> > > >>
> > > >> I'm actually fine either way, so I'll wait for Dan to comment.
> > > >
> > > > FWIW I'd convinced myself that the reason for using byte units is to
> > > > make it possible to reduce the pmem failure blast radius to subpage
> > > > units... but then I've also been distracted for months. :/
> > > >
> > >
> > > Yes, thanks Darrick! I recall that.
> > > Maybe just add a comment about why byte unit is used?
> >
> > I think we start with page failure notification and then figure out
> > how to get finer grained through the dax interface in follow-on
> > changes. Otherwise, for finer grained error handling support,
> > memory_failure() would also need to be converted to stop upcasting
> > cache-line granularity to page granularity failures. The native MCE
> > notification communicates a 'struct mce' that can be in terms of
> > sub-page bytes, but the memory management implications are all page
> > based. I assume the FS implications are all FS-block-size based?
>
> I wouldn't necessarily make that assumption -- for regular files, the
> user program is in a better position to figure out how to reset the file
> contents.
>
> For fs metadata, it really depends.
> In principle, if (say) we could get
> byte granularity poison info, we could look up the space usage within
> the block to decide if the poisoned part was actually free space, in
> which case we can correct the problem by (re)zeroing the affected bytes
> to clear the poison.
>
> Obviously, if the blast radius hits the internal space info or something
> that was storing useful data, then you'd have to rebuild the whole block
> (or the whole data structure), but that's not necessarily a given.

tl;dr: dax_holder_notify_failure() != fs->notify_failure()

So I think I see some confusion between what DAX->notify_failure()
needs, what memory_failure() needs, the raw information provided by the
hardware, and the failure granularity the filesystem can make use of.

DAX and memory_failure() need to make immediate page granularity
decisions. They both need to map out whole pages (in the direct map
and userspace respectively) to prevent future poison consumption, at
least until the poison is repaired.

The event that leads to a page being failed can be triggered by a
hardware error as small as an individual cacheline. While that is
interesting to a filesystem, it isn't information that memory_failure()
and DAX can utilize.

The reason DAX needs to have a callback into filesystem code is to map
the page failure back to all the processes that might have that page
mapped, because reflink means that page->mapping is not sufficient to
find all the affected 'struct address_space' instances. So it's more
of an address-translation / "help me kill processes" service than a
general failure notification service.

Currently, when a raw hardware event happens, there are mechanisms such
as arch-specific notifier chains, like powerpc::mce_register_notifier()
and x86::mce_register_decode_chain(), or other platform firmware code
like ghes_edac_report_mem_error(), that uplevel the error to a coarse
page granularity failure, while emitting the fine granularity error
event to userspace.
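
To make the "uplevel to page granularity" step concrete, here is a
minimal userspace sketch, not kernel code. It assumes 4 KiB pages, and
every name in it (struct sketch_hw_error, struct sketch_page_failure,
sketch_upcast_to_pages()) is hypothetical and exists only for
illustration. It shows how a cacheline-sized error report collapses
into a whole-page failure once it is expressed in pfn terms:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. Real code takes this from the architecture. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical fine-grained report, e.g. a single poisoned cacheline. */
struct sketch_hw_error {
	uint64_t phys_addr;	/* first poisoned byte (physical address) */
	uint64_t len;		/* number of poisoned bytes, must be > 0 */
};

/* Hypothetical coarse result: the whole pages that must be mapped out. */
struct sketch_page_failure {
	uint64_t first_pfn;
	uint64_t nr_pages;
};

/* Round a byte range outward to the page frames that contain it. */
static struct sketch_page_failure
sketch_upcast_to_pages(struct sketch_hw_error err)
{
	struct sketch_page_failure pf;
	uint64_t last_pfn;

	assert(err.len > 0);

	pf.first_pfn = err.phys_addr >> SKETCH_PAGE_SHIFT;
	last_pfn = (err.phys_addr + err.len - 1) >> SKETCH_PAGE_SHIFT;
	pf.nr_pages = last_pfn - pf.first_pfn + 1;
	return pf;
}
```

Note that the sub-page offset and length are simply gone after the
conversion, which is why a consumer that only ever sees pfns cannot
recover the fine-grained information.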

All of this to say that the interface to ask the fs to do the bottom
half of memory_failure() (walking affected 'struct address_space'
instances and killing processes (mf_dax_kill_procs())) is different
than the general interface to tell the filesystem that memory has gone
bad relative to a device. So if the only caller of
fs->notify_failure() handler is this code:

+       if (pgmap->ops->memory_failure) {
+               rc = pgmap->ops->memory_failure(pgmap, PFN_PHYS(pfn), PAGE_SIZE,
+                               flags);

...then you'll never get fine-grained reports. So, I still think the
DAX, pgmap and memory_failure() interface should be pfn based. The
interface to the *filesystem* ->notify_failure() can still be
byte-based, but the trigger for that byte based interface will likely
need to be something driven by another agent. Perhaps like rasdaemon
in userspace translating all the arch specific physical address events
back into device-relative offsets and then calling a new ABI that is
serviced by fs->notify_failure() on the backend.
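
As a rough illustration of the translation such an agent would have to
do, here is a userspace sketch under stated assumptions. The region
layout (a physical base, a metadata area of data_offset bytes at the
start, then data) mirrors the ->data_offset subtraction mentioned
earlier in the thread, but struct sketch_region and
sketch_phys_to_dev_offset() are hypothetical names, and no such ABI
exists today:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one pmem region, for illustration only. */
struct sketch_region {
	uint64_t phys_base;	/* physical address where the region starts */
	uint64_t data_offset;	/* bytes of metadata before the data area */
	uint64_t size;		/* total region size in bytes */
};

/*
 * Translate a poisoned physical byte range into a byte offset relative
 * to the start of the data area. Returns false if the range is empty
 * or not fully inside the data area.
 */
static bool sketch_phys_to_dev_offset(const struct sketch_region *r,
				      uint64_t phys, uint64_t len,
				      uint64_t *offset)
{
	uint64_t data_start, data_end;

	assert(r != NULL && offset != NULL);

	data_start = r->phys_base + r->data_offset;
	data_end = r->phys_base + r->size;

	if (len == 0 || phys < data_start || phys >= data_end)
		return false;
	if (len > data_end - phys)
		return false;

	*offset = phys - data_start;
	return true;
}
```

The byte offset and length that come out of this are what a byte-based
fs->notify_failure() could consume without losing the cacheline
granularity of the original event.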