On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote: > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote: > > [..] > > > > > Hi Jeff, > > > > > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > > > > > a range which is less than a sector. Or which is bigger than a sector > > > > > but start and end are not aligned on sector boundaries. > > > > > > > > Sure, but who will call it with misaligned ranges? > > > > > > create a file foo.txt of size 4K and then truncate it. > > > > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to > > > 4095. > > > > This should fail with EIO. Only full page writes should clear the > > bad page state, and partial writes should therefore fail because > > they do not guarantee the data in the filesystem block is all good. > > > > If this zeroing was a buffered write to an address with a bad > > sector, then the writeback will fail and the user will (eventually) > > get an EIO on the file. > > > > DAX should do the same thing, except because the zeroing is > > synchronous (i.e. done directly by the truncate syscall) we can - > > and should - return EIO immediately. > > > > Indeed, with your code, if we then extend the file by truncating up > > back to 4k, then the range between 23 and 512 is still bad, even > > though we've successfully zeroed it and the user knows it. An > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100) > > will fail with EIO, but reading 10 bytes at offset 2000 will > > succeed. > > Hi Dave, > > What is expected if I do "truncate -s 512 foo.txt". Say first sector (0 to > 511) is poisoned and rest don't have poison. Should this fail with -EIO. Yes - the filesystem block still contains bad data. > In current implementation it does not. I'm not surprised - the whole hardware error handling architecture for FS-DAX is fundamentally broken. It was designed for device-dax, and it just doesn't work for FS-DAX. For example, to get the hardware error handling to be able to kill userspace applications, a 1:1 physical-to-logical association constraint was added to fs/dax.c: /* * TODO: for reflink+dax we need a way to associate a single page with * multiple address_space instances at different linear_page_index() * offsets. */ static void dax_associate_entry(void *entry, struct address_space *mapping, struct vm_area_struct *vma, unsigned long address) because device-dax only has *linear mappings* and so has a fixed reverse mapping architecture. i.e. the whole hardware error handling infrastructure was designed around the constraints of device-dax. device-dax does not having any structure to serialise access to the physical storage, so locking was added to the mapping tree. THe mapping tree locking is accessed on hardware error via the reverse mappingi association in the struct page and that's how device-dax serialises direct physical storage access against hardware error processing. And while the page index is locked in the mapping tree, it can walk the process vmas that have the page mapped to kill them so that they don't try to access the bad page. That bad physical storage state is carried in a volatile struct page flag, hence requiring some mechanism to make it persistent (the device bad blocks register) and some other mechanism to clear the poison state (direct IO, IIRC). It's also a big, nasty, solid roadblock to implementing shared data extents in FS-DAX. We basically have to completely re-architect the hardware error handling for FS-DAX so that such errors are reported to the filesystem first and then the filesystem does what is needed to handle the error. None of this works for filesystems because they need to perform different operations depending on what the page that went bad contains. FS-DAX should never trip over an unexpected poisoned page; we do so now because such physical storage errors are completely hidden form the fielsystem. What you are trying to do is slap a band-aid over what to do when we hit an unexpected page containing bad data. Filesystems expect to find out about bad data in storage when they marshall the data into or out of memory. They make the assumption that once it is in memory it remains valid on the physical storage. Hence if an in-memory error occurs, we can just toss it away and re-read it from storage, and all is good. FS-DAX changes that - we are no longer marshalling data into and out of memory so we don't have a mechanism to get EIO when reading the page into the page cache or writing it back to disk. We also don't have an in-memory copy of the data - the physical storage is the in-memory copy, and so we can't just toss it away when an error occurs. What we actually require is instantaneous notification of physical storage errors so we can handle the error immediately. And that, in turn, means we should never poison or see poisoned pages during direct access operations because the filesystem doesn't need to poison pages to prevent user access - it controls how the storage is accessed directly. e.g. if this error is in filesystem metadata, we might be able to recover from it as that metadata might have a full copy in memory (metadata is buffered in both XFS and ext4) or we might be able to reconstruct it from other sources. Worst case, we have shut the filesystem down completely so the admin can repair the damage the lost page has caused. e.g. The physical page may be located in free space, in which case we don't care and can just zero it so all the bad hardware state is cleared. The page never goes bad or gets poisoned in that case. e.g. The physical page may be user data, in which case it may be buffered in the page cache (non-dax) and so can easily be recovered. It may not be recoverable, in which case we need to issue log messages indicating that data has been lost (path, offset, length), and do the VMA walk and kill processes that map that page. Then we can zero the page to clear the bad state. If, at any point we can't clear the bad state (e.g. the zeroing or the read-back verification fails), then we need to make sure that filesystem block is marked as allocated in the free space map, then tell the reverse map that it's owner is now "bad storage" so it never gets used again. i.e. this is the persistent bad block storage, but doing it this way results in the rest of the filesystem code never, ever seeing a poisoned page. And users never see it either, because it will never be returned to the free space pool. Of course, this relies of the filesystem having reverse mapping capability. XFS already has this funcitonality available as a mkfs option (mkfs.xfs -m rmapbt=1 ...), and we really need this so we can get rid of the association of a physical page with a mapping and file offset that device-dax requires for hardware page error handling. This means we don't need the physical storage to try to hold filesystem layer reverse mapping information for us, and this also removes the roadblock that the hardware error handling has placed on implementing reflink w/ FS-DAX. IOWs, the problem you are trying to solve is a direct result of filesysetms not being informed when a physical pmem page goes bad and the current error handling being implemented at entirely the wrong layer for FS-DAX. It may work for device-dax, but it's most definitely insufficient for correct error handling for filesystems. > Anyway, partial page truncate can't ensure that data in rest of the page can > be read back successfully. Memory can get poison after the write and > hence read after truncate will still fail. Which is where the notification requirement comes in. Yes, we may still get errors on read or write, but if memory at rest goes bad, we want to handle that and correct it ASAP, not wait days or months for something to trip over the poisoned page before we find out about it. > Hence, all we are trying to ensure is that if a poison is known at the > time of writing partial page, then we should return error to user space. I think within FS-DAX infrastructure, any access to the data (read or write) within a poisoned page or a page marked with PageError() should return EIO to the caller, unless it's the specific command to clear the error/poison state on the page. What happens with that error state is then up to the caller. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx