From: Darrick J. Wong <djwong@xxxxxxxxxx>

Our current "advice" to people using persistent memory and FSDAX who
wish to recover upon receipt of a media error (aka 'hwpoison') event
from ACPI is to punch-hole that part of the file and then pwrite it,
which will magically cause the pmem to be reinitialized and the poison
to be cleared.

Punching doesn't make any sense at all -- the (re)allocation on pwrite
does not permit the caller to specify where to find blocks, which means
that we might not get the same pmem back.  This pushes the user farther
away from the goal of reinitializing poisoned memory and leads to
complaints about unnecessary file fragmentation.

AFAICT, the only reason why the "punch and write" dance works at all is
that XFS and ext4 currently call blkdev_issue_zeroout when allocating
pmem ahead of a write call.  Even a regular overwrite won't clear the
poison, because dax_direct_access is smart enough to bail out on
poisoned pmem, but not smart enough to clear it.  To be fair, that
function maps pages and has no idea what kinds of reads and writes the
caller might want to perform.

Therefore, create a dax_zeroinit_range function that filesystems can
call from fallocate ZERO RANGE requests to reset the pmem contents to
zero and clear hardware media error flags, and hook it up to XFS.

Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
---
 fs/dax.c            |   72 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_file.c   |   20 ++++++++++++++
 include/linux/dax.h |    7 +++++
 3 files changed, 99 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index da41f9363568..fdd7b94f34f0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1742,3 +1742,75 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 	return dax_insert_pfn_mkwrite(vmf, pfn, order);
 }
 EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+static loff_t
+dax_zeroinit_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
+		struct iomap *iomap, struct iomap *srcmap)
+{
+	sector_t sector = iomap_sector(iomap, pos);
+	int ret;
+
+	if (!iomap->bdev)
+		return -ECANCELED;
+
+	/* Must be able to zero storage directly without fs intervention. */
+	if (iomap->flags & IOMAP_F_SHARED)
+		return -ECANCELED;
+	if (srcmap != iomap)
+		return -ECANCELED;
+
+	switch (iomap->type) {
+	case IOMAP_MAPPED:
+		ret = blkdev_issue_zeroout(iomap->bdev, sector,
+				length >> SECTOR_SHIFT, GFP_KERNEL, 0);
+		if (ret)
+			return ret;
+		fallthrough;
+	case IOMAP_UNWRITTEN:
+		return length;
+	}
+
+	/* Reject holes, inline data, or delalloc extents. */
+	return -ECANCELED;
+}
+
+/*
+ * Initialize storage mapped to a DAX-mode file to a known value and ensure the
+ * media are ready to accept read and write commands.  This requires the use of
+ * the block layer's zeroout function (with zero-page fallback enabled) to
+ * write zeroes to a pmem region and to reset any hardware media error state.
+ *
+ * The range arguments must be aligned to sector size.  The file must be backed
+ * by a block device.  The extents returned must not require copy on write (or
+ * any other mapping interventions from the filesystem) and must be contiguous.
+ * @done will be set to true if the reset succeeded.
+ */
+int
+dax_zeroinit_range(struct inode *inode, loff_t pos, loff_t len, bool *done,
+		const struct iomap_ops *ops)
+{
+	loff_t ret;
+
+	if (!IS_DAX(inode))
+		return -EINVAL;
+	if ((pos | len) & (SECTOR_SIZE - 1))
+		return -EINVAL;
+	if (pos + len > i_size_read(inode))
+		return -EINVAL;
+
+	while (len > 0) {
+		ret = iomap_apply(inode, pos, len, IOMAP_REPORT, ops, NULL,
+				dax_zeroinit_actor);
+		if (ret == -ECANCELED)
+			return 0;
+		if (ret < 0)
+			return ret;
+
+		pos += ret;
+		len -= ret;
+	}
+
+	*done = true;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_zeroinit_range);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index cc3cfb12df53..e77793820cf3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -956,6 +956,25 @@ xfs_file_fallocate(
 			goto out_unlock;
 	}
 
+	/*
+	 * If the file is in DAX mode, try to use a DAX-specific function to
+	 * zero the region.  We can fall back to punch-and-realloc if necessary.
+	 */
+	if ((mode & FALLOC_FL_ZERO_RANGE) && IS_DAX(inode)) {
+		bool did_zeroout = false;
+
+		trace_xfs_zero_file_space(ip);
+
+		error = dax_zeroinit_range(inode, offset, len, &did_zeroout,
+				&xfs_read_iomap_ops);
+		if (error == -EINVAL)
+			error = 0;
+		if (error)
+			goto out_unlock;
+		if (did_zeroout)
+			goto done;
+	}
+
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		error = xfs_free_file_space(ip, offset, len);
 		if (error)
@@ -1059,6 +1078,7 @@ xfs_file_fallocate(
 		}
 	}
 
+done:
 	if (file->f_flags & O_DSYNC)
 		flags |= XFS_PREALLOC_SYNC;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..df52d0ce0ee0 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -152,6 +152,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping);
 struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
+int dax_zeroinit_range(struct inode *inode, loff_t pos, loff_t len, bool *done,
+		const struct iomap_ops *ops);
 #else
 static inline bool bdev_dax_supported(struct block_device *bdev,
 		int blocksize)
@@ -201,6 +203,11 @@ static inline dax_entry_t dax_lock_page(struct page *page)
 static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
 {
 }
+static inline int dax_zeroinit_range(struct inode *inode, loff_t pos, loff_t len,
+		bool *done, const struct iomap_ops *ops)
+{
+	return 0;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_DAX)
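
For anyone who wants to try this by hand: once this lands, the whole
punch-and-pwrite dance collapses into a single fallocate call from
userspace.  A minimal sketch follows -- the program itself is not part
of this patch, the command-line handling is made up for illustration,
and the caller is responsible for passing a sector-aligned offset and
length on a file that lives on an FSDAX filesystem:

/* Sketch only, not part of this patch: zero (and un-poison) a
 * sector-aligned region of a DAX file in place via fallocate(2). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	long long off, len;
	int fd;

	if (argc != 4) {
		fprintf(stderr, "usage: %s file offset length\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror(argv[1]);
		return 1;
	}

	off = strtoll(argv[2], NULL, 0);
	len = strtoll(argv[3], NULL, 0);

	/* One call replaces the old punch-and-pwrite dance; the file
	 * keeps the same pmem extents and the poison is cleared. */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len) < 0) {
		perror("fallocate");
		return 1;
	}

	close(fd);
	return 0;
}

Note that if dax_zeroinit_range declines the range (a shared, hole,
inline, or delalloc extent makes the actor return -ECANCELED), it
returns 0 without setting @done, and xfs_file_fallocate falls through
to the regular zero-range path -- so the call above still succeeds; it
just loses the guarantee of getting the same pmem back.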