[PATCH v6 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size

"Pankaj Raghav (Samsung)" <kernel@xxxxxxxxxxxxxxxx> · Wed, 29 May 2024 15:45:05 +0200

From: Pankaj Raghav <p.raghav@xxxxxxxxxxx>

iomap_dio_zero() pads a fs block with zeroes if the direct IO size is
smaller than the fs block size. It implicitly assumes that the fs block
size <= page size, which holds for most filesystems at the moment.

If the block size > page size, len can exceed PAGE_SIZE, and
iomap_dio_zero() will send the contents of the memory following the zero
page to the underlying block device, causing FS corruption.

iomap is a generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.

Signed-off-by: Pankaj Raghav <p.raghav@xxxxxxxxxxx>
---

After discussing this a bit at LSFMM, it was clear that using a PMD-sized
zero folio might not be a good idea[0]. On platforms with a 64k base page
size in particular, the huge zero folio can be as large as 512M, just for
zeroing small block sizes in the direct IO path.
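
For reference, the 512M figure can be reproduced with a little arithmetic.
The sketch below is a userspace model, not kernel code: it assumes 8-byte
page-table entries (typical for 64-bit configurations), and
pmd_mapping_size() is a hypothetical helper name used only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: each page-table entry is 8 bytes wide. */
#define PTE_SIZE_BYTES 8ULL

/*
 * Hypothetical helper: bytes mapped by a single PMD entry, i.e. one full
 * last-level page table worth of base pages.
 */
static uint64_t pmd_mapping_size(uint64_t base_page_size)
{
	uint64_t entries_per_table;

	assert(base_page_size >= PTE_SIZE_BYTES);

	/* One page-table page holds this many entries... */
	entries_per_table = base_page_size / PTE_SIZE_BYTES;

	/* ...and each entry maps one base page. */
	return entries_per_table * base_page_size;
}
```

Under these assumptions, a 4k base page gives a 2M PMD mapping, while a
64k base page gives 8192 entries * 64k = 512M.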

The idea to use iomap_init() to allocate a 64k zero buffer was suggested by
Dave Chinner, as it gives a decent tradeoff between memory usage and
efficiency.
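
To illustrate what the fixed 64k buffer means on different base page
sizes, here is a rough userspace model of the order computation and of the
length check added to iomap_dio_zero(). model_get_order() and
zero_len_fits() are hypothetical names for this sketch; only ZERO_FSB_SIZE
comes from the patch, and the kernel's real get_order() is implemented
differently.

```c
#include <assert.h>

#define ZERO_FSB_SIZE 65536UL	/* same value as in the patch */

/*
 * Hypothetical model of get_order(): smallest order such that
 * (page_size << order) >= size. page_size must be a power of two.
 */
static unsigned int model_get_order(unsigned long size,
				    unsigned long page_size)
{
	unsigned int order = 0;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	while ((page_size << order) < size)
		order++;

	return order;
}

/*
 * Hypothetical model of the WARN_ON_ONCE() condition: returns 1 if a
 * zeroing request of len bytes fits in the preallocated zero buffer.
 */
static int zero_len_fits(unsigned long len)
{
	return len <= ZERO_FSB_SIZE;
}
```

In this model, a 4k base page needs an order-4 allocation (16 pages) for
the zero buffer, while a 64k base page needs only order 0 (a single page).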

This is a good enough solution for now, as moving beyond a 64k block size
in XFS might take a while. We can work on a more generic solution in the
future that offers zero folios of different sizes, going beyond 64k.

[0] https://lore.kernel.org/linux-fsdevel/ZkdcAsENj2mBHh91@xxxxxxxxxxxxxxxxxxxx/

 fs/internal.h          | 8 ++++++++
 fs/iomap/buffered-io.c | 5 +++++
 fs/iomap/direct-io.c   | 9 +++++++--
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 84f371193f74..18eedbb82c50 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -35,6 +35,14 @@ static inline void bdev_cache_init(void)
 int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
 		get_block_t *get_block, const struct iomap *iomap);
 
+/*
+ * iomap/buffered-io.c
+ */
+
+#define ZERO_FSB_SIZE (65536)
+#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
+extern struct page *zero_fs_block;
+
 /*
  * char_dev.c
  */
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index c5802a459334..2c0149c827cd 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -42,6 +42,7 @@ struct iomap_folio_state {
 };
 
 static struct bio_set iomap_ioend_bioset;
+struct page *zero_fs_block;
 
 static inline bool ifs_is_fully_uptodate(struct folio *folio,
 		struct iomap_folio_state *ifs)
@@ -1998,6 +1999,10 @@ EXPORT_SYMBOL_GPL(iomap_writepages);
 
 static int __init iomap_init(void)
 {
+	zero_fs_block = alloc_pages(GFP_KERNEL | __GFP_ZERO, ZERO_FSB_ORDER);
+	if (!zero_fs_block)
+		return -ENOMEM;
+
 	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
 			   offsetof(struct iomap_ioend, io_bio),
 			   BIOSET_NEED_BVECS);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..50c2bca8a347 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -236,17 +236,22 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 		loff_t pos, unsigned len)
 {
 	struct inode *inode = file_inode(dio->iocb->ki_filp);
-	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
+	/*
+	 * Max block size supported is 64k
+	 */
+	WARN_ON_ONCE(len > ZERO_FSB_SIZE);
+
 	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	__bio_add_page(bio, zero_fs_block, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
 
-- 
2.34.1