Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2009-12-01, at 09:03, Jan Kara wrote:
On Tue 01-12-09 15:35:59, Jan Kara wrote:
On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
This test case is distilled from an actual application which doesn't even intentionally use writev: it just uses C++'s ofstream class to write data to a file. Unfortunately, that class smart and uses writev under the covers. Unfortunately, I guess nobody ever tests linux writev behavior, since it's broken _so_much_of_the_time_. I really am quite astounded to see such a bad track record for such a fundamental core system call....

I suspect an excellent way of exposing problems with the writev() interface would be to wire it into fsx, which is commonly run as a stress test for Linux. I don't know if it would have caught this case, but it definitely couldn't hurt to get more testing cycles for it.

Ext4 also has this problem but delayed allocation mitigates the effect to an error in accounting of blocks reserved for delayed allocation and thus under normal circumstances nothing bad happens.

It looks like ext4 might still hit this problem, if delalloc is disabled. Could you please submit a similar patch for ext4 also.

 The patch below fixes the issue for me...

									Honza

From 1b2ad411dd86afbfdb3c5b0f913230e9f1f0b858 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@xxxxxxx>
Date: Tue, 1 Dec 2009 16:53:06 +0100
Subject: [PATCH] ext3: Fix data / filesystem corruption when write fails to copy data

When ext3_write_begin fails after allocating some blocks or generic_perform_write fails to copy data to write, we truncate blocks already instantiated beyond i_size. Although these blocks were never inside i_size, we have to truncate pagecache of these blocks so that corresponding buffers get unmapped. Otherwise subsequent __block_prepare_write (called because we are retrying the write) will find the buffers mapped, not call ->get_block, and thus the page will be backed by already freed blocks leading to filesystem and data corruption.

Signed-off-by: Jan Kara <jack@xxxxxxx>
---
fs/ext3/inode.c |   18 ++++++++++++++----
1 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 354ed3b..f9d6937 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1151,6 +1151,16 @@ static int do_journal_get_write_access(handle_t *handle,
	return ext3_journal_get_write_access(handle, bh);
}

+/*
+ * Truncate blocks that were not used by write. We have to truncate the + * pagecache as well so that corresponding buffers get properly unmapped.
+ */
+static void ext3_truncate_failed_write(struct inode *inode)
+{
+	truncate_inode_pages(inode->i_mapping, inode->i_size);
+	ext3_truncate(inode);
+}
+
static int ext3_write_begin(struct file *file, struct address_space *mapping,
				loff_t pos, unsigned len, unsigned flags,
				struct page **pagep, void **fsdata)
@@ -1209,7 +1219,7 @@ write_begin_failed:
		unlock_page(page);
		page_cache_release(page);
		if (pos + len > inode->i_size)
-			ext3_truncate(inode);
+			ext3_truncate_failed_write(inode);
	}
	if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
		goto retry;
@@ -1304,7 +1314,7 @@ static int ext3_ordered_write_end(struct file *file,
	page_cache_release(page);

	if (pos + len > inode->i_size)
-		ext3_truncate(inode);
+		ext3_truncate_failed_write(inode);
	return ret ? ret : copied;
}

@@ -1330,7 +1340,7 @@ static int ext3_writeback_write_end(struct file *file,
	page_cache_release(page);

	if (pos + len > inode->i_size)
-		ext3_truncate(inode);
+		ext3_truncate_failed_write(inode);
	return ret ? ret : copied;
}

@@ -1383,7 +1393,7 @@ static int ext3_journalled_write_end(struct file *file,
	page_cache_release(page);

	if (pos + len > inode->i_size)
-		ext3_truncate(inode);
+		ext3_truncate_failed_write(inode);
	return ret ? ret : copied;
}

--
1.6.4.2

--
To unsubscribe from this list: send the line "unsubscribe linux- ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Reiser Filesystem Development]     [Ceph FS]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite National Park]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Media]

  Powered by Linux