Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64

Andreas Dilger <adilger@xxxxxxx> · Tue, 01 Dec 2009 15:02:04 -0700

On 2009-12-01, at 09:03, Jan Kara wrote:
On Tue 01-12-09 15:35:59, Jan Kara wrote:
On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
On Mon, 2009-11-30 at 19:48 -0500, James Y Knight wrote:
On Nov 30, 2009, at 3:55 PM, James Y Knight wrote:
This test case is distilled from an actual application which  
doesn't even intentionally use writev: it just uses C++'s  
ofstream class to write data to a file. Unfortunately, that  
class smart and uses writev under the covers. Unfortunately, I  
guess nobody ever tests linux writev behavior, since it's broken  
_so_much_of_the_time_. I really am quite astounded to see such a  
bad track record for such a fundamental core system call....

I suspect an excellent way of exposing problems with the writev()  
interface would be to wire it into fsx, which is commonly run as a  
stress test for Linux.  I don't know if it would have caught this  
case, but it definitely couldn't hurt to get more testing cycles for it.

 Ext4 also has this problem but delayed allocation mitigates the  
effect to an error in accounting of blocks reserved for delayed  
allocation and thus under normal circumstances nothing bad happens.

It looks like ext4 might still hit this problem, if delalloc is  
disabled.  Could you please submit a similar patch for ext4 also.

 The patch below fixes the issue for me...

									Honza

From 1b2ad411dd86afbfdb3c5b0f913230e9f1f0b858 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@xxxxxxx>
Date: Tue, 1 Dec 2009 16:53:06 +0100
Subject: [PATCH] ext3: Fix data / filesystem corruption when write  
fails to copy data

When ext3_write_begin fails after allocating some blocks or  
generic_perform_write fails to copy data to write, we truncate  
blocks already instantiated beyond i_size. Although these blocks  
were never inside i_size, we have to truncate pagecache of these  
blocks so that corresponding buffers get unmapped. Otherwise  
subsequent __block_prepare_write (called because we are retrying the  
write) will find the buffers mapped, not call ->get_block, and thus  
the page will be backed by already freed blocks leading to  
filesystem and data corruption.

Signed-off-by: Jan Kara <jack@xxxxxxx>
---
fs/ext3/inode.c |   18 ++++++++++++++----
1 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 354ed3b..f9d6937 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1151,6 +1151,16 @@ static int  
do_journal_get_write_access(handle_t *handle,
	return ext3_journal_get_write_access(handle, bh);
}

+/*
+ * Truncate blocks that were not used by write. We have to truncate  
the
+ * pagecache as well so that corresponding buffers get properly  
unmapped.
+ */
+static void ext3_truncate_failed_write(struct inode *inode)
+{
+	truncate_inode_pages(inode->i_mapping, inode->i_size);
+	ext3_truncate(inode);
+}
+
static int ext3_write_begin(struct file *file, struct address_space  
*mapping,
				loff_t pos, unsigned len, unsigned flags,
				struct page **pagep, void **fsdata)
@@ -1209,7 +1219,7 @@ write_begin_failed:
		unlock_page(page);
		page_cache_release(page);
		if (pos + len > inode->i_size)
-			ext3_truncate(inode);
+			ext3_truncate_failed_write(inode);
	}
	if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
		goto retry;
@@ -1304,7 +1314,7 @@ static int ext3_ordered_write_end(struct file  
*file,
	page_cache_release(page);

	if (pos + len > inode->i_size)
-		ext3_truncate(inode);
+		ext3_truncate_failed_write(inode);
	return ret ? ret : copied;
}

@@ -1330,7 +1340,7 @@ static int ext3_writeback_write_end(struct  
file *file,
	page_cache_release(page);

	if (pos + len > inode->i_size)
-		ext3_truncate(inode);
+		ext3_truncate_failed_write(inode);
	return ret ? ret : copied;
}

@@ -1383,7 +1393,7 @@ static int ext3_journalled_write_end(struct  
file *file,
	page_cache_release(page);

	if (pos + len > inode->i_size)
-		ext3_truncate(inode);
+		ext3_truncate_failed_write(inode);
	return ret ? ret : copied;
}

--
1.6.4.2

--
To unsubscribe from this list: send the line "unsubscribe linux- 
ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html