Re: Ext4 without journal write cache problem...


On Sun, Dec 06, 2009 at 03:42:59PM +0100, Andrea Gelmini wrote:
> Hi all,
>    I need some advice about this regression...
> 
>    Short version:
>    With 2.6.32 my / partition (ext4 without journal) seems to always
> work in synchronous mode. No write caching, so the HD is always
> working. It works as usual with 2.6.31.
> 
>    Long version:
>    To replicate the problem I've used the attached bash script (test.sh).
>    With 2.6.31.6 the disk does no work at all, and dd reports
> incredibly high speeds (~400 MB/s), of course.
>    With 2.6.32 the HD is always working, and dd reports realistic
> speeds (~30 MB/s).

> # bad: [5534fb5bb35a62a94e0bd1fa2421f7fb6e894f10] ext4: Fix the alloc on close after a truncate hueristic
> git bisect bad 5534fb5bb35a62a94e0bd1fa2421f7fb6e894f10

Yeah, this was actually deliberate.  The problem is that there are
badly written application programs that update files in place via the
following pattern:

1.  fd = open("file", O_RDONLY);
2.  read(fd, buf, bufsize);	// read in the file
3.  close(fd);
			// Let the user edit the file
			// Now the user requests the file be saved out to disk
4.  fd = open("file", O_WRONLY | O_TRUNC);
5.  write(fd, buf, bufsize);
6.  close(fd);

The problem is what happens if the system crashes between step 4 and
step 5?  Especially if "file" holds the research data for the user's
Ph.D. thesis, which he has spent 10 years collecting but never
bothered to back up?  (Well, one could argue that the grad
student doesn't *deserve* a Ph.D., but maybe it's a Ph.D. in English
Literature.  :-)

So the correct way for an editor to write precious files is as
follows:

4.  fd = open("file.new", O_WRONLY | O_TRUNC);
5.  err = write(fd, buf, bufsize); // ... and check error return from write()
6.  err = fsync(fd);		   // ... and check error return from fsync()
6.  err = close(fd);		   // ... and check error return from close()
7.  rename("file.new", "file"); 
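
For reference, here is a minimal sketch of that same sequence as
actual C.  This is my own illustration, not code from this thread:
the save_file() helper, its temporary-file argument, and the error
handling are hypothetical; only the open/write/fsync/close/rename
ordering comes from the steps above.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper implementing the steps above: write the new
 * contents to a temporary file, force them to disk with fsync(), and
 * only then rename() over the old file, so a crash leaves either the
 * old contents or the new contents, never a zero-length file. */
static int save_file(const char *path, const char *tmp,
                     const void *buf, size_t len)
{
	int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t) len) {	/* step 5 */
		close(fd);
		unlink(tmp);
		return -1;
	}
	if (fsync(fd) < 0) {				/* step 6 */
		close(fd);
		unlink(tmp);
		return -1;
	}
	if (close(fd) < 0) {				/* step 7 */
		unlink(tmp);
		return -1;
	}
	return rename(tmp, path);			/* step 8 */
}

An editor would then call something like
save_file("file", "file.new", buf, bufsize) when the user asks for
the buffer to be saved.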

The problem is made even worse by delayed allocation, because with
delayed allocation the new data blocks may not get written for 1-2
minutes, but the truncation from opening the file with O_TRUNC will
get written to the file system much sooner than that.

So because there are a lot of sucky applications out there, and the
application writers tend to massively outnumber file system
developers, we have added this heuristic to force an implied fsync()
on close() if data blocks got truncated through that file descriptor,
either via O_TRUNC or via an explicit ftruncate(2) system call.
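
The ftruncate(2) variant isn't shown above, so here is a hypothetical
sketch of it (the file name and buffer contents are made up); the
heuristic treats it the same way as the O_TRUNC case:

#include <fcntl.h>
#include <unistd.h>

/* Hypothetical example of the other way the heuristic gets armed:
 * truncating an existing file with ftruncate(2) instead of O_TRUNC,
 * then rewriting it through the same descriptor.  The new data is
 * still subject to delayed allocation, but because existing blocks
 * were truncated via this descriptor, close() implies a flush. */
int main(void)
{
	static const char buf[] = "new contents\n";
	int fd = open("file", O_WRONLY);

	if (fd < 0)
		return 1;
	ftruncate(fd, 0);		 /* throw away the old contents   */
	write(fd, buf, sizeof(buf) - 1); /* delayed-allocation data       */
	close(fd);			 /* implied writeout happens here */
	return 0;
}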

So your test script exercises this heuristic:

for f in $(seq 5)
do
        dd if=/dev/zero of=test.dd bs=100M count=1
done

If you change it to be as follows:

for f in $(seq 5)
do
	rm -f test.dd
        dd if=/dev/zero of=test.dd bs=100M count=1
done

It will avoid triggering the heuristic, because each run of dd then
writes into a freshly created file and never truncates any existing
data blocks.

Or, you can suppress the heuristic via the mount option
"noauto_da_alloc".  Note that if you do this, and you edit a file
using a buggy application that doesn't use fsync(), you may end up
losing data on a crash.

I'm surprised that you are seeing this situation in actual practice
(as opposed to a test script).  Are you regularly overwriting huge
files via truncate(2) or open with O_TRUNC?  And if so, are you sure
you really don't care about the previous contents of the file after a
crash?  Most of the time the files that get edited this
way tend to be small files.  (For example KDE had a bug where *every*
*single* *KDE* *dot* *file* was getting rewritten all the time, and
users were getting cranky that after their buggy Nvidia proprietary
binary drivers crashed their system, all of the window positions that
they had spent hours and hours setting up had vanished.)

					- Ted
