On Sun, Dec 06, 2009 at 03:42:59PM +0100, Andrea Gelmini wrote:
> Hi all,
> I need some advice about this regression...
> 
> Short version:
> With 2.6.32 my / partition (ext4 without a journal) seems to be working
> in synchronous mode all the time.  There is no write caching, so the HD
> is always busy.  It works as usual with 2.6.31.
> 
> Long version:
> To replicate the problem I've used the attached bash script (test.sh).
> With 2.6.31.6 the disk does no work at all and dd reports incredible
> speeds (~400 MB/s), of course.
> With 2.6.32 the HD is always working, and dd reports realistic speeds
> (~30 MB/s).
> 
> # bad: [5534fb5bb35a62a94e0bd1fa2421f7fb6e894f10] ext4: Fix the alloc on close after a truncate hueristic
> git bisect bad 5534fb5bb35a62a94e0bd1fa2421f7fb6e894f10

Yeah, this was actually deliberate.  The problem is that there are badly
written application programs that update files in place via the
following pattern:

	1. fd = open("file", O_RDONLY);
	2. read(fd, buf, bufsize);	// read in the file
	3. close(fd);

	// Let the user edit the file
	// Now the user requests the file be saved out to disk

	4. fd = open("file", O_WRONLY | O_TRUNC);
	5. write(fd, buf, bufsize);
	6. close(fd);

The problem is what happens if the system crashes between step 4 and
step 5?  Especially if "file" holds the research data for the user's
Ph.D. thesis, which he has spent 10 years collecting but has never
bothered to back up?  (Well, one could argue that the grad student
doesn't *deserve* a Ph.D., but maybe it's a Ph.D. in English
Literature. :-)

So the correct way for an editor to write precious files is as follows:

	4. fd = open("file.new", O_WRONLY | O_CREAT | O_TRUNC, 0666);
	5. err = write(fd, buf, bufsize);
	   // ... and check the error return from write()
	6. err = fsync(fd);
	   // ... and check the error return from fsync()
	7. err = close(fd);
	   // ... and check the error return from close()
	8. rename("file.new", "file");

The problem is made especially bad by delayed allocation: the new data
blocks may not get written for 1-2 minutes, but the truncation caused by
opening the file with O_TRUNC will reach the file system much sooner
than that.

So because there are a lot of sucky applications out there, and
application writers massively outnumber file system developers, we added
this heuristic to force an implied fsync() on close() if the file
descriptor was used to truncate data blocks, either via O_TRUNC or an
explicit call to the ftruncate(2) system call.

So your test script exercises this heuristic:

	for f in $(seq 5)
	do
		dd if=/dev/zero of=test.dd bs=100M count=1
	done

If you change it to be as follows:

	for f in $(seq 5)
	do
		rm -f test.dd
		dd if=/dev/zero of=test.dd bs=100M count=1
	done

it will avoid triggering the heuristic.  Or, you can suppress the
heuristic via the mount option "noauto_da_alloc".  Note that if you do
this, and you edit a file using a buggy application that doesn't use
fsync(), you may end up losing data on a crash.

I'm surprised that you are seeing this situation in actual practice (as
opposed to in a test script).  Are you regularly overwriting huge files
via truncate(2) or open with O_TRUNC?  And are you doing this assuming
that you really don't care about the previous contents of the file after
a crash?  Most of the time the files that get edited this way tend to be
small files.
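
(As an aside, the replace-via-rename sequence above might look like the
following as a complete C program.  This is only a sketch; the file
names, buffer contents, and simplified error handling are illustrative,
not taken from any particular application:)

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* Write buf to "path" by writing a temporary file, fsync()ing it,
	 * and then renaming it over the old file.  The "%s.new" temp name
	 * and the buffer in main() are illustrative only. */
	static int save_file(const char *path, const char *buf, size_t len)
	{
		char tmp[4096];
		int fd;

		snprintf(tmp, sizeof(tmp), "%s.new", path);

		fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
		if (fd < 0)
			return -1;

		/* Treat a short write as a failure, for simplicity. */
		if (write(fd, buf, len) != (ssize_t) len)
			goto fail;

		/* Make sure the new data is on stable storage ... */
		if (fsync(fd) < 0)
			goto fail;
		if (close(fd) < 0) {
			fd = -1;
			goto fail;
		}

		/* ... before replacing the old file. */
		return rename(tmp, path);

	fail:
		if (fd >= 0)
			close(fd);
		unlink(tmp);
		return -1;
	}

	int main(void)
	{
		const char *buf = "new file contents\n";

		if (save_file("file", buf, strlen(buf)) < 0) {
			perror("save_file");
			return 1;
		}
		return 0;
	}

The important part is the fsync() before the rename(); without it, the
rename can reach the disk before the new file's data does, and a crash
in that window can leave "file" pointing at an empty or partially
written file.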
(For example, KDE had a bug where *every* *single* *KDE* *dot* *file*
was getting rewritten all the time, and users were getting cranky that
after their buggy Nvidia proprietary binary drivers crashed their
system, all of the window positions that they had spent hours and hours
setting up had vanished.)

						- Ted