(2011/10/06 10:34), Christian Kujau wrote:
> On Wed, 5 Oct 2011 at 20:03, Jan Kara wrote:
>>> With Miklos' patches applied to -rc5, this happend again just now :-(
>>>
>> Thanks for careful testing! Hmm, since you are able to reproduce on ppc
>> but not on x86 there might be some memory ordering bug in Miklos' patches
>> or it's simply because of different timing. Miklos, care to debug this
>> further?
>
> Just to be clear: I'm still not entirely sure how to reproduce this at
> will. I *assumed* that the daily remount-rw-and-ro-again routine that left
> some inodes in limbo and eventually lead to those "unprocessed orphan
> inodes". With that in mind I tried to reproduce this with the help of a
> test-script (test-remount.sh, [0]) - but the message did not occur while
> the script was running.
>
> I've ran the script again today on the said powerpc machine on a
> loop-mounted 500MB ext4 partition. But even after 100 iterations no
> such message occured.
>
> So maybe it's caused by something else or my test-script just doesn't get
> the scenario right and there's something subtle to this whole
> remounting-business I haven't figured out yet, leading to those orphan
> inodes.
>
> I'm at 3.1.0-rc9 now and will wait until the errors occur again.
>
> Christian.
>
> [0] nerdbynature.de/bits/3.1-rc4/ext4/

With Miklos' patches applied to -rc8, I could trigger the message
"Couldn't remount RDWR because of unprocessed orphan inode list"
on my x86_64 machine with my reproducer.

This happens because the actual removal starts outside the range between
mnt_want_write() and mnt_drop_write(), even though do_unlinkat() and
do_rmdir() call mnt_want_write() and mnt_drop_write() to prevent the
filesystem from being remounted read-only.
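As a rough userspace illustration of why the removal can happen later than
the unlink call itself: unlinking a file only removes its name, and the inode
is freed when the last reference to it is dropped. This is only an analogy
for the window described above, not the kernel path; the function name
demo_deferred_removal and the use of a mktemp file are assumptions made for
this sketch.

```shell
#!/bin/bash
# Userspace illustration only: unlink removes the name, while the inode
# itself is freed later, when the last reference to it goes away.

demo_deferred_removal() {
        local tmp
        tmp=$(mktemp) || return 1
        echo "still readable" > "$tmp"

        exec 3< "$tmp"                  # hold a reference to the inode
        rm -f "$tmp"                    # unlink: the name is gone

        [ ! -e "$tmp" ] || return 1     # the path no longer exists
        cat <&3                         # data is still reachable via fd 3
        exec 3<&-                       # last reference dropped: the inode
                                        # can only be freed from here on
}

demo_deferred_removal
```

If the filesystem became read-only between the rm and the final close, the
inode could not be cleaned up at that point, which is the kind of situation
that leaves entries on the orphan inode list.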
My reproducer is as follows:
-----------------------------------------------------------------------------
[1] go.sh
#!/bin/bash
# bash is required for the (( )) arithmetic loops below

dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1000k > /dev/null 2>&1
/sbin/mkfs.ext4 -Fq /tmp/img
mount -o loop /tmp/img /mnt
./writer.sh /mnt &
LOOP=1000000000
# alternate between read-only and read-write remounts once per second
for ((i=0; i<LOOP; i++)); do
        echo "[$i]"
        if ((i%2 == 0)); then
                mount -o ro,remount,loop /mnt
        else
                mount -o rw,remount,loop /mnt
        fi
        sleep 1
done

[2] writer.sh
#!/bin/bash
# bash is required for the (( )) arithmetic loops below

dir=$1
for ((i=0;i<10000000;i++)); do
        # create 64 files in the background ...
        for ((j=0;j<64;j++)); do
                filename="$dir/file$((i*64 + j))"
                dd if=/dev/zero of="$filename" bs=1k count=8 > /dev/null 2>&1 &
        done
        # ... and remove them concurrently
        for ((j=0;j<64;j++)); do
                filename="$dir/file$((i*64 + j))"
                rm -f "$filename" > /dev/null 2>&1 &
        done
        wait
        if ((i%100 == 0 && i > 0)); then
                rm -f "$dir"/file*
        fi
done
exit

[step to run]
# ./go.sh
-----------------------------------------------------------------------------

Therefore, we need a mechanism that prevents a filesystem from being
remounted read-only until the actual removal finishes.

------------------------------------------------------------------------
[example fix]
do_unlinkat() {
...
        mnt_want_write()
        vfs_unlink()
        if (inode && inode->i_nlink == 0) {                     //
                atomic_inc(&inode->i_sb->s_unlink_count);       //
                inode->i_deleting++;                            //
        }                                                       //
        mnt_drop_write()
...
        iput()  // usually, the actual removal starts here
...
}

destroy_inode() {
...
        if (inode->i_deleting)
                atomic_dec(&inode->i_sb->s_unlink_count);
...
}

do_remount_sb() {
...
        else if (!fs_may_remount_ro(sb) || atomic_read(&sb->s_unlink_count))
                return -EBUSY;
...
}
------------------------------------------------------------------------

In addition, my reproducer also triggers the following message:
"Ext4-fs (xxx): ext4_da_writepages: jbd2_start: xxx pages, ino xx: err -30"
This is because ext4_remount() cannot guarantee that all ext4 filesystem
data has been written out, due to the delayed allocation feature.
(ext4_da_writepages() fails after ext4_remount() sets MS_RDONLY in
sb->s_flags.)
Therefore, we must write all delayed allocation buffers out before
ext4_remount() sets MS_RDONLY in sb->s_flags.

------------------------------------------------------------------------
[example fix] // This requires Miklos' patches.
ext4_remount() {
...
        if (*flags & MS_RDONLY) {
                err = dquot_suspend(sb, -1);
                if (err < 0)
                        goto restore_opts;

                sync_filesystem(sb); // write all delayed buffers out
                sb->s_flags |= MS_RDONLY;
...
}
------------------------------------------------------------------------

Best Regards,
Toshiyuki Okajima