On 18.12.2012 10:24, Dave Chinner wrote:
> On Mon, Dec 17, 2012 at 12:22:51AM +0100, Matthias Schniedermeyer wrote:
>
> > Dec 17 00:13:28 localhost kernel: [ 325.562642] lost page write due to I/O error on sdb1
> > Dec 17 00:13:28 localhost kernel: [ 325.562643] Buffer I/O error on device sdb1, logical block 2613285
> > Dec 17 00:13:28 localhost kernel: [ 325.562643] lost page write due to I/O error on sdb1
> > Dec 17 00:13:28 localhost kernel: [ 325.562644] Buffer I/O error on device sdb1, logical block 2613286
> > Dec 17 00:13:28 localhost kernel: [ 325.562644] lost page write due to I/O error on sdb1
> > Dec 17 00:13:28 localhost kernel: [ 325.562645] Buffer I/O error on device sdb1, logical block 2613287
> > Dec 17 00:13:28 localhost kernel: [ 325.562645] lost page write due to I/O error on sdb1
> > Dec 17 00:13:28 localhost kernel: [ 325.562646] Buffer I/O error on device sdb1, logical block 2613288
> > Dec 17 00:13:28 localhost kernel: [ 325.562646] lost page write due to I/O error on sdb1
> > Dec 17 00:13:28 localhost kernel: [ 325.647364] XFS (sdb1): metadata I/O error: block 0x6400022 ("xlog_iodone") error 5 numblks 64
> > Dec 17 00:13:28 localhost kernel: [ 325.648559] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1074 of file /xssd/usr_src/linux/fs/xfs/xfs_log.c. Return address = 0xffffffff812a1e49
> > Dec 17 00:13:28 localhost kernel: [ 325.649804] XFS (sdb1): Log I/O Error Detected. Shutting down filesystem
>
> And there's why nothing got written to the log - you pulled the plug
> before it was flushed. But that also implies it wasn't flushed for a
> couple of minutes, which is not correct. They should be occurring
> every 30s when the filesystem is only slightly busy like this.
>
> Hmmmm, that rings a bell. I think I came across this when doing a
> RHEL6 backport recently, and decided that nothing needed doing
> upstream because there were already patches pending to fix it.
> That is, there is a logic bug in xfssyncd_worker(), and that was
> fixed by the xfssyncd-die-die-die patchset that is now in 3.8-rc1.

One 'bleeding edge' kernel later, so new that it isn't even called 3.8-rc1 yet. ;-)

I would call that successful:

ls -l /mnt/a
total 10661888
drwxr-xr-x 2 root root       105 Dec 18 00:55 .
drwxr-xr-x 3 root root        14 Dec 18 00:53 ..
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 1
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 10
-rw-r--r-- 1 root root 738263040 Dec 18 00:55 11
-rw-r--r-- 1 root root 322990080 Dec 18 00:55 12
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 2
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 3
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 4
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 5
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 6
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 7
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 8
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 9

The 12th file was being written when I yanked out the cable. It is
strange that the 11th file isn't complete; I would have expected the
events to be ordered correctly. Is this normal/expected behaviour for
XFS?
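For reference, the writer workload amounts to sequentially creating a
dozen ~900 MB files with plain buffered writes and no fsync(), so
whatever survives the cable pull is purely down to the filesystem's own
periodic flushing. A minimal C sketch of that kind of load (file names,
chunk size and the /mnt/a path are illustrative here, not the exact tool
I used):

/* Sequential buffered-write workload: create numbered ~900 MB files
 * with no fsync(), so persistence depends entirely on the filesystem's
 * periodic log/data flushing. Sizes and paths are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE  (900L * 1024 * 1024)  /* 943718400 bytes, as in the listing */
#define CHUNK_SIZE (1024 * 1024)
#define NUM_FILES  12

int main(void)
{
	char *buf = malloc(CHUNK_SIZE);
	if (!buf)
		return 1;
	memset(buf, 'x', CHUNK_SIZE);

	for (int i = 1; i <= NUM_FILES; i++) {
		char path[64];
		snprintf(path, sizeof(path), "/mnt/a/%d", i);

		int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		for (long written = 0; written < FILE_SIZE; written += CHUNK_SIZE) {
			if (write(fd, buf, CHUNK_SIZE) != CHUNK_SIZE) {
				perror("write");
				return 1;
			}
		}
		/* Deliberately no fsync(): the point is to see what XFS
		 * persists on its own before the device disappears. */
		close(fd);
	}
	free(buf);
	return 0;
}

At the write rates visible in the timestamps, a working 30-second
background flush should bound the loss to roughly the last 30 seconds
of writes, which is about what the listing above shows.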
The 'ls -l /mnt/a' before I yanked out the cable was this:

/mnt/a:
total 11186176
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 1
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 10
-rw-r--r-- 1 root root 943718400 Dec 18 00:55 11
-rw-r--r-- 1 root root 665326592 Dec 18 00:55 12
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 2
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 3
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 4
-rw-r--r-- 1 root root 943718400 Dec 18 00:53 5
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 6
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 7
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 8
-rw-r--r-- 1 root root 943718400 Dec 18 00:54 9

> So, two things: firstly, can you check that the oldest file that is
> lost on 3.8-rc1 is less than 30s old,

I would say: a definite yes.

> and secondly, whether this
> patch gives the same result as 3.8-rc1 on 3.6/3.7:

3.7.1 + patch:

/mnt/a:
total 8425472
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 1
-rw-r--r-- 1 root root  92282880 Dec 18 01:13 10
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 2
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 3
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 4
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 5
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 6
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 7
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 8
-rw-r--r-- 1 root root 507555840 Dec 18 01:13 9

That looks the same to me, only with different timing relative to the
30s windows. This was the listing just before the yanking:

/mnt/a:
total 11186176
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 1
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 10
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 11
-rw-r--r-- 1 root root 869729280 Dec 18 01:14 12
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 2
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 3
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 4
-rw-r--r-- 1 root root 943718400 Dec 18 01:12 5
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 6
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 7
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 8
-rw-r--r-- 1 root root 943718400 Dec 18 01:13 9

> diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
> index 9500caf..7bf85e8 100644
> --- a/fs/xfs/xfs_sync.c
> +++ b/fs/xfs/xfs_sync.c
> @@ -400,7 +400,7 @@ xfs_sync_worker(
>  	 * cancel_delayed_work_sync on this work queue before tearing down
>  	 * the ail and the log in xfs_log_unmount.
>  	 */
> -	if (!(mp->m_super->s_flags & MS_ACTIVE) &&
> +	if ((mp->m_super->s_flags & MS_ACTIVE) &&
>  	    !(mp->m_flags & XFS_MOUNT_RDONLY)) {
>  		/* dgc: errors ignored here */
>  		if (mp->m_super->s_writers.frozen == SB_UNFROZEN &&

Do you need any logs or other things?


-- 
Matthias
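PS: For anyone skimming the patch, my reading of what the one-character
change does (just how I understand the discussion above, not an
authoritative description of the code): the old test only let the
periodic sync worker do its work when the superblock was *not* marked
MS_ACTIVE, i.e. during mount/unmount, so on a normal read-write mount
the 30-second background log pushes never ran. Annotated, the patched
check reads roughly like this:

/* xfs_sync_worker(), with the patch applied: the periodic background
 * work (the log flushing discussed above) only runs on a fully
 * mounted, writable filesystem. The body of the branch is elided. */
if ((mp->m_super->s_flags & MS_ACTIVE) &&      /* mount has completed      */
    !(mp->m_flags & XFS_MOUNT_RDONLY)) {       /* and the fs is read-write */
	/*
	 * With the old "!(... & MS_ACTIVE)" test this branch was skipped
	 * for the whole life of a normal read-write mount, so the
	 * 30-second background log pushes never happened and minutes of
	 * buffered data could be lost when the device disappeared.
	 */
}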