Re: XFS write cache flush policy

On Tue, Dec 18, 2012 at 01:26:24AM +0100, Matthias Schniedermeyer wrote:
> On 18.12.2012 10:24, Dave Chinner wrote:
> > On Mon, Dec 17, 2012 at 12:22:51AM +0100, Matthias Schniedermeyer wrote:
> > 
> > > Dec 17 00:13:28 localhost kernel: [  325.562642] lost page write due to I/O error on sdb1
> > > Dec 17 00:13:28 localhost kernel: [  325.562643] Buffer I/O error on device sdb1, logical block 2613285
> > > Dec 17 00:13:28 localhost kernel: [  325.562643] lost page write due to I/O error on sdb1
> > > Dec 17 00:13:28 localhost kernel: [  325.562644] Buffer I/O error on device sdb1, logical block 2613286
> > > Dec 17 00:13:28 localhost kernel: [  325.562644] lost page write due to I/O error on sdb1
> > > Dec 17 00:13:28 localhost kernel: [  325.562645] Buffer I/O error on device sdb1, logical block 2613287
> > > Dec 17 00:13:28 localhost kernel: [  325.562645] lost page write due to I/O error on sdb1
> > > Dec 17 00:13:28 localhost kernel: [  325.562646] Buffer I/O error on device sdb1, logical block 2613288
> > > Dec 17 00:13:28 localhost kernel: [  325.562646] lost page write due to I/O error on sdb1
> > > Dec 17 00:13:28 localhost kernel: [  325.647364] XFS (sdb1): metadata I/O error: block 0x6400022 ("xlog_iodone") error 5 numblks 64
> > > Dec 17 00:13:28 localhost kernel: [  325.648559] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1074 of file /xssd/usr_src/linux/fs/xfs/xfs_log.c.  Return address = 0xffffffff812a1e49
> > > Dec 17 00:13:28 localhost kernel: [  325.649804] XFS (sdb1): Log I/O Error Detected.  Shutting down filesystem
> > 
> > And there's why nothing got written to the log - you pulled the plug
> > before it was flushed. But that also implies it wasn't flushed for a
> > couple of minutes, which is not correct. Log flushes should be
> > occurring every 30s when the filesystem is only slightly busy like this.
> > 
> > Hmmmm, that rings a bell. I think I came across this when doing a
> > RHEL6 backport recently, and decided that nothing needed doing
> > upstream because there were already patches pending to fix it.
> > That is, there is a logic bug in xfssyncd_worker(), and that was
> > fixed by the xfssyncd-die-die-die patchset that is now in 3.8-rc1.
> 
> One 'bleeding edge' kernel, so new that it isn't even called 3.8-rc1. 
> ;-)
> 
> I would call that successful.

Ok, thank you for testing it :)
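
(As an aside, the 30s period mentioned above is the periodic xfssyncd
flush interval, which is tunable. A quick sketch of checking and
adjusting it, assuming the standard sysctl path is exposed on your
kernel:

    # default is 3000 centiseconds, i.e. a 30s periodic flush
    cat /proc/sys/fs/xfs/xfssyncd_centisecs
    # e.g. shorten it to 10s to narrow the window of unflushed data
    echo 1000 > /proc/sys/fs/xfs/xfssyncd_centisecs

Shortening it trades more frequent log I/O for a smaller data-loss
window after events like a pulled cable.)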

> ls -l /mnt/a
> total 10661888
> drwxr-xr-x 2 root root       105 Dec 18 00:55 .
> drwxr-xr-x 3 root root        14 Dec 18 00:53 ..
> -rw-r--r-- 1 root root 943718400 Dec 18 00:53 1
> -rw-r--r-- 1 root root 943718400 Dec 18 00:54 10
> -rw-r--r-- 1 root root 738263040 Dec 18 00:55 11
> -rw-r--r-- 1 root root 322990080 Dec 18 00:55 12
> -rw-r--r-- 1 root root 943718400 Dec 18 00:53 2
> -rw-r--r-- 1 root root 943718400 Dec 18 00:53 3
> -rw-r--r-- 1 root root 943718400 Dec 18 00:53 4
> -rw-r--r-- 1 root root 943718400 Dec 18 00:53 5
> -rw-r--r-- 1 root root 943718400 Dec 18 00:54 6
> -rw-r--r-- 1 root root 943718400 Dec 18 00:54 7
> -rw-r--r-- 1 root root 943718400 Dec 18 00:54 8
> -rw-r--r-- 1 root root 943718400 Dec 18 00:54 9
> 
> The 12th file was being written when I yanked out the cable. Strange
> that the 11th file isn't complete; I would expect correct ordering of
> events. Is this normal/expected behaviour for XFS?

Yes, it is. The file size is not updated until the data is written,
and data from previous files is still being written when the next
file is created and the application starts to write to it. So you
get file size updates for previous files occurring while you are
creating new files, and the result is as you see above - the file
sizes reflect the data that is guaranteed to be written to disk...
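
(For anyone wanting to reproduce this, the listing above is consistent
with sequential 900MiB writes - a sketch of the workload, where the
mount point, file count and dd parameters are my assumptions from the
sizes shown, not the exact commands used:

    # write twelve 900MiB files back to back (900 * 1048576 = 943718400)
    for i in $(seq 1 12); do
        dd if=/dev/zero of=/mnt/a/$i bs=1M count=900
    done
    # yank the cable mid-write on file 12, remount, then compare ls -l

The short sizes on files 11 and 12 after the remount are exactly the
delayed size updates described above.)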

> The 'ls -l /mnt/a' before I yanked out the cable was this:
> /mnt/a:
> total 11186176
> -rw-r--r-- 1 root root 943718400 Dec 18 00:53 1
> -rw-r--r-- 1 root root 943718400 Dec 18 00:54 10
> -rw-r--r-- 1 root root 943718400 Dec 18 00:55 11
> -rw-r--r-- 1 root root 665326592 Dec 18 00:55 12
> -rw-r--r-- 1 root root 943718400 Dec 18 00:53 2
> -rw-r--r-- 1 root root 943718400 Dec 18 00:53 3
> -rw-r--r-- 1 root root 943718400 Dec 18 00:53 4
> -rw-r--r-- 1 root root 943718400 Dec 18 00:53 5
> -rw-r--r-- 1 root root 943718400 Dec 18 00:54 6
> -rw-r--r-- 1 root root 943718400 Dec 18 00:54 7
> -rw-r--r-- 1 root root 943718400 Dec 18 00:54 8
> -rw-r--r-- 1 root root 943718400 Dec 18 00:54 9

Yup, makes sense. The in-memory file sizes indicate that there was
about 500MB of dirty data in memory that was lost when you pulled the
plug. That doesn't seem unreasonable for a periodic 30s update.
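
(Working that number out from the two listings - a quick sanity check
of the ~500MB figure:

    # in-memory size minus on-disk size for files 11 and 12
    echo $(( (943718400 - 738263040) + (665326592 - 322990080) ))
    # => 547791872 bytes, about 522MiB

which is about what you'd expect to accumulate as dirty data inside a
single 30s writeback window at this write rate.)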

> > So, two things: firstly, can you check that the oldest file that is
> > lost on 3.8-rc1 is less than 30s old, and secondly, whether this
> 
> I would say: A definite yes

Ok, I'll send a patch for review for the stable kernels.

> Looks the same to me, only different timing with regard to the 30s
> windows.

*nod*

> Do you need any logs or other things?

No, the change of behaviour is sufficient to identify this as the
source of your problem. Thank you for sticking at this long enough
for me to isolate the bug!

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs

