On Wed, Dec 19, 2012 at 02:04:45AM +0100, Matthias Schniedermeyer wrote: > On 19.12.2012 07:29, Dave Chinner wrote: > > On Tue, Dec 18, 2012 at 01:34:38AM +0100, Matthias Schniedermeyer wrote: > > > On 18.12.2012 10:24, Dave Chinner wrote: > > > > > > > > diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c > > > > index 9500caf..7bf85e8 100644 > > > > --- a/fs/xfs/xfs_sync.c > > > > +++ b/fs/xfs/xfs_sync.c > > > > @@ -400,7 +400,7 @@ xfs_sync_worker( > > > > * cancel_delayed_work_sync on this work queue before tearing down > > > > * the ail and the log in xfs_log_unmount. > > > > */ > > > > - if (!(mp->m_super->s_flags & MS_ACTIVE) && > > > > + if ((mp->m_super->s_flags & MS_ACTIVE) && > > > > !(mp->m_flags & XFS_MOUNT_RDONLY)) { > > > > /* dgc: errors ignored here */ > > > > if (mp->m_super->s_writers.frozen == SB_UNFROZEN && > > > > > > > > > > > > > > This also appears to fix the other case. > > > When the activity ceases sharply and the log is still not written after > > > minutes. > > > > > > After writing 10 files, waiting a minute, yanking ... all 10 files where > > > there. > > > So the OP-case MIGHT have been this same error. > > > But that's the amateuer talking again. > > > > I kinda deserved that, didn't I? ;) > > > > But now I understand the problem, I agree with you that the OP was > > probably seeing the same bug. I understand the cause, and can > > explain exactly how it would cause both sets of symptoms reported... > > Great. > > That means less lost time in the future, when a USB-disc "decides" to go > MIA. > The record was about 45 minutes lost, or something over 200GB just > going up in smoke (without the smoke). > > At least until such a bug is reintroduced in the future. > This Bug was introduced in 3.5(*) and existed up to 3.7 and if i > understand you correctly was fixed more or less by accident for 3.8. ..... > git blame dates the lines at 2012-05-21 Sure, and the review discussion of this very change identified the whole startup/shutdown code as needing to be rewritten to fix a systemic problems that we'd been patching repeatedly: http://oss.sgi.com/archives/xfs/2012-06/msg00064.html IOWs, while the bug was introduced by accident due to the code being nasty and difficult to follow, it wasn't fixed by accident. It was fixed as a result of review, reflection and redesign - three fundamental properties of the iterative engineering process we use here... > I'd say there is definitely something amiss in the test-suite, this is > basic functionality that appears untested to me. (I don't know what the > test-suite contains, only that it exists) This failure behaviour is definitely tested - the tests that were triggering all the unmount problems that we were trying to fix were the ones that simulate filesystem shutdowns and check that the data is still there after the filesystem is remounted. i.e. very similar tests to what you are running here. The thing is, no matter how much testing we do, there are always cases that will elude us. In this case, it's the fact that you have to slowly create files and write lots of data to trigger the problem. If you create enough files quickly enough, then the log gets written automatically, and the problem you are seeing is not exposed. Many of these tests were written before delayed logging existed, so file creations were comparitively slow compared to now, so the balance of tests changes as we change the kernel code. That is, a test that once may have exposed this problem due to the timing of operations may not expose it now because the timing is different. Hence what we once tested may not get exercised effectively now. Once we identify issues like this, we generally write new tests to reproduce and cover the issue... That's the reality of the situation - testing is never going to be 100% reliable, even if we have tests that specifically cover behaviour like this.... > At least i'd count a dropped connection or power failure (The only > difference is that in the latter case the cache MAY get dropped, > otherwise i'd say both cases are basically the same) among the basic > functionality that should be assured by a journaling fileystem. A journalling filesystem doesn't guarantee that you won't lose any data on crash, power fail or permanent IO errors. All journalling guarantees is that the filesystem is *consistent* after recovery. i.e. you don't have to run xfs_repair after such a failure to ensure it is not corrupted. In all your testing, you have not seen the filesystem become corrupted, so the journalling has fulfilled it's guarantee. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs