Bad directories appearing in ext3 after upgrade 2.4.16 -> 2.4.18+cvs

On Tuesday May 21, sct@redhat.com wrote:
> On Tue, May 21, 2002 at 10:48:03AM +0100, Stephen C. Tweedie wrote:
>  
> > >  My ext3 filesystem is on a raid5 array, with the journal on
> > >  a separate raid1 array. (data=journal mode).
> > 
> > Not a configuration I've tested, but I can set up a box here with that
> > config and see if I can reproduce any problems.
> > 
> > >  I get quite a few messages in the logs which say:
> > >   
> > > May 21 14:20:06 glass kernel: raid5: multiple 1 requests for sector 7540536
> 
> I just realised, this implies that it's the main filesystem (ie, NOT
> the journal) which is seeing this.  That's something I can't explain,
> unless something is going wrong with the checkpoint/forget logic on
> file deletes.  Now, there _was_ a change in that code in the latest
> diffs, but that only affected multiple reuse of the same buffer_head,
> not different buffer_heads for the same disk block.

I'm guessing that it is related to checkpoint/forget...

Overnight I had another directory get corrupted.  Its content was a
filename (full path, about 40 chars) terminated by a '\n', with the
rest of the block zero-filled.

I have rebooted into a kernel much like the one I was running before
(2.4.17-pre2 with patches), which has the fix for the assertion
failure in do_get_write_access.

I also patched raid5 to report the device sector number
(bh->b_rsector) and to print out the first 32 bytes of the old and
new blocks.
25 minutes later I got two messages from raid5 about "multiple 1
requests" (the '1' means 'WRITE'; '0' would be 'READ').

I ran "icheck" for both blocks and found two files that had been
created 9 minutes earlier.

One file was a 30K binary file (Macromedia Flash data).  The old data
(i.e. the buffer_head that was written first) was:

Old: 30 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00.......

and the new was:

New: ffffff8b 5b ffffffd7 ffffffb1 5d ffffffc7 74 67 .....

My guess is that these are blocks of two different files that were
allocated to the same block on disc, because the first was deleted
before the second was created.  The block in the deleted file was
still written to disk by bdflush (or whatever), and then the block in
the new file was written.  Fortunately bdflush got them in the right
order.  Maybe in 2.4.18 the ordering isn't preserved so well.

The other file was a 2K text file.  The old and new blocks looked
like bits of a KDE configuration file:
May 22 10:07:03 glass kernel:  Old: 5b 44 65 73 6b 74 6f 70 73 5d 0a 4e 61 6d 65 5f 31 3d 0a 4e 61 6d 65 5f 32 3d 0a 4e 61 6d 65 5f
##############                      [  D  e  s  k  t  o  p
May 22 10:07:03 glass kernel:  New: 5b 41 70 70 6c 65 74 5f 31 5d 0a 43 6f 6e 66 69 67 46 69 6c 65 3d 6b 6d 69 6e 69 70 61 67 65 72
##############                      [  A  p  p  l  e  t

I cannot really tell if the correct one was written last.  I hope so.

While typing this, two more messages have popped up:
May 22 10:30:50 glass kernel: raid5: multiple 1 requests for sector 35042944(105129088)
May 22 10:30:50 glass kernel:  Old: 31 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
May 22 10:30:50 glass kernel:  New: 3b 72 75 6e 73 3a 31 2c 32 32 34 2c 33 32 3b 3b 00 53 63 68 65 64 75 6c 65 72 49 6e 66 6f 3a 74
May 22 10:35:35 glass kernel: raid5: multiple 1 requests for sector 34081296(102243856)
May 22 10:35:35 glass kernel:  Old: 31 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
May 22 10:35:35 glass kernel:  New: 00 74 00 65 00 72 00 6e 00 65 00 74 00 73 00 65 00 61 00 72 00 63 00 68 00 08 00 00 00 04 00 00

It is interesting that in three cases out of four, the "old" buffer
appeared to be from a small file containing a single line, and in
fact a single digit.  It is fairly easy to believe that such a file
could be deliberately short-lived.

So: when a file with dirty data is deleted, is there any chance that
that dirty data will still be written to disc, even after the disc
blocks have been allocated to a different file?

> 
> I'm currently running fs stress testing on a 3-disk soft raid5
> data=journal filesystem, with the journal on an external nvram disk,
> to see if I can reproduce this.

If I am right, then a load that
  - creates lots of files
  - deletes them
  - creates lots more files
  - syncs

would be most likely to cause a problem, though a simple version of
that (sketched below) didn't work for me.... maybe I need to force a
journal checkpoint rather than a sync...
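
For concreteness, a minimal sketch of such a load (an illustration of
the pattern only; the file count, names, and sizes are arbitrary):

/* Sketch of the create/delete/create/sync load described above.
 * Writes many tiny one-line files (like the single-digit files seen
 * in the "Old" dumps), deletes them, creates a second batch over the
 * freed blocks, then syncs. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#define FILES 1000

static void make_files(const char *prefix)
{
	char name[64], buf[16];
	int i, fd;

	for (i = 0; i < FILES; i++) {
		snprintf(name, sizeof(name), "%s-%d", prefix, i);
		fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0) {
			perror(name);
			exit(1);
		}
		snprintf(buf, sizeof(buf), "%d\n", i % 10);
		write(fd, buf, strlen(buf));
		close(fd);
	}
}

int main(void)
{
	char name[64];
	int i;

	make_files("a");		/* create lots of files */
	for (i = 0; i < FILES; i++) {	/* delete them */
		snprintf(name, sizeof(name), "a-%d", i);
		unlink(name);
	}
	make_files("b");		/* create lots more files */
	sync();				/* and sync */
	return 0;
}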

NeilBrown



> 
> Cheers,
>  Stephen




